Multi-region consistency: have your cake and eat it too!
https://www.synadia.com/blog/multi-cluster-consistency-models
Abstract
Using stretch clusters and virtual streams to improve latency and availability in global NATS deployments. Includes a walk-through and associated repository as a practical example.
Introduction
As field CTO at Synadia I have had the chance to work with some of the most interesting customer use cases for deploying NATS globally over multiple locations. Whether it’s the need for the data to be closer to the users to meet latency requirements, or the need to be resilient to a disaster such as a site or cloud provider regional outage, or even for regulatory requirements, many companies are looking to deploy their applications over multiple availability zones, sites, regions or cloud providers. And when you step into these kinds of geographically distributed deployments you need to worry about the distribution, replication and consistency of your data, both for ‘reads’ and ‘writes’.
In case you’re not familiar with consistency models in distributed data stores, just know that one of the ways distributed data stores can be classified is by their consistency model: they can be either ‘eventually’ consistent or ‘immediately’ consistent. Both models have their advantages and drawbacks: immediately consistent systems can offer features such as distributed shared queuing for distributing messages between consumers, or ‘compare and set’ operations for concurrent access control, that are not possible with eventually consistent systems. On the other hand, eventually consistent systems can offer lower latency and better availability.
This blog post reviews the spectrum of options offered by NATS JetStream in terms of replication and consistency when deployed over multiple availability zones, sites, regions or cloud providers, and shows how you can have access to both eventual and immediate consistency at the same time.
What is described here applies just as well to a deployment over multiple data centers, multiple regions within the same cloud provider, or multiple cloud providers or any combination thereof, but for the sake of simplicity, we will use the term ‘regions’ (named ‘east’, ‘central’ and ‘west’ in the examples) in the rest of this blog post.
The concepts in this blog post are purely about high availability and local data access and storage, even in the face of disasters (entire regions going down or getting isolated from the other regions). If you need to extend NATS JetStream service to the edge and to places where network connectivity is not always guaranteed, for example vehicles or mobile devices connecting over cellular networks, you should look at using Leaf Nodes instead.
Clustered JetStream
Within a cluster of NATS servers, JetStream offers immediate consistency using a RAFT voting protocol between the servers that replicate a stream. When a client application receives the publication acknowledgement it is assured that the message has been safely replicated to (and persisted by) the majority of the servers.
The replicas for a particular stream are picked from the set of JetStream enabled servers in the cluster. So for example, if you have a cluster of 9 servers only 3 of them will be involved in the message storing and RAFT voting of an R3 (3 replicas) stream. NATS’s location transparency means that the client application can be connected to any server in the cluster and still be able to publish and consume from the stream.
JetStream allows you to control the placement of the stream replicas using placement tags. For example, you can enhance availability by placing your servers in different availability zones within the same region/data center, and then use stream placement tags to ensure that a stream’s replicas are not placed on two servers in the same availability zone. You can also adjust the replication degree up and down at any time without interrupting service to the stream, and even change the placement tags of the stream to move it to a different set of servers (also without interrupting service).
Drawing of a cluster of 3 servers spanning 3 availability zones in a cloud region.
Multi-Cluster JetStream
When you want to extend the JetStream system to multiple cloud providers/regions/data centers, you can use the JetStream Gateway feature to create a Super-Cluster. This feature allows you to connect clusters together such that you would have one cluster per cloud provider/region/data center. The location transparency of NATS and JetStream still applies in Super-Clusters: a client application can be connected to any server in any cluster and still transparently be able to publish and consume from the streams regardless of where the stream’s replicating servers are located.
Drawing of a Super-Cluster spanning 3 regions.
Latency of operations in a Super-Cluster
This location transparency of NATS Super-Clusters is however still subject to the laws of physics and network latencies: operations on a stream located in a different cluster will have higher latency than operations on a stream located in the same cluster as the client application.
Read operations
JetStream also has built-in mirroring or sourcing between streams: a stream can either mirror all the messages (or a subset, using subject-based filtering) from a single stream (in which case the message sequence numbers are preserved), or it can source from one or more streams (in which case the message sequence numbers are not preserved), for example to aggregate between streams. This mirroring/sourcing is done in a reliable ‘store and forward’ manner, meaning that sourcing/mirroring streams (i.e. the nodes replicating the streams) can be shut down or disconnected from the source/mirror for a period of time and will automatically catch up on any messages they may have missed.
Beyond controlling the placement of stream replicas within a cluster, placement tags also allow you to control the placement of streams and their replicas across clusters. You can specify which cluster a stream should be located in (e.g. a stream containing PII for European users can be set to be located in a cluster in the EU), and even change the placement tags of an existing stream to move it to a different cluster, without interrupting the service.
In this mode of deployment you can have a stream located in one regional cluster and create mirrors of this stream in other regional clusters. This is the classic way to scale and provide faster read access to client applications: they use the mirror stream in the cluster they are connected to (this happens automatically for KV get() operations). It comes at the expense of a certain amount of ‘incoherence’, which is unavoidable any time any kind of ‘cache’ (in this case the mirror stream) is used. This ‘eventual coherency’ is due to the fact that it takes a non-zero amount of time (typically very small, but possibly longer during network or hardware outages) for the mirrors to be updated with message additions or deletions in the stream being mirrored. It is sometimes conflated with ‘eventual consistency’, but technically it is not the same thing: the ‘writes’ happen only on the (immediately consistent) origin stream, so they are serialized and there is only one view of the stream at any given time, and the mirrors are eventually coherent with the origin stream. This is different from an eventually consistent system, where ‘writes’ can happen at the same time in different regions and the system has to deal with the fact that there can be multiple views of the data (e.g. in a different order) at any given time.
Write operations
Deploying mirrors helps scale and provides low latency for read operations. It does not, however, help scale or provide high availability between the regions when it comes to ‘writes’: the origin stream is located on a cluster that is in a single region. If that region goes down entirely, the other regions can still read from their mirrors of the stream, but client applications will not be able to write to the stream until the region comes back up.
Immediately Consistent Multi-region Stretch Clusters
When you need immediate consistency between regions, regardless of any particular region going down, you can still get it with JetStream thanks to its implementation of RAFT, which works even between regions. This is done by creating a ‘stretch’ cluster: a cluster whose nodes are all located in different regions, so that the cluster is ‘stretched’ between the regions. To create a stretch cluster you need at least 3 regions. You then add this stretch cluster to your existing Super-Cluster (one cluster per region) and use stream placement tags to create streams that are stored in the stretch cluster. Those stretched streams will be immediately consistent between regions, at the expense of much higher latency for synchronous operations on them. They will also remain available as long as only one of the regions goes down, and if you stretch to 5 regions with a stream replication factor of 5 then you can survive two regions going down. Note that much higher latency doesn’t necessarily mean much lower throughput (assuming there’s enough bandwidth), as long as your applications can leverage asynchronous publish operations.
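The availability figures quoted for 3 and 5 regions follow from majority-quorum arithmetic. The following is a minimal Python sketch of that arithmetic only, assuming one replica per region; it is not NATS code and the function names are made up for illustration.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas (here: regions, one replica each) that can be lost
    while a majority is still reachable."""
    return replicas - quorum(replicas)


for r in (3, 5):
    print(f"R{r}: quorum={quorum(r)}, tolerated failures={tolerated_failures(r)}")
```

With 3 replicas the quorum is 2, so one region can be lost; with 5 replicas the quorum is 3, so two regions can be lost.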
You can combine this with the mirroring/sourcing feature to create mirrors of the streams on the stretch cluster into the regional clusters in order to have low read latency, but the latency on write operations will always be proportional to the latency between the regions.
In practice, network conditions such as latency, packet loss, and bandwidth will dictate the limits of the applicability of a stretch cluster. If the RTT latencies between regions are high, operations will take much longer to complete and some client applications may start timing out waiting on synchronous calls. If the connectivity between the regions keeps changing, it could also temporarily affect the stream’s availability, since at least 2 out of the 3 nodes must be up and reachable for RAFT votes to succeed.
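To make the relationship between acknowledgement latency and throughput concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a single publisher, unlimited bandwidth and a fixed acknowledgement latency; the 50 ms latency and the window of 256 in-flight messages are hypothetical values chosen only for illustration.

```python
def max_sync_rate(ack_latency_s: float) -> float:
    """Upper bound on publishes/second when each publish waits for its ack."""
    return 1.0 / ack_latency_s


def max_async_rate(ack_latency_s: float, max_in_flight: int) -> float:
    """Upper bound when up to max_in_flight publishes may be awaiting
    their acks at the same time."""
    return max_in_flight / ack_latency_s


ack_latency = 0.050  # hypothetical cross-region ack latency: 50 ms
print(f"synchronous:  <= {max_sync_rate(ack_latency):.0f} msgs/s")
print(f"asynchronous: <= {max_async_rate(ack_latency, 256):.0f} msgs/s")
```

These are ceilings rather than predictions: the point is only that keeping many publishes in flight decouples throughput from per-message latency.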
Real-world example
With the proper choice of your ‘regions’ and provisioning of the network connection between them, you can still get pretty good latency of write and read operations on those stretched streams. If you are interested in the details of an actual production implementation of a stretch cluster spanning multiple cloud providers where the P99 write latency under load is < 20ms, you can view Derek Collison’s in-depth talk at the P99 conference.
Drawing of a Super-Cluster including a stretch cluster spanning 3 regions.
Eventually Consistent Multi-region ‘virtual stream’
While immediate consistency is the highest quality of service, there are many use cases where you know from the business or application logic that it is not needed. Basically, if you know it is not possible for the same ‘key’ to be modified from two different places (regions) at the same time (while there is an outage or network split) then you do not need immediate consistency.
Thanks to new JetStream features introduced in version 2.10, you can now create a ‘virtual stream’ that is globally distributed (i.e. to all regions). Client applications can transparently publish to and read from it with the low latency of interacting with a local (to the region) stream, regardless of the region they are connected to, while still retaining eventual consistency between the regions. The single caveat is that global ordering of the messages on that virtual stream is not guaranteed.
I say the stream is ‘virtual’ because unbeknownst to the client applications they are interacting with a number of streams (two per region) that source from each other.
How it works
At a very high level, in each region there is a ‘write stream’ and a ‘read stream’, and the read streams source from the write streams.
The client applications publish to the write stream and read from the read stream of the region they are connected to. This happens transparently for the application thanks to the Core NATS subject mapping and transformation feature, which (as of 2.10) can also be cluster-scoped.
While this is conceptually very simple, the actual implementation is a little bit complicated by the fact that Core NATS messages flow freely between the clusters in a Super-Cluster (and between Leaf Nodes unless some kind of filtering is applied at the authorization level), combined with the fact that you can not have more than one stream listening on the same subject. Also, you can not create ‘loops’ in the sourcing between streams (i.e. stream A sources from stream B and stream B sources from stream A).
So how is this possible? By using a number of the new features introduced in NATS version 2.10:
- The introduction of subject mapping and transformation features at the stream level (i.e. as part of the stream definition, as opposed to the Core NATS account level).
- The extension of the existing Core NATS subject mapping and transformation with the ability to define ‘cluster-scoped’ mappings and transformations.
- The relaxation of some of the stream sourcing and subject mapping and transformation rules, including allowing a wildcard subject token to be dropped as part of the transformation (unless the mapping is part of a cross-account import/export).
Writes to the virtual stream
For each region, there is a ‘write’ stream located in that regional cluster that captures the messages published on subjects prepended with a subject token designating the region this stream is servicing. The stream listens to subjects that contain a token identifying the region.
As a simple example, for a virtual stream foo capturing messages published on subjects matching foo.> (i.e. any subject starting with the token foo), the write stream in region west could be called foo-write-west and listen on foo.west.> (you can change the order of the subject tokens and use wildcards to suit your needs).
Once you have done that in all your regions you can JetStream-publish (from anywhere) a message on a subject matching foo.west.> and it will be persisted in the write stream in region west. But that means the client application has to know which region it is connected to in order to know which subject to publish to. This can be remedied with Core NATS subject mappings (which are defined at the account level): define one cluster-scoped subject mapping per region so that, in our example, there is a mapping from foo.> to foo.west.> that applies only in cluster west. Any message published by an application connected to the west cluster on a subject starting with foo is then transparently handled as if it had been published on the corresponding subject starting with foo.west (e.g. foo.test becomes foo.west.test).
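The effect of the cluster-scoped mapping on a published subject can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not how nats-server implements mappings; the table and function names are made up for illustration, and subjects that do not match foo.> are assumed to pass through unchanged.

```python
# Simplified model of the per-cluster mapping of "foo.>".
CLUSTER_DESTINATIONS = {
    "west": "foo.west.>",
    "central": "foo.central.>",
    "east": "foo.east.>",
}


def map_publish_subject(subject: str, cluster: str) -> str:
    """Return the subject a message ends up on when published in `cluster`."""
    tokens = subject.split(".")
    # "foo.>" matches the token "foo" followed by at least one more token.
    if len(tokens) < 2 or tokens[0] != "foo":
        return subject
    dest_tokens = CLUSTER_DESTINATIONS[cluster].split(".")[:-1]  # drop ">"
    return ".".join(dest_tokens + tokens[1:])


print(map_publish_subject("foo.test", "west"))
print(map_publish_subject("foo.test", "east"))
```

The same publish on foo.test therefore lands on a different regional subject depending only on which cluster the client is connected to.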
Eventually replicate the writes to the virtual stream
The second set of streams underlying the virtual stream are the ‘read’ streams, which source the ‘write’ streams, and strip the token indicating the region of origin from the subject.
So using the same simple example, in region ‘west’ there would be a stream foo-read-west that doesn’t listen to any subjects, sources from the streams foo-write-east, foo-write-central and foo-write-west, and strips the region name token by applying a subject transform from foo.*.> to foo.> (i.e. dropping the second token of the subject name). This means that the messages in the ‘read’ streams are stored under subjects starting with foo, the same subjects the publishing application used (you can still tell which region a message was published in from a message header).
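As an illustration of what the foo.*.> to foo.> transform does to a subject, here is a small Python model. It is a sketch of the described behaviour only (the function name is hypothetical), and it assumes that subjects not matching foo.*.> are left untouched.

```python
def strip_region_token(subject: str) -> str:
    """Model of the transform foo.*.> -> foo.> (drop the second token)."""
    tokens = subject.split(".")
    # "foo.*.>" needs "foo", one region token, and at least one more token.
    if len(tokens) < 3 or tokens[0] != "foo":
        return subject
    return ".".join([tokens[0]] + tokens[2:])


for stored in ("foo.west.test", "foo.east.orders.new"):
    print(stored, "->", strip_region_token(stored))
```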
Because of the reliable store-and-forward stream sourcing mechanism, you are assured that all the ‘read’ streams will eventually contain all of the messages published on all of the ‘write’ streams, although not necessarily in the same order.
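The difference between "same messages" and "same order" can be seen in a toy Python model of two read streams. The message names and arrival orders below are invented for illustration; each read stream simply numbers messages in the order it happens to receive them from the write streams.

```python
def build_read_stream(arrivals):
    """Store messages in arrival order, assigning local sequence numbers from 1."""
    return list(enumerate(arrivals, start=1))


def contents(stream):
    """The set of messages held by a stream, ignoring sequence numbers."""
    return {msg for _seq, msg in stream}


def sequence_of(stream, wanted):
    """Local sequence number of a message in one read stream."""
    return next(seq for seq, msg in stream if msg == wanted)


# Hypothetical messages: the prefix letter marks the region of origin.
# Both interleavings keep each write stream's own order (w1 before w2).
read_west = build_read_stream(["w1", "w2", "e1", "c1"])
read_east = build_read_stream(["e1", "c1", "w1", "w2"])

print(contents(read_west) == contents(read_east))
print(sequence_of(read_west, "e1"), sequence_of(read_east, "e1"))
```

Both streams end up holding the same set of messages, but the message e1 has a different sequence number in each region, which is why operations that depend on sequence numbers do not carry over between regions.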
Reading from the virtual stream
Except for streams where the ‘direct get’ option is enabled (e.g. KV buckets), where direct get operations are automatically directed to any of the nodes within the local cluster replicating a mirror of the stream, a client application that wants to interact with a locally mirrored or sourced stream needs to know the name of the local stream, which means that it needs to know which region it is connected to. Avoiding this constraint works just like transparently handling publications to the virtual stream: it can be done by setting a few (cluster-scoped) subject mapping transformations for the account at the Core NATS level.
Besides the aforementioned direct get requests, client applications ‘read’ (or consume) messages from a stream by creating JetStream consumers (shared or not). This is implemented over a number of JetStream API subjects which (unless JS domains are used) start with $JS.API and also contain a stream name and/or a consumer name as tokens of the subject. These subjects can be mapped so that requests to create consumers on a stream foo are transparently transformed into requests to create consumers on the local foo-read-<region> stream instead.
So for example: define a cluster-scoped subject mapping from "$JS.API.CONSUMER.CREATE.foo.*" to "$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}" on cluster west such that any application connected to that cluster and creating a consumer on stream foo will create a consumer on stream foo-read-west.
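To show how the {{wildcard(1)}} placeholder carries the consumer name across, here is a small Python model of such a mapping. It is only a sketch: it handles single-token * wildcards and {{wildcard(n)}} placeholders, not the full mapping syntax, and the function name and the consumer name "dashboard" are made up for illustration.

```python
import re


def apply_mapping(source: str, destination: str, subject: str):
    """Return the mapped subject, or None if `subject` does not match `source`.

    Supports only '*' in the source and {{wildcard(n)}} in the destination.
    """
    src_tokens = source.split(".")
    sub_tokens = subject.split(".")
    if len(src_tokens) != len(sub_tokens):
        return None
    captured = []
    for pattern_tok, tok in zip(src_tokens, sub_tokens):
        if pattern_tok == "*":
            captured.append(tok)  # remember the token matched by this wildcard
        elif pattern_tok != tok:
            return None
    # Replace each {{wildcard(n)}} with the n-th captured token (1-based).
    return re.sub(
        r"\{\{wildcard\((\d+)\)\}\}",
        lambda m: captured[int(m.group(1)) - 1],
        destination,
    )


print(apply_mapping(
    "$JS.API.CONSUMER.CREATE.foo.*",
    "$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}",
    "$JS.API.CONSUMER.CREATE.foo.dashboard",
))
```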
What you can NOT do with a virtual stream
- The retention policy of the virtual stream can not be ‘work queue’ or ‘interest’ (i.e. only ‘limits’).
- It does not work for KV buckets unless you know that you are not modifying the same key at the same time (or during a split brain) from two different regions.
- Stream consumers are ‘per region’, meaning you do not have a global named durable consumer on the virtual stream, but multiple regional ones.
- Deleting individual messages from a virtual stream is not possible: the delete operation only applies to the local ‘read’ stream and is not propagated (nor should it be, as the message sequence numbers are not homogeneous between regions).
- Compare-and-set operations are not possible on the virtual stream, as you would be reading from one stream and writing to another and the message sequence numbers are not preserved between them.
Walkthrough
In this example, we’re going to walk through setting up a local Super-Cluster and creating a virtual stream ‘foo’. Make sure to install (or upgrade to) the latest version of the NATS server and of the nats CLI tool on your local machine, then clone the associated repository:
git clone https://github.com/synadia-labs/eventually-consistent-virtual-global-stream.git
The setup
This walkthrough will create and start locally a total of 9 nats-server instances, organized in 3 clusters (east, central and west) of 3 nodes each and interconnected as a Super-Cluster. Once those servers are started, all of the ‘read’ and ‘write’ streams for all 3 regions are created.
You will then be able to play with the virtual stream foo using nats, connecting to the different local clusters and publishing to or reading from the (virtual) stream foo as if it were a single globally replicated stream.
Server configurations
The individual server configuration files are straightforward. Each server establishes route connections to its 2 other peers in the cluster, and the clusters are connected via gateway connections. In this example, all of the individual server’s configuration files import a single mappings.cfg file containing all of the Core NATS account level subject mapping transforms, which in this case are all cluster-scoped. If you were running your servers in the ‘operator’ security mode, those mappings would be stored (in the account resolver) as part of the account(s) JWT(s) instead.
mappings = {
"foo.>":[
{destination:"foo.west.>", weight: 100%, cluster: "west"},
{destination:"foo.central.>", weight: 100%, cluster: "central"},
{destination:"foo.east.>", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.INFO.foo":[
{destination:"$JS.API.STREAM.INFO.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.INFO.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.INFO.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.DURABLE.CREATE.foo.*":[
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.CREATE.foo.*":[
{destination:"$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.CREATE.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.CREATE.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.MSG.GET.foo":[
{destination:"$JS.API.STREAM.MSG.GET.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.MSG.GET.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.MSG.GET.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.DIRECT.GET.foo":[
{destination:"$JS.API.DIRECT.GET.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.DIRECT.GET.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.DIRECT.GET.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.MSG.DELETE.foo":[
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.MSG.NEXT.foo.*":[
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.ACK.foo.>":[
{destination:"$JS.ACK.foo-read-west.>", weight: 100%, cluster: "west"},
{destination:"$JS.ACK.foo-read-central.>", weight: 100%, cluster: "central"},
{destination:"$JS.ACK.foo-read-east.>", weight: 100%, cluster: "east"}
]
}
Starting the servers
You can start the entire Super-Cluster using the provided simple script.
source startservers
This script also defines 3 nats contexts to allow you to easily select which cluster you want to connect to: sc-west, sc-central and sc-east.
Defining the local streams
After a few seconds the Super-Cluster should be up and running. You then need to define, for the first time, all of the required local streams. They are configured using JSON files, and there is a simple convenience script to define them all.
source definestreams
Taking the west cluster as an example, below are the JSON stream definitions for both streams.
The local ‘write’ stream is quite straightforward: it is named "foo-write-west" and all it needs to do is listen on the subjects "foo.west.>":
{
"name": "foo-write-west",
"subjects": [
"foo.west.>"
],
"retention": "limits",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 3600000000000,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 120000000000,
"placement": {
"cluster": "west"
},
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": false,
"mirror_direct": false
}
Note that in this example a max_age limit of 3600000000000 nanoseconds (1 hour) is set on the ‘write’ streams, meaning that the longest regional outage or split-brain that can be recovered from without losing any written messages is 1 hour. You need a limit to ensure that the ‘write’ streams don’t just grow forever, as they only need to hold data for as long as the outage lasts. Adjust this limit to fit your specific requirements.
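Since durations such as max_age and duplicate_window are written as nanosecond integers in the JSON definitions, a small helper can make them easier to compute when adjusting the limit. The following Python sketch is only a convenience for producing those integers; the helper name is made up for illustration.

```python
from datetime import timedelta


def to_nanos(delta: timedelta) -> int:
    """Convert a timedelta to an integer number of nanoseconds."""
    # Work in whole microseconds to avoid floating-point rounding.
    return (delta // timedelta(microseconds=1)) * 1_000


print("max_age for 1 hour:", to_nanos(timedelta(hours=1)))
print("duplicate_window for 2 minutes:", to_nanos(timedelta(minutes=2)))
```

The two printed values correspond to the max_age and duplicate_window settings used in the stream definition above.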
The local ‘read’ stream doesn’t listen to any subjects and sources all the ‘write’ streams (see the sources array) and performs a simple subject transformation to drop the token in the subject name that contains the name of the region of origin (see the subject_transform stanza).
{
"name": "foo-read-west",
"retention": "limits",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 0,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 120000000000,
"placement": {
"cluster": "west"
},
"subject_transform": {
"src":"foo.*.>",
"dest":"foo.>"
},
"sources": [
{
"name": "foo-write-west",
"filter_subject": "foo.west.>"
},
{
"name": "foo-write-east",
"filter_subject": "foo.east.>"
},
{
"name": "foo-write-central",
"filter_subject": "foo.central.>"
}
],
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": false,
"mirror_direct": false
}
So, using the region ‘west’ as an example, a message published on foo.test by an application connected to the ‘west’ cluster is first stored with the subject foo.west.test in the foo-write-west stream. The stream foo-read-west sources from foo-write-west and strips the second token of the subject, so that the message ends up being stored in that stream with the subject foo.test.
Drawing of the transformation of the subject of a message published on foo.test in region west as it makes its way from a publishing to a consuming client application.
繪製在 foo.test 上發佈的 west 區域的消息主題從發佈客户端應用程序到消費客户端應用程序的轉變過程。
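To make the subject round trip concrete, here is a small Python sketch that mimics the effect of the two steps described above on a subject string. It is only an illustration of the token manipulation, not how nats-server implements subject transforms, and the function names are hypothetical.

```python
def to_write_subject(subject: str, region: str) -> str:
    """Insert the region as the second token (foo.test -> foo.west.test)."""
    first, rest = subject.split(".", 1)
    return f"{first}.{region}.{rest}"


def to_read_subject(subject: str) -> str:
    """Drop the second token, mimicking the effect of foo.*.> -> foo.>"""
    tokens = subject.split(".")
    # foo.*.> needs 'foo', one region token, and at least one more token.
    if len(tokens) < 3 or tokens[0] != "foo":
        raise ValueError(f"subject {subject!r} does not match foo.*.>")
    return ".".join([tokens[0]] + tokens[2:])


stored = to_write_subject("foo.test", "west")
print(stored)                   # foo.west.test
print(to_read_subject(stored))  # foo.test
```

Because the region token is removed again on the ‘read’ side, a consumer sees the same subject the publisher used, regardless of which region the message originated in.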
與全局虛擬流交互 Interacting with the global virtual stream
You can use nats --context to interact with the stream as would a client connecting to the different clusters.
您可以使用 nats --context 與流進行交互,就像客户端連接到不同的集羣一樣。
For example, let’s connect to the ‘west’ cluster and publish a message on the subject foo.test:
例如,讓我們連接到“west”集羣,併發布一條主題為 foo.test 的消息:
nats --context sc-west req foo.test 'Hello world from the west region'
We use nats req rather than nats pub here in order to see the JetStream publish acknowledgement, just as a client application would when using the JetStream publish() call and checking that the PubAck does not contain an error.
這裏使用 nats req 而不是 nats pub,是為了像客户端應用程序使用 JetStream publish() 調用時那樣,查看 JetStream 發佈確認信息,並檢查 PubAck 是否包含錯誤。
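The reply to such a request is a small JSON document. As a rough sketch of the check a client library performs on it, the Python below parses a PubAck and raises if it carries an error. The sample payloads are made up for illustration (assuming the usual stream/seq fields on success and an error object with a description on failure); they are not output captured from this walk-through.

```python
import json


def check_pub_ack(raw: bytes) -> int:
    """Return the stream sequence from a PubAck, or raise if it holds an error."""
    ack = json.loads(raw)
    if "error" in ack:
        err = ack["error"]
        raise RuntimeError(f"publish failed: {err.get('description', 'unknown error')}")
    return ack["seq"]


# Illustrative payloads only (assumed shapes, hypothetical values).
ok = b'{"stream": "foo-write-west", "seq": 1}'
bad = b'{"error": {"code": 503, "description": "example failure"}}'

print(check_pub_ack(ok))
try:
    check_pub_ack(bad)
except RuntimeError as exc:
    print(exc)
```

Note that in this design the acknowledgement comes from the local ‘write’ stream, so a successful PubAck means the message is safely stored in the local region, not that it has already reached the other regions.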
We can then check that the message has indeed propagated to all the regions, in this example using the nats stream view command (which creates an ephemeral consumer on the stream and then iterates over it to get and display the messages).
然後,我們可以檢查消息是否確實已傳播到所有區域。在本例中,我們使用 nats stream view 命令(該命令會在流上創建一個臨時消費者,然後遍歷該消費者以獲取並顯示消息)。
nats --context sc-west stream view foo
You can see that the message stored in the global virtual ‘foo’ stream is indeed there with the subject foo.test which we used earlier to publish the message. Let’s check that the message has also made it to the other clusters:
您可以看到,存儲在全局虛擬流“foo”中的消息確實存在,其主題為foo.test,我們之前就是用這個主題發佈消息的。讓我們檢查一下該消息是否也已發送到其他集羣:
nats --context sc-central stream view foo
和
nats --context sc-east stream view foo
You can even run nats stream info on the virtual stream (this will show you the info about your local ‘read’ stream), but note that nats stream ls doesn’t show the global virtual stream; it lists all of its underlying (non-virtual) local streams instead.
你甚至可以對虛擬流執行 nats stream info 命令(這將顯示本地“讀取”流的信息),但請注意,nats stream ls 命令顯示的不是全局虛擬流,而是其所有(非虛擬)底層本地流。
模擬災難 Simulating disasters
You can simulate whole regions going down by killing all of the nats-server processes for a region. There are some simple convenience scripts in the repository to kill or restart regions easily.
你可以通過終止某個區域的所有 nats-server 進程來模擬整個區域的宕機。倉庫中提供了一些簡單的便捷腳本,可以輕鬆地終止或重啓區域。
終止單個區域 Killing one region
For example, let’s first kill the ‘central’ region cluster:
例如:我們先終止中央區域的集羣。
source killcentral
Then publish a message from ‘east’
然後從 east 發佈一條消息
nats --context sc-east req foo.test 'Hello world from the east region'
Check that the message made it to ‘west’
檢查一下信息是否已送達 west。
nats --context sc-west stream view foo
Then restart ‘central’
然後重新啓動 central
source startcentral
It may take up to a couple of seconds for the recovery to complete. Then check that the message is now there in ‘central’
恢復過程可能需要幾秒鐘才能完成,然後檢查消息是否已出現在 central 位置。
nats --context sc-central stream view foo
同時終止兩個區域並模擬腦裂 Killing two regions at once and simulating a split brain
The two failure scenarios are similar and related: a split brain from the point of view of the region getting isolated is no different from both of the other two regions going down at the same time.
這兩種故障場景相似且相關:從區域隔離的角度來看,腦裂與另外兩個區域同時宕機並無本質區別。
The difference is that in the case of a split brain, the two other regions that can still see each other continue to operate normally (including processing new ‘writes’), while the isolated region ends up in the same ‘limited’ mode of operation as when two regions go down at the same time.
不同之處在於,在腦裂的情況下,仍然可以相互通信的兩個區域會繼續正常運行(包括處理新的寫入操作),而隔離的區域最終會進入與兩個區域同時宕機時相同的“受限”運行模式。
As soon as the network partition gets resolved or as the missing regions come back up the two parts of the brain will replicate missed messages between themselves and eventually become consistent again (though not necessarily in the same order).
一旦網絡分區得到解決或缺失的區域恢復運行,大腦的兩個部分就會相互複製丟失的信息,最終再次保持一致(儘管順序可能有所不同)。
When two regions go down at the same time, or when a region is the smaller part of a split brain, the remaining region can still operate but in a ‘limited’ fashion: not all functionality is available because the remaining nodes cannot elect a JetStream ‘meta leader’.
如果兩個區域同時宕機,或者某個區域是腦裂中較小的部分,剩餘區域仍能運行,但功能會受到限制,因為剩餘節點無法選舉 JetStream 的“元領導者”,導致部分功能不可用。
- Publications to the stream will still work; the only way publications to the stream in a region would stop working is if 2 of the 3 servers in the region (or 3 out of 5) go down at the same time.
向流發佈消息仍然有效,只有當區域中的 3 個服務器中的 2 個(或 5 個服務器中的 3 個)同時宕機時,向流發佈消息才會停止。
- Get operations (e.g. what the KV ‘get’ operation uses) will still work.
獲取操作(例如,KV 的“get”操作)仍然有效。
- Getting messages from already existing consumers (at the time the second region goes down) on the stream will still work, and locally published messages will be seen in the ‘read’ stream right away.
從流上已有的消費者(在第二個區域宕機時)獲取消息仍然有效,本地發佈的消息會立即出現在“讀取”流中。
- However, creating new consumers (or new streams) will not work.
但是,創建新的消費者(或新的流)將無法正常工作。
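The behaviour in the list above comes down to majority arithmetic. The sketch below works through it in Python, assuming three regions of three servers each, a meta group made up of all nine servers, and simple majority voting; the server counts are assumptions chosen to match the "2 of the 3 servers" figure above, and the function names are illustrative.

```python
def quorum(voters: int) -> int:
    """Smallest majority of a group with the given number of voters."""
    return voters // 2 + 1


def has_quorum(total: int, alive: int) -> bool:
    """True if the surviving members can still form a majority."""
    return alive >= quorum(total)


SERVERS_PER_REGION = 3  # assumption for this sketch
REGIONS = 3             # west, east, central
total = SERVERS_PER_REGION * REGIONS

# One region down: 6 of 9 servers remain, a meta leader can be elected.
print(has_quorum(total, alive=6))  # True
# Two regions down, or an isolated region: 3 of 9, no meta leader.
print(has_quorum(total, alive=3))  # False
# The local 3-replica streams keep their own quorum inside the region...
print(has_quorum(3, alive=3))      # True
# ...until 2 of the 3 servers in the region are lost.
print(has_quorum(3, alive=1))      # False
# With 5 servers, losing 3 leaves 2, below the quorum of 3.
print(has_quorum(5, alive=2))      # False
```

This is why publishing and reading keep working in an isolated region (the local streams still have a majority) while operations that need the meta leader, such as creating consumers or streams, do not.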
First kill both ‘west’ and ‘east’
首先關掉 west 和 east
source killwest; source killeast
Publish a new message on ‘central’ (as if it were an isolated region)
在 central (如同一個孤立區域)發佈一條新消息
nats --context sc-central req foo.test 'Hello world from the central region'
Then bring down the ‘central’ region and bring ‘east’ and ‘west’ back up
然後關掉 central 區域,並重新啓動 east 和 west 區域。
source killcentral; source startwest; source starteast
Wait up to a couple of seconds and publish another message from one of those two regions
稍等幾秒鐘,然後從這兩個區域之一發布另一條消息。
nats --context sc-east req foo.test 'Hello again from the east region'
Check you can create a new consumer and see that message from the other region
請檢查您是否可以創建一個新的消費者,並查看來自其他區域的消息。
nats --context sc-west stream view foo
And finally resolve the split brain by restarting ‘central’
最後通過重啓 central 來解決裂腦問題。
source startcentral
After a few seconds you can see that all the messages are now present in all the ‘read’ streams, though not necessarily in the same order, by comparing the output of
幾秒鐘後,你可以看到所有消息現在都出現在所有 read 流中,儘管順序不一定相同,這可以通過比較輸出結果來判斷。
nats --context sc-west stream view foo
與
nats --context sc-central stream view foo
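Since the regions can end up with the same messages in different orders, a plain line-by-line diff of the two outputs may report differences even when nothing is missing. A minimal Python sketch of an order-insensitive comparison follows; the message lists and their ordering are hypothetical, typed in by hand rather than captured from the commands above.

```python
from collections import Counter


def same_messages(a: list[str], b: list[str]) -> bool:
    """True if both lists hold the same messages, ignoring order."""
    return Counter(a) == Counter(b)


# Hypothetical orderings, for illustration only.
west = [
    "Hello world from the west region",
    "Hello world from the east region",
    "Hello again from the east region",
    "Hello world from the central region",
]
central = [
    "Hello world from the west region",
    "Hello world from the east region",
    "Hello world from the central region",
    "Hello again from the east region",
]

print(west == central)               # False: the order differs
print(same_messages(west, central))  # True: the content is the same
```

Counter is used rather than set so that a message published twice with identical content still has to appear twice on both sides.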
如果您仍然需要全局順序怎麼辦? What if you still want global ordering?
If you want to retain the ability to handle client writes locally and yet still want global ordering of the messages, you can use a stretch cluster to home an ‘ordering’ stream and have the local read streams mirror that stream. The write streams remain the same and that ‘ordering’ stream sources from them. Compared to simply having the stream located in the stretch cluster and the read streams mirroring it, and having the client applications just publish directly to the stretched stream, this provides lower ‘write’ latency and higher availability but does take away the ‘compare-and-set’ functionality (e.g. the KV Update operation) that you still retain when writing directly to the stretched stream.
如果您希望保留在本地處理客户端寫入操作的能力,同時又希望消息保持全局排序,可以使用擴展集羣來託管一個“ordering”(排序)流,並讓本地 read 流鏡像該流。write 流保持不變,而“ordering”流則從這些 write 流中獲取數據。與簡單地將流放置在擴展集羣中、讓 read 流鏡像該流、並讓客户端應用程序直接發佈到擴展流相比,這種方法可以降低寫入延遲並提高可用性,但會失去直接寫入擴展流時仍然保留的“compare-and-set”(比較並設置)功能(例如 KV 的 Update 操作)。
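As a rough sketch of what that topology could look like, the Python below builds stream configurations in the same JSON shape as the ‘read’ stream shown earlier. The stream name foo-order and the cluster name stretch are hypothetical, and the configurations are an assumption about one way to wire this up (an ‘ordering’ stream that sources the ‘write’ streams, and ‘read’ streams that mirror it), not a configuration taken from the article's repository.

```python
import json

REGIONS = ["west", "east", "central"]


def ordering_stream(name: str = "foo-order", cluster: str = "stretch") -> dict:
    """Single 'ordering' stream, homed in the (hypothetical) stretch cluster."""
    return {
        "name": name,
        "storage": "file",
        "num_replicas": 3,
        "placement": {"cluster": cluster},
        # Strip the region token, as the 'read' streams did before.
        "subject_transform": {"src": "foo.*.>", "dest": "foo.>"},
        "sources": [
            {"name": f"foo-write-{r}", "filter_subject": f"foo.{r}.>"}
            for r in REGIONS
        ],
    }


def read_stream(region: str, ordering: str = "foo-order") -> dict:
    """Local 'read' stream that mirrors the 'ordering' stream."""
    return {
        "name": f"foo-read-{region}",
        "storage": "file",
        "num_replicas": 3,
        "placement": {"cluster": region},
        "mirror": {"name": ordering},
    }


print(json.dumps(ordering_stream(), indent=2))
print(json.dumps(read_stream("west"), indent=2))
```

Every ‘read’ stream then receives messages in the single order decided by the ‘ordering’ stream, while publishers still get their acknowledgement from the local ‘write’ stream.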
結論 Conclusion
When it comes to multi-region, multi-cloud or multi-site active-active consistent ‘global’ deployments, NATS JetStream not only has all of the needed functionality built in but also offers extensive flexibility when it comes to replication, mirroring, sourcing, and generally creating local consistent copies of the data, including the ability to choose (on a per-stream basis) between immediate and eventual global consistency. And by leveraging some of the new features of 2.10, you can make eventually consistent globally distributed streams in a manner that is completely transparent to the client applications: the client application doesn’t even need to know which region it is deployed in (e.g. it doesn’t need to be configured with a region name), and yet both reading from and writing to the stream are still handled by the local regional NATS servers (and therefore with low latency).
對於多區域/多雲/多站點等雙活(active-active)一致的“全局”部署,NATS JetStream 不僅內置了所有必要的功能,而且在數據複製、鏡像、數據源以及創建本地一致性副本方面也具有極大的靈活性,包括能夠(基於每個數據流)選擇立即或最終的全局一致性。利用 2.10 版本的一些新特性,您可以創建最終一致的全局分佈式數據流,並且這對客户端應用程序完全透明:客户端應用程序甚至無需知道其部署在哪個區域(例如,無需配置區域名稱),並且對數據流的讀取和寫入仍然由本地區域的 NATS 服務器處理(從而實現低延遲)。
When it comes to global distributed immediate or eventual data consistency with JetStream you can indeed have your cake and eat it too!
使用 JetStream 實現全局分佈式立即或最終數據一致性,您確實可以兩全其美!