Distributed data structures
This preview version is provided without a service-level agreement, and it’s not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.
The Fluid Framework provides developers with distributed data structures (DDSes) that automatically ensure that each connected client has access to the same state. The APIs provided by DDSes are designed to be familiar to programmers who’ve used common data structures before.
This article assumes that you are familiar with Introducing distributed data structures on fluidframework.com.
A distributed data structure behaves like a local data structure. Your code can add data to it, remove data, update it, etc. However, a DDS is not a local object. A DDS can also be changed by other clients that expose the same parent container of the DDS. Because users can simultaneously change the same DDS, you need to consider which DDS to use for modeling your data.
Meaning of ‘simultaneously’
Two or more clients are said to make a change simultaneously if they each make a change before they have received the others’ changes from the server.
Choosing the correct data structure for your scenario can improve the performance and code structure of your application.
DDSes vary from each other by three characteristics:
- Basic data structure: For example, a key-value pair, a sequence, or a queue.
- Client autonomy: An optimistic DDS enables any client to unilaterally change a value, and the new value is relayed to all other clients. A consensus DDS, by contrast, allows a change only if it is accepted by other clients through a consensus process.
- Merge policy: The policy that determines how conflicting changes from clients are resolved.
Below we’ve enumerated the data structures and described when they may be most useful.
Key-value data
These DDSes are used for storing key-value data. They are optimistic and use a last-writer-wins merge policy. Although the value of a pair can be a complex object, the value of any given pair cannot be edited directly; the entire value must be replaced with a new value containing the desired edits, whole-for-whole.
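The last-writer-wins policy described above can be sketched in a few lines of Java. This is a minimal illustration, not the Fluid Framework API: the class name LwwMapSketch, the apply method, and the idea that a server stamps every set operation with an increasing sequence number are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of last-writer-wins merging; not the Fluid Framework API.
class LwwMapSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> lastSeq = new HashMap<>();

    // Apply a "set" operation. The sequence number is assumed to be assigned
    // by an ordering server; the highest sequence number for a key wins,
    // regardless of the order in which operations reach this replica.
    void apply(long seq, String key, String wholeValue) {
        long seen = lastSeq.getOrDefault(key, -1L);
        if (seq > seen) {
            values.put(key, wholeValue); // the entire value is replaced, never merged
            lastSeq.put(key, seq);
        }
    }

    String get(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        LwwMapSketch replica1 = new LwwMapSketch();
        replica1.apply(1, "title", "Draft");
        replica1.apply(2, "title", "Final");

        LwwMapSketch replica2 = new LwwMapSketch();
        replica2.apply(2, "title", "Final");
        replica2.apply(1, "title", "Draft"); // arrives late and is ignored

        // Both replicas converge on the value with the highest sequence number.
        System.out.println(replica1.get("title").equals(replica2.get("title"))); // prints true
    }
}
```

Note that the value written with sequence number 1 is simply discarded: nothing from it is merged into the winning value, which is why edits to a complex value must be written back whole.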
Key-value scenarios
Key-value data structures are the most common choice for many scenarios.
Common issues and best practices for key-value DDSes
- Storing a counter in a SharedMap will have unexpected behavior. Use the SharedCounter instead.
- Storing arrays, lists, or logs in a key-value entry may lead to unexpected behavior because users can’t collaboratively modify parts of one entry. Try storing the array or list data in a SharedSequence or SharedInk.
- Storing a lot of data in one key-value entry may cause performance or merge issues. Each update will update the entire value rather than merging two updates. Try splitting the data across multiple keys.
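To see why a counter stored in a last-writer-wins entry misbehaves, consider the following sketch. It is a simplified simulation, not Fluid Framework code: the names CounterSketch, lastWriterWins, and mergeIncrements are hypothetical, and all clients are assumed to read the same starting value before any write is delivered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of simultaneous increments; not the Fluid Framework API.
class CounterSketch {
    // Last-writer-wins: every client reads the same value, adds 1,
    // and writes the whole value back. Later writes replace earlier ones.
    static int lastWriterWins(int start, int clients) {
        List<Integer> writes = new ArrayList<>();
        for (int i = 0; i < clients; i++) {
            writes.add(start + 1); // each client computed from the same stale read
        }
        int value = start;
        for (int w : writes) {
            value = w; // whole-value replacement: earlier increments are lost
        }
        return value;
    }

    // Counter-style merge: clients send "+1" deltas, and the deltas are summed.
    static int mergeIncrements(int start, int clients) {
        int value = start;
        for (int i = 0; i < clients; i++) {
            value += 1;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(lastWriterWins(0, 3));  // prints 1
        System.out.println(mergeIncrements(0, 3)); // prints 3
    }
}
```

Three simultaneous increments leave the last-writer-wins entry at 1, while a structure that merges increment operations ends at 3.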
Sequences
These DDSes are used for storing sequential data. They are optimistic. Sequence data structures are useful when you’ll need to add or remove data at a specified index in a list or array. Unlike the key-value data structures, sequences have a sequential order and can handle simultaneous inserts from multiple users.
- SharedNumberSequence : a sequence of numbers.
- SharedObjectSequence : a sequence of plain objects.
Sequence scenarios
Common issues and best practices for sequence DDSes
- Store only immutable data as an item in a sequence. The only way to change the value of an item is to first remove it from the sequence and then insert a new value at the position where the old value was. But because other clients can insert and remove, there’s no reliable way of getting the new value into the desired position.
Strings
The SharedString DDS is used for unstructured text data that can be collaboratively edited. It is optimistic.
String scenarios
To learn more about DDSes and how to use them, see the following sections of fluidframework.com:
Distributed data structures in Java
I’m going to develop my own message queue implementation in Java and I need to distribute the queue content across multiple servers so that it would provide reliability and redundancy.
In addition to that I need to persist the queue content into the file system.
Can somebody tell me what is the most suitable distributed data structure implementation to hold my queue content?
Note: The data structure must preserve message ordering. That is, I need to receive messages in the order they arrived. Also, while a message is being read, it should be in a ‘locked’ state so that other consumers can’t read it until the first consumer completes the reading process.
Have you looked at any of the many existing message queue implementations for Java? Wikipedia lists many open source implementations. It seems to me that an existing, thoroughly tested message queue is the best place to hold your queue content 🙂
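The two requirements in the note (arrival order, plus a ‘locked’ state while a message is being read) can be sketched on a single node as follows. This is only an illustration of the semantics under stated assumptions: the class LockingQueueSketch and its methods are hypothetical names, and the sketch includes no distribution, replication, or file-system persistence.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical single-node sketch of FIFO delivery with a "locked" (in-flight) state.
class LockingQueueSketch {
    private final Deque<String> ready = new ArrayDeque<>();
    private final Map<Long, String> inFlight = new HashMap<>();
    private long nextReceipt = 0;

    synchronized void publish(String message) {
        ready.addLast(message);
    }

    // Hand the oldest message to one consumer and hide it from all others.
    // The returned receipt identifies the locked message.
    synchronized Optional<Long> receive() {
        String message = ready.pollFirst();
        if (message == null) {
            return Optional.empty();
        }
        long receipt = nextReceipt++;
        inFlight.put(receipt, message);
        return Optional.of(receipt);
    }

    synchronized String body(long receipt) {
        return inFlight.get(receipt);
    }

    // The consumer finished reading: drop the message for good.
    synchronized void ack(long receipt) {
        inFlight.remove(receipt);
    }

    // The consumer gave up: unlock the message and put it back at the head.
    // Simplification: if several messages are released out of order,
    // their relative order is not restored.
    synchronized void release(long receipt) {
        String message = inFlight.remove(receipt);
        if (message != null) {
            ready.addFirst(message);
        }
    }

    public static void main(String[] args) {
        LockingQueueSketch queue = new LockingQueueSketch();
        queue.publish("first");
        queue.publish("second");
        long receipt = queue.receive().orElseThrow();
        System.out.println(queue.body(receipt)); // prints first
        queue.ack(receipt);
    }
}
```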
If you absolutely want to write your own, then starting with the open source solution that most fits your needs would probably answer most of your questions about what data structures work well.
Concurrent data structures vs. Distributed data structures
In the context of multi-processor/multi-threaded systems, there are plenty of well-studied concurrent data structures, including stacks, queues, linked lists, etc. Here is an excellent survey on concurrent data structures by Mark Moir and Nir Shavit.
Even though they use a “shared memory” model similar to the one used by concurrent data structures, I can only find information on a few distributed data structures: those data structures designed specifically for distributed systems. Such data structures are typically characterized by replication and certain consistency models. The examples I have found include
- An Optimized Conflict-free Replicated Set (arXiv, 2012)
- Randomized Shared Queues (PODC, 2001)
My questions are:
- What are the big challenges of designing distributed data structures (even harder than those of concurrent data structures)?
- Did I miss other distributed data structures in the literature?
What are the big challenges of designing distributed data structures (even harder than those of concurrent data structures)?
Some important challenges that practically all distributed data structures face are handling dynamic changes, implementing a scalable design, and being fault-tolerant.
This includes finding answers to questions such as:
- How can we maintain/repair the properties of the data structure in the presence of churn, that is, when new nodes join and old nodes leave the network over time?
- Can we design the data structure such that it is robust against faults?
- How can we deal with the congestion on the communication links caused by many parallel requests?
- Do the connections required per node and the necessary length of messages scale (at most) logarithmically with the system size?
- Is it possible to design a system that is “spam resistant” in the sense that it can withstand attacks by adversarial nodes?
There are also locality issues since, in a distributed system, each node runs its own instance of a distributed algorithm and has only a local view of the network due to being directly connected to only a small number of other nodes. (Typically you would want a node degree of $O(\log n)$ to make the system scalable.) These issues come into play when maintaining global state such as counting the number of data items, finding the maximum, etc.
Did I miss other distributed data structures in the literature?
DHTs: To give you some pointers, you might want to look at distributed hash tables (DHTs) such as Chord, CAN, Tapestry, and Pastry.
Skip Graphs: Since you mentioned skip lists, you might be interested in skip graphs, a data structure that provides range queries and $O(\log n)$-time operations for lookups, inserts, etc. The advantage of a skip graph (vs. a skip list) is that a skip graph contains an expander as a subgraph with high probability. This implies that routing can be done efficiently (i.e., link congestion is low) and that the skip graph remains connected even if a lot of nodes fail.
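To make the DHT pointer and the churn question a bit more concrete, here is a minimal sketch of ring-based key placement in the style of consistent hashing, where a key belongs to the first node at or after its position on a ring. It is an illustration under assumptions, not an implementation of Chord or any other system named above: the class HashRingSketch, the 16-bit ring size, and the use of String.hashCode are all hypothetical choices, and the sketch uses a global view of the ring instead of the $O(\log n)$ routing state a real DHT node would keep.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of ring-based key placement; not an implementation of Chord.
class HashRingSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Assumed hash: maps a name to a position on a ring of size 2^16.
    static int position(String name) {
        return (name.hashCode() & 0x7fffffff) % 65536;
    }

    // Simplification: two nodes hashing to the same position are not handled.
    void join(String node) {
        ring.put(position(node), node);
    }

    void leave(String node) {
        ring.remove(position(node));
    }

    // A key is owned by the first node at or after its position, wrapping around.
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(position(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRingSketch ring = new HashRingSketch();
        ring.join("node-a");
        ring.join("node-b");
        ring.join("node-c");
        System.out.println(ring.ownerOf("some-key"));
    }
}
```

The property that matters under churn is that when a node leaves, only the keys it owned move to its successor; keys owned by other nodes stay where they are.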
Distributed tree data structure
In my project, I’m using somewhat a publisher/subscriber pattern.
I would like to have a tree data structure on my publisher. Every time I modify anything in the tree (be it a structural change or a modification of node values), the change set is published to its subscribers.
The subscribers each keep a local copy of the tree and update its internal structure when they receive a change set. When connecting to the publisher, a subscriber should first ask for a deep copy of the entire tree.
Does anybody know about an existing java library which does the above?
Okay, since I have found a “solution” in the meantime, I’ll just post the general thoughts here.
I created a class implementing the java.util.Map interface. Let’s call this class DMPublisher<K, V>, where K is the key type and V is the type of the values in the map.
There is another class that also implements the Map interface, DMSubscriber<K, V>.
The publisher class starts listening on a socket until it is closed explicitly. The subscriber connects to this socket when created.
The publisher class has a real hash map as an attribute, which it uses as a cache. For every mutating method on the publisher, the corresponding item in the cache is changed accordingly. The update is then sent to all the subscribers through the socket mentioned above.
Besides the Map interface, the subscriber also implements the observer pattern, so listeners can be attached to it. All updates received from the socket are dispatched to these listeners.
Upon connection, the publisher sends all the data currently being held in the cache to its subscribers in order to have a consistent state.
Both classes are synchronized for multi threading support.
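For readers who want a feel for the design, here is a rough in-process sketch of the idea. It is not my actual source code: the names ReplicatedMapSketch, Publisher, Subscriber, and Change are hypothetical, the socket is replaced by direct method calls, the classes do not implement the full java.util.Map interface, and null values are assumed to be unsupported because null marks a removal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process sketch of the publisher/subscriber map; no sockets involved.
class ReplicatedMapSketch {
    // One change set entry; a null value means "remove this key".
    static final class Change<K, V> {
        final K key;
        final V value;
        Change(K key, V value) { this.key = key; this.value = value; }
    }

    static final class Subscriber<K, V> {
        private final Map<K, V> copy = new HashMap<>();

        // Initial full state, sent once when the subscriber connects.
        synchronized void onSnapshot(Map<K, V> snapshot) {
            copy.clear();
            copy.putAll(snapshot);
        }

        // Incremental update received after the snapshot.
        synchronized void onChange(Change<K, V> change) {
            if (change.value == null) copy.remove(change.key);
            else copy.put(change.key, change.value);
        }

        synchronized Map<K, V> view() { return new HashMap<>(copy); }
    }

    static final class Publisher<K, V> {
        private final Map<K, V> cache = new HashMap<>();
        private final List<Subscriber<K, V>> subscribers = new ArrayList<>();

        // Send the current cache first so the new subscriber starts consistent.
        synchronized void connect(Subscriber<K, V> subscriber) {
            subscriber.onSnapshot(new HashMap<>(cache));
            subscribers.add(subscriber);
        }

        synchronized void put(K key, V value) {
            cache.put(key, value);
            broadcast(new Change<>(key, value));
        }

        synchronized void remove(K key) {
            cache.remove(key);
            broadcast(new Change<>(key, null));
        }

        private void broadcast(Change<K, V> change) {
            for (Subscriber<K, V> s : subscribers) s.onChange(change);
        }
    }

    public static void main(String[] args) {
        Publisher<String, Integer> publisher = new Publisher<>();
        publisher.put("a", 1);
        Subscriber<String, Integer> subscriber = new Subscriber<>();
        publisher.connect(subscriber);
        publisher.put("b", 2);
        System.out.println(subscriber.view()); // contains a=1 and b=2
    }
}
```

Because connect and the mutating methods are synchronized on the publisher, a subscriber cannot miss an update between receiving the snapshot and being added to the subscriber list.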
If anyone is interested in the source code, feel free to contact me.