Demystifying the D.A.G.G.E.R. Testnet Results

Introduction

The initial phase of the Testnet for GenesysGo's Directed Acyclic Gossip Graph Enabling Replication Protocol (D.A.G.G.E.R.) stands as a remarkable testament to the synergy of collaboration and innovation that’s possible in Web3. With operators from across the globe dedicating their resources, we've been able to simulate real-world network conditions and gather valuable data. Our heartfelt thanks go out to everyone who has contributed their time and compute to this crucial stage of our project. The benchmarks captured through versions 0.2.0 - 0.3.1 have achieved compelling marks for decentralized network performance, as detailed in our comparative analysis with Filecoin. But numbers alone mean little if we don't communicate their importance. To better discuss the strides we've made in this first testnet phase, it's essential to understand the "why" and "what" behind each performance metric. So, let's dive into easy-to-understand explanations of the D.A.G.G.E.R. Testnet's benchmarks and make them more accessible to everyone.

Maximizing Throughput & Peak TPS

The network's peak transaction throughput has been recorded at an astounding 50,000 transactions per second (TPS) on a 32-core machine with 256GB of RAM. This benchmark, achieved under optimal conditions across clusters of 5, 10, and 15 Equinix m3.large instances, demonstrates the raw power of the D.A.G.G.E.R. protocol when operating in an ideal network environment.

Knowing the network's peak transaction performance is crucial because it tells us what the system is capable of when conditions are just right. Just as a sports car is tested for its maximum speed to understand its potential, we push D.A.G.G.E.R. to its limits to see how fast it can go (processing transactions). And just as a sports car should be tested on a racetrack, we placed D.A.G.G.E.R. on the datacenter equivalent of a network racetrack. It’s important to note this test was about raw transaction processing and, therefore, executed only transactions - not file ingestion. We want to be clear in communicating the difference between processing transactions only and processing transactions that are also ordered for file uploads (which we will discuss).

Realizing a high-performance ceiling in ideal conditions is a way of validating the inner workings of the code and algorithms and their impact on hardware resources. It’s how we observe the consensus protocol in action without all the noise from the real world. It’s also a great way to see what the bottleneck is for the consensus engine itself in a perfect scenario. The takeaway is that the upper limit of consensus performance is so high that it is unlikely to become an issue before the more expected, more demanding aspects of uploads, file downloads, and sharing (synchronizing) data across all the node operators. Since D.A.G.G.E.R. has a metadata ledger (the DAG with all the information about where files are stored) and the Shard database (the actual fragments of files themselves), both will need to be shared across all operators while maintaining consensus.
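
To make that distinction concrete, here is a minimal, hypothetical sketch of the two data sets every operator ultimately replicates. The type and field names below are illustrative only and are not taken from the D.A.G.G.E.R. codebase:

```rust
use std::collections::HashMap;

// Illustrative only: a file's entry in the metadata ledger, recording
// where its erasure-coded shards live. Field names are hypothetical.
#[allow(dead_code)]
struct FileRecord {
    file_id: [u8; 32],          // content hash of the original file
    shard_ids: Vec<[u8; 32]>,   // hashes of the erasure-coded shards
    shard_holders: Vec<String>, // operator IDs storing each shard
}

// Illustrative only: the two data sets every operator keeps in sync.
struct OperatorState {
    // The metadata ledger: small records, replicated through consensus.
    metadata_ledger: Vec<FileRecord>,
    // The shard database: the actual file fragments, much larger,
    // synchronized alongside (but distinct from) the ledger.
    shard_db: HashMap<[u8; 32], Vec<u8>>,
}

fn main() {
    let state = OperatorState {
        metadata_ledger: Vec::new(),
        shard_db: HashMap::new(),
    };
    println!(
        "ledger entries: {}, shards held: {}",
        state.metadata_ledger.len(),
        state.shard_db.len()
    );
}
```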

Real-World Resilience

While it's great to know how fast the network can go under perfect conditions, it's even more important to understand how it performs in day-to-day reality—when things are not perfect. After all, we don't drive our sports car on a racetrack every day. The ability to handle a lot of information swiftly, even when the network is congested, mirrors how well a distributed system can serve its users during peak demand. 

This is why our D.A.G.G.E.R. Hammer Testnet front-end has both a key press game (simulating transactions) and a Dropbox-style file uploader. The Dropbox-style uploader allows anyone in the world to toss files into the D.A.G.G.E.R. testnet to simultaneously create transactions for data uploading while hammering keys to produce additional transaction overhead. While this public randomness ensues, we have also been running internal scripts that hammer the same cluster in intervals as part of our stress testing plans. This was the chaos we were hoping for, and we were able to capture it successfully.

Under sustained, real-world conditions (Testnet phase 1), the network has demonstrated impressive resilience, handling an average of approximately 3,000 transactions per second. However, during periods of peak demand, or "surge" scenarios, the D.A.G.G.E.R. network is capable of handling 20,000 to 38,000 transactions per second, depending on the configuration and adjustments present at the time. Surge tests are typically blasts of anywhere from one to five million transactions within one minute, which expose degradations we can then study. This performance showcases the robustness of D.A.G.G.E.R. as it operates with a dynamic node cluster size of 20-30 independent operators while managing transactions and file uploads amidst operators joining and leaving.
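
As a rough back-of-the-envelope reading of those surge numbers (our own arithmetic, not an official benchmark formula), a blast can be translated into an offered load in TPS and compared against the sustained and surge figures above:

```rust
fn main() {
    // Surge blasts: one to five million transactions within one minute.
    let blast_sizes = [1_000_000.0_f64, 5_000_000.0];
    let window_secs = 60.0;

    for txs in blast_sizes {
        let offered_tps = txs / window_secs;
        println!("{txs:>9.0} txs in {window_secs}s ≈ {offered_tps:>6.0} TPS offered");
    }
    // Roughly 16,700 to 83,300 TPS offered, versus the 20,000-38,000 TPS
    // absorbed in surge scenarios -- the gap is what exposes the
    // degradations we then study.
}
```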

Consensus performance within this more realistic testing environment continues to be exceptionally high. As we move forward through future testnets, with an increase in the operator count and in the size and frequency of file uploads/downloads, we anticipate these figures will more accurately represent the resilience and capacity of D.A.G.G.E.R. under real-world use.

Trial by Fire

As the D.A.G.G.E.R. Testnet evolves, we will continue making enhancements but also gain deeper insights into the network's performance under various conditions by setting it on fire. This is part of testnet phase 1 - breaking things. We have gathered more nuanced performance metrics during these trials, so let’s break them down with some explanation as to why they’re important.

Network Synchronization and Finality

The time required to download a snapshot and catch up during high TPS has been impressively brief, taking just seconds to complete. Snapshots are like tiny starter kits that teach nodes just enough about the network to be able to join. Once they join, each node retrieves the entire metadata ledger and offers storage space for shards (synchronizing). By virtue of how we designed the architecture, our snapshots are currently small and quickly downloadable, and operators are able to join quickly.

In an ideal network setting, the link between a node serving a snapshot and a node receiving a snapshot is incredibly fast. Our findings reveal that a snapshot size of 1GB can be transferred in a second or two over a 25Gbps uplink, followed by just a few seconds to unpack and catch up when using an Equinix m3.large instance.
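
The wire time itself is easy to sanity-check. A quick sketch of the arithmetic, assuming the 25Gbps link is the only bottleneck:

```rust
fn main() {
    let snapshot_gib = 1.0_f64; // snapshot size in GiB
    let link_gbps = 25.0_f64;   // uplink capacity in gigabits per second

    let bits = snapshot_gib * 8.0 * 1024.0_f64.powi(3); // GiB -> bits
    let seconds = bits / (link_gbps * 1e9);

    // ~0.34 s on the wire; unpacking and catching up add a few more seconds.
    println!("snapshot transfer ≈ {seconds:.2} s");
}
```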

However, this in itself presents an attack surface related to network churn - which is when too many operators join at once. We handle this by counting epochs (an elapsed amount of time) before an operator can join, limiting how many can join at once. This backpressure has tested well thus far, and we continue to settle on 5-6 epochs as a waiting cycle (which means anywhere between 30 minutes and 1 hour to rejoin the network).
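
As a rough illustration of the idea (not the actual Wield implementation), the backpressure can be expressed as a simple epoch counter: an operator that drops off must wait a fixed number of epochs before its join request is accepted.

```rust
// Illustrative sketch of epoch-based join backpressure; names are hypothetical.
const REJOIN_WAIT_EPOCHS: u64 = 6;

fn may_rejoin(current_epoch: u64, left_at_epoch: u64) -> bool {
    current_epoch.saturating_sub(left_at_epoch) >= REJOIN_WAIT_EPOCHS
}

fn main() {
    // An operator that left at epoch 100 is held back until epoch 106.
    assert!(!may_rejoin(103, 100));
    assert!(may_rejoin(106, 100));
    println!("operator admitted after waiting {REJOIN_WAIT_EPOCHS} epochs");
}
```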

Block synchronization times vary based on node latency, averaging between 30ms and 300ms. Think of block synchronization as synchronizing watches. Just as some watches sync faster than others, nodes (or digital watchers) in D.A.G.G.E.R. sync at different speeds, typically between the blink of an eye (30 milliseconds) and a slow clap of the hands (300 milliseconds). The average latency for a trip around the world over an internet cable is roughly 200 ms-400 ms, depending on the route and quality of the network (sometimes worse). Being within this median range is acceptable, considering the current Testnet operators are set up in locations including Canada, Germany, Tokyo, the United States, and others.

When it comes to block validation, which is the process of verifying that the information in the block is correct and follows the rules, some server hardware configurations (namely the CPU) are much faster than others. Block validation times range from sub-500 nanoseconds to 20ms (testing with AMD EPYC 7502Ps), and will vary depending on CPU capabilities. Block validation speed is measured as the time an independent Wield node takes to perform a validation check on a block after receiving it.
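
Measuring this is conceptually simple: a node times its own validation routine. A minimal sketch, where validate_block stands in for whatever checks a node actually performs and is not a real D.A.G.G.E.R. API:

```rust
use std::time::Instant;

// Hypothetical stand-in for a node's block validation checks
// (signatures, structure, protocol rules).
fn validate_block(block: &[u8]) -> bool {
    !block.is_empty()
}

fn main() {
    let block = vec![0u8; 1024];

    let start = Instant::now();
    let valid = validate_block(&block);
    let elapsed = start.elapsed();

    // On fast CPUs this lands well under a microsecond; heavier checks or
    // slower hardware push it toward the millisecond range.
    println!("valid: {valid}, validation took {elapsed:?}");
}
```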

Altogether, these figures highlight the efficiency of D.A.G.G.E.R.'s consensus mechanism and its ability to maintain a swift path to finality. Finality occurs when a transaction becomes an accepted part of the network's ledger, which involves several layers: transactions are verified to form blocks, blocks come together to create bundles, and finalized bundles signal a tick forward in the measurement of an epoch. An epoch in the D.A.G.G.E.R. network is a key metric we measure. It represents a complete cycle that includes numerous bundles – to be specific, 200 bundles form an epoch. This figure was determined through comprehensive testing, which involved experimenting with different configurations of transactions, blocks, and bundles. Understanding epochs helps us appreciate the rhythmic update cycles of the network's ledger, keeping data current and consistent across nodes spread around the globe.
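
The layering above can be pictured as a simple counter that ticks an epoch forward every 200 finalized bundles. This is an illustrative sketch of the bookkeeping, not the consensus code itself:

```rust
// Illustrative only: an epoch advances after a fixed number of bundles finalize.
const BUNDLES_PER_EPOCH: u64 = 200;

struct EpochClock {
    finalized_bundles: u64,
    epoch: u64,
}

impl EpochClock {
    fn on_bundle_finalized(&mut self) {
        self.finalized_bundles += 1;
        if self.finalized_bundles % BUNDLES_PER_EPOCH == 0 {
            self.epoch += 1; // the "tick forward" described above
        }
    }
}

fn main() {
    let mut clock = EpochClock { finalized_bundles: 0, epoch: 0 };
    for _ in 0..450 {
        clock.on_bundle_finalized();
    }
    // 450 finalized bundles => 2 complete epochs, 50 bundles into the third.
    println!("epoch: {}, bundles finalized: {}", clock.epoch, clock.finalized_bundles);
}
```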

The time it takes to reach this point of agreement, or finality, can vary. It can be as quick as 70 milliseconds in the most optimal conditions, and on average occurs in about 273 milliseconds. In some cases, such as when the network is unusually busy or the data is large, it might take up to 650 milliseconds. A real-world equivalent of ~273 milliseconds is roughly the time it takes for a human to blink. On average, a single blink occurs in just a quarter to a third of a second, roughly 250 to 330 milliseconds. This everyday action is so quick and natural that most people don't even notice the time it takes.

The geographic location of nodes, the size of the transactions, and real-time network conditions can all impact these times. D.A.G.G.E.R. is designed to handle this variability, ensuring that regardless of these factors, the state of the network is updated and agreed upon in a transparent and swift manner. This ability for all nodes to quickly get up to date with the finalized state is known as synchronization. It is crucial in a distributed system, and it means that everyone has the latest information simultaneously, whether they're making transactions, sharing files, or building an application that depends on highly available data.

Enhancing Data Operations

When it comes to data storage operations, the network's internal runtime to ingest and finalize a 1MiB file ranges from 0.1 seconds to 0.7 seconds, excluding external latencies. In simpler terms, 'ingesting' refers to the network initially receiving and processing a 1MiB file — akin to someone handing over a document to you. 'Finalizing' the file, on the other hand, is like stamping the document with a seal of approval, indicating that it's been verified and is now securely stored and ready for retrieval whenever needed.

Erasure coding is the splitting of a file into small encoded pieces for security, which are then spread around the network. The time to erasure code 1MiB of data is approximately 0.018ms per core (assuming a 3GHz processor), showcasing the tiny impact of this process on performance when files are small. However, it grows mostly linearly (with only slight improvement) as file size increases. That means for a 100MiB file we would expect to see slightly less than ~0.0018s (1.8ms) of erasure coding time. Keep in mind this is just one 3GHz thread on one machine, and machines have many more than one thread – and the network has many more than one node. Measuring these metrics is important for a data storage protocol since horizontally scaling this workload is critical - these milliseconds can add up if not handled well.
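
To make the scaling concrete, here is a rough single-thread estimate using the ~0.018ms/MiB figure above and treating growth as purely linear (slightly pessimistic, since we observe a small improvement at larger sizes):

```rust
fn main() {
    // Measured: ~0.018 ms to erasure code 1 MiB on one ~3GHz thread.
    let ms_per_mib = 0.018_f64;

    for size_mib in [1.0_f64, 100.0, 1024.0] {
        // Linear extrapolation on a single thread; real nodes spread this
        // work across many threads, and the network across many nodes.
        let single_thread_ms = size_mib * ms_per_mib;
        println!("{size_mib:>6.0} MiB ≈ {single_thread_ms:>7.3} ms on one thread");
    }
    // ~1.8 ms for 100 MiB and ~18 ms for 1 GiB on one thread -- small per
    // file, but it adds up at network scale, which is why horizontal
    // scaling matters.
}
```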

In terms of retrieval, a 1MiB file can be fetched in as little as 1-3 seconds using the D.A.G.G.E.R. Hammer Demo site, emphasizing the network's user-friendly data access and reconstruction process.

Scaling and Resource Utilization

The addition of new nodes to the network has not shown a noticeable negative impact on performance, maintaining bundle finalization times within expected ranges. Vertical scaling, through increased CPU and RAM, is anticipated to enhance throughput further. We are still in the process of capturing new metrics from the recent increase in the minimum CPU requirement (upped to 16 threads, which is what we mean by "vertical scaling").

To put bandwidth into some perspective, a single Wield node operator sustaining 20,000 user TPS would see an average CPU utilization of 34% (for a 3GHz CPU) and memory usage of around 4GB-6GB, which scales with the number of transactions processed. Bandwidth usage during these periods ranges from 150-300Mb/s, demonstrating the network's moderate resource consumption during high activity. These are TPS-only metrics, as we are still in the process of gathering more file upload and download metrics and will share our findings in the future.
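
One hedged way to read those bandwidth numbers is per transaction: dividing the observed range by the sustained 20,000 user TPS gives a rough wire cost per transaction (our own back-of-the-envelope estimate, not an official protocol figure):

```rust
fn main() {
    let tps = 20_000.0_f64;                  // sustained user transactions per second
    let bandwidth_mbps = [150.0_f64, 300.0]; // observed range in Mb/s

    for mbps in bandwidth_mbps {
        let bits_per_tx = mbps * 1e6 / tps;
        let bytes_per_tx = bits_per_tx / 8.0;
        println!("{mbps:>5.0} Mb/s at {tps:.0} TPS ≈ {bytes_per_tx:>5.0} bytes per tx");
    }
    // Roughly 0.9-1.9 KB of bandwidth per transaction, which includes gossip
    // and consensus overhead, not just the transaction payload itself.
}
```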

Conclusion

These advanced metrics from the D.A.G.G.E.R. Testnet provide a comprehensive view of the network's performance, from peak capabilities to sustained real-world conditions. As we continue to push the boundaries of what's possible with decentralized networks, these insights are invaluable in guiding our optimization efforts for both the core protocol and associated applications like shdwDrive.

By sharing these metrics in a distinct manner from our previous communications, we aim to offer a fresh perspective on the network's progress and potential. We also emphasize that most, if not all, of these metrics will continue to change and evolve as we improve D.A.G.G.E.R. and diligently pursue the most realistic network test conditions possible. The GenesysGo team remains dedicated to transparency and innovation as we prepare for the next phases of the D.A.G.G.E.R. Testnet, and we look forward to sharing more breakthroughs and learnings with everyone in our upcoming “D.A.G.G.E.R. Testnet Learnings Part 2” blog.