<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Sriram Madapusi Vasudevan]]></title><description><![CDATA[Musings]]></description><link>https://srirammv.dev/</link><image><url>https://srirammv.dev/favicon.png</url><title>Sriram Madapusi Vasudevan</title><link>https://srirammv.dev/</link></image><generator>Ghost 3.20</generator><lastBuildDate>Sun, 08 Mar 2026 10:09:22 GMT</lastBuildDate><atom:link href="https://srirammv.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Treatise]]></title><description><![CDATA[<p><strong>Some mental notes I’ve been formulating for myself—set down so they carry more weight than thoughts whizzing around my brain. No particular order.</strong></p><h3 id="learn-to-let-go">Learn to let go</h3><p>Get lost in the pursuit of excellence without fixating on where it leads. Find joy in the flow; no one is</p>]]></description><link>https://srirammv.dev/treatise/</link><guid isPermaLink="false">68b36908a2e3d819cc899ef9</guid><category><![CDATA[consciousness]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Sat, 30 Aug 2025 21:26:04 GMT</pubDate><content:encoded><![CDATA[<p><strong>Some mental notes I’ve been formulating for myself—set down so they carry more weight than thoughts whizzing around my brain. No particular order.</strong></p><h3 id="learn-to-let-go">Learn to let go</h3><p>Get lost in the pursuit of excellence without fixating on where it leads. Find joy in the flow; no one is born with a destination. Remember, excellence is not a choice. Incremental progress is acceptable as long as it's in service of one's ethos.</p><h3 id="be-a-relentless-learner">Be a relentless learner</h3><p>You don’t need immediate application. Identifying as a learner shifts your mentality and compounds future skill. Serendipity shows up when you tinker. An afternoon of action can spark outsized joy. <em>(Example: my <code>q-cli-neovim</code>plugin experiment: </em><a href="https://github.com/sriram-mv/q-cli-neovim">https://github.com/sriram-mv/q-cli-neovim</a><em>)</em></p><h3 id="zero-ego">Zero ego</h3><p>Don’t boast about humility. Embody it. Live it.</p><h3 id="define-your-goals">Define your goals</h3><p>Define goals; don’t be bound by them. Wander, but stay directionally aligned.</p><h3 id="radiate-calm">Radiate calm</h3><p>Relax. Look around. Make a call. —Jocko Willink<br>Calm comes from being fully comfortable as yourself. There is only one you, and the identity evolves.</p>]]></content:encoded></item><item><title><![CDATA[Moving to the next "Ergo" keyboard]]></title><description><![CDATA[<p>I have been on the Kinesis advantage 2 train for the last 2 years and decided to upgrade to the Kinesis Advantage 360 in hopes of better usability, since I really liked the split layout and the option to have bluetooth. </p><p>So far, I'm liking it. 
I have the Kalih</p>]]></description><link>https://srirammv.dev/moving-to-the-next-ergo-keyboard/</link><guid isPermaLink="false">686aec01a2e3d819cc899ed8</guid><category><![CDATA[keyboards]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Sun, 06 Jul 2025 21:40:36 GMT</pubDate><content:encoded><![CDATA[<p>I have been on the Kinesis advantage 2 train for the last 2 years and decided to upgrade to the Kinesis Advantage 360 in hopes of better usability, since I really liked the split layout and the option to have bluetooth. </p><p>So far, I'm liking it. I have the Kalih Silent Quiet Pink switches on, which means I don't have to bottom out keys anymore and the strain on my wrists seems to be lower. It's also a quieter experience :) <br><br>I need to learn about the layers, but already did some basic remaps through the "Clique" software vended by Kinesis.</p><p>Delete -&gt; Control</p><p>L-Ctrl -&gt; Command</p><p>I'll probably do a few more remaps so that I can trigger my tmux prefix key in one go.</p><p>Overall, I'm looking forward to broader shoulders while working. If only, I had a trackball stitched on this thing, I could do away with my mouse too.</p>]]></content:encoded></item><item><title><![CDATA[3 Things that I see happening in the world of Agents]]></title><description><![CDATA[<p>These are thoughts that have been brewing for me, and I wanted to take a moment to write it down as a stream of consciousness. They are based on my interactions with Agentic AI - i.e Amazon Q CLI.</p><h2 id="don-t-wait-on-perfecting-your-prompt">Don't wait on perfecting your prompt</h2><p>Just start building. This</p>]]></description><link>https://srirammv.dev/3-things-that-i-see-happening-in-the-world-of-agents/</link><guid isPermaLink="false">685cc9a2a2e3d819cc899e51</guid><category><![CDATA[ai]]></category><category><![CDATA[agents]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Thu, 26 Jun 2025 04:40:18 GMT</pubDate><content:encoded><![CDATA[<p>These are thoughts that have been brewing for me, and I wanted to take a moment to write it down as a stream of consciousness. They are based on my interactions with Agentic AI - i.e Amazon Q CLI.</p><h2 id="don-t-wait-on-perfecting-your-prompt">Don't wait on perfecting your prompt</h2><p>Just start building. This is something that I see often out in the wild on coming up with the perfect list of rules, writing up the immaculate <code>README.md</code> file to submit as context, and what not. I'd say go the opposite route. Build and learn prompting along the way. Yes, there may be best practices on this, but one needs to understand why those are best practices. Thankfully, in the age of agentic AI, the feedback loop is so short that you can see the patterns emerging right before your eyes. Use those learnings, embody them, and then - and only then - do you condense them into rules.</p><h2 id="be-a-hawk">Be a hawk</h2><p>We are not yet in the outlandish world of building everything with 1-shot prompts. Yes, they may get you 80% of the way there, but the last 20% is going to take inordinate amounts of time. Focusing on solvable problems first and trying to spot patterns that the LLM is taking is going to be extremely rewarding as these AI agent systems age. They will get better, but so will your instincts. Do not try to submit a prompt and go drink coffee, at least not till you feel you have developed an intuition for how these things work. 
One can also limit the blast radius by reducing the permissions of what the agent can do or the kind of credentials you give it access to.</p><h2 id="agents-get-stuck-too">Agents get stuck too</h2><p>There are a number of ways in which agents get stuck: 1/ Context window limits are reached 2/ They encounter an interactive prompt 3/ Launching a process takes over the foreground 4/ Reasoning Collapse 5/ Too many tools available via MCP (Model Context Protocol). There is no panacea for all of the above; some of it is solvable through prompting. ‌</p><p>‌One interesting way that I have been dealing with this, especially with running multiple instances of these terminal agents, is to watch for notifications (🔔 - Yes, the terminal bell). Any time I do not hear back within a meaningful amount of time. It's time to check on the progress being made and redirect if need be.</p><p>You'll be hearing more from me on this topic soon!</p>]]></content:encoded></item><item><title><![CDATA[Optimizing Message Throughput in High-Volume Queue Systems: Lessons from the Trenches]]></title><description><![CDATA[<p>In large-scale data ingestion systems, small architecture choices can have <em>dramatic</em> performance implications. </p><p>During my time at AWS CloudWatch, we were in the midst of a migration from our legacy metric stack to a spanky new one. I was the on call engineer as our alarms blared: end-to-end latency spikes</p>]]></description><link>https://srirammv.dev/optimizing-message-throughput-in-high-volume-queue-systems-lessons-from-the-trenches/</link><guid isPermaLink="false">681faa21a2e3d819cc899d0b</guid><category><![CDATA[distributed computing]]></category><category><![CDATA[large scale ingestion systems]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Sat, 10 May 2025 20:06:00 GMT</pubDate><content:encoded><![CDATA[<p>In large-scale data ingestion systems, small architecture choices can have <em>dramatic</em> performance implications. </p><p>During my time at AWS CloudWatch, we were in the midst of a migration from our legacy metric stack to a spanky new one. I was the on call engineer as our alarms blared: end-to-end latency spikes had breached a critical threshold. A quick partitioning tweak later, those noise-making spikes vanished and throughput climbed 30% on the same hardware. 
In this deep-dive, you’ll see exactly how I diagnosed a flawed “uniform message” assumption and turned it into high-volume reliability.</p><h2 id="the-system-architecture">The System Architecture</h2><p>The data pipeline processed messages from a number of queues, each of which had its own priority setting.</p><p>The architecture looked like below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://srirammv.dev/content/images/2025/05/architecture.jpg" class="kg-image" alt srcset="https://srirammv.dev/content/images/size/w600/2025/05/architecture.jpg 600w, https://srirammv.dev/content/images/size/w1000/2025/05/architecture.jpg 1000w, https://srirammv.dev/content/images/size/w1536/2025/05/architecture.jpg 1536w"><figcaption>Queue processing architecture</figcaption></figure><p>The distributed queue consumer worked with a simple algorithm.</p><ol><li>Poll through queues in listed priority order and read available messages.</li><li>Add messages to an internal processing queue till the buffer reaches maximum capacity.</li><li>Flush the buffer.</li></ol><p>Long polling was employed for higher priority queues to ensure more messages were picked up to minimize end-end latency, whereas the lower priority queues were polled for a shorter period of time in order to prevent priority inversion. This was crucial for maintaining SLA of the broader data pipeline.</p><h2 id="message-processing-flow">Message Processing Flow</h2><p><br>Messages in this system usually fit into two categories of operations:</p><ul><li><strong>Adding</strong> new items to an index</li><li><strong>Deleting</strong> items from an index.</li></ul><p>These messages would be read by the queue consumer, processed, and then sent to downstream systems for indexing with the results. Simple and clear-cut—or so it appeared.</p><h3 id="problem-message-size-variance">Problem: Message Size Variance</h3><p>Everything functioned as anticipated during early production deployment and initial testing. As scale grew, though, we started to see occasional end-to-end latency spikes, especially at the top of every hour when some message types would flood. Extensive research revealed a basic assumption ingrained in our design: we had <strong>assumed messages across several queues would be approximately same in size</strong>. Actually, delete messages were far bigger than add ones.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://srirammv.dev/content/images/2025/05/batch-size-per-message.jpg" class="kg-image" alt srcset="https://srirammv.dev/content/images/size/w600/2025/05/batch-size-per-message.jpg 600w, https://srirammv.dev/content/images/size/w1000/2025/05/batch-size-per-message.jpg 1000w, https://srirammv.dev/content/images/size/w1536/2025/05/batch-size-per-message.jpg 1536w"><figcaption>Delete batch sizes were much larger and spiky in nature.</figcaption></figure><p>This size difference set off a chain reaction:</p><ol><li>Processing larger messages (deletes) took more time.</li><li>Many large delete messages arriving simultaneously would mean that the buffer filled with more deletes instantly before cycling back to a higher priority queue.</li><li>This resulted in periodic latency spikes in "lumpy" processing patterns.</li></ol><p>Most troubling was that these processing imbalances were really countering our priority queue consumer design. 
During peak times, large, low-priority messages were practically crowding out smaller, higher-priority ones.</p><h3 id="root-cause-mental-model-mismatch">Root Cause: Mental Model Mismatch</h3><p>The queue consumer filled its buffer under the assumption that all messages had a uniform distribution in the operations. In reality,<strong> the number of operations contained within each message varied dramatically</strong> - some delete messages contained 50+ operations, while add messages typically contained 1-5 operations. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://srirammv.dev/content/images/2025/05/mental_model_mismatch.png" class="kg-image" alt srcset="https://srirammv.dev/content/images/size/w600/2025/05/mental_model_mismatch.png 600w, https://srirammv.dev/content/images/size/w1000/2025/05/mental_model_mismatch.png 1000w, https://srirammv.dev/content/images/size/w1024/2025/05/mental_model_mismatch.png 1024w"><figcaption>Assumed Traffic Shape vs Real Traffic Shape</figcaption></figure><p>This mismatch between our mental model (messages as atomic units of work) and reality (messages as variable-sized containers of work) was the manifestation of the performance bottleneck.</p><h3 id="solution-message-partitioning-and-normalization">Solution: Message Partitioning and Normalization</h3><p>The analysis of message size distributions indicated that normalization is necessary to ensure uniform processing features regardless of the message arrival rate.</p><p>The next steps were clear:</p><ul><li>Split oversized messages into smaller, consistently sized chunks while maintaining message integrity.</li></ul><h3 id="tuning-finding-the-optimal-threshold">Tuning: Finding the Optimal Threshold</h3><p>However, the core challenge was selecting a threshold such that below conditions were satisfied.</p><ol><li>Minimizes splits for large <em>add</em> operations (which are rare)</li><li>Maximizes splits for <em>delete</em> operations (which are large and batched at the top of the hour)</li></ol><p>Enter, statistical analysis! </p><!--kg-card-begin: markdown--><p>Let:</p>
<ul>
<li>\( A \): Add message size distribution</li>
<li>\( D \): Delete message size distribution</li>
<li>\( P_{95}(A) \): 95th percentile of adds (95% ≤ this value)</li>
<li>\( P_{5}(D) \): 5th percentile of deletes (95% ≥ this value)</li>
</ul>
<p>Optimal Threshold:</p>
<p>\[ T = \max\left(P_{95}(A),\ P_{5}(D)\right) \]</p>
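<p><em>A minimal sketch (not from the original post) of how such a threshold could be derived from observed message sizes; the sample numbers and variable names are purely illustrative:</em></p><pre><code class="language-python">import numpy as np

# Observed operation counts per message (assumed sample data for illustration).
add_sizes = np.array([1, 2, 1, 3, 5, 2, 4, 1, 2, 3])
delete_sizes = np.array([40, 55, 62, 48, 75, 90, 51, 66])

p95_add = np.percentile(add_sizes, 95)      # 95% of add messages are at or below this
p5_delete = np.percentile(delete_sizes, 5)  # 95% of delete messages are at or above this

# Threshold from the formula above: messages carrying more operations than T
# get split upstream into smaller, consistently sized chunks.
threshold = max(p95_add, p5_delete)
print(f"P95(A)={p95_add:.1f}, P5(D)={p5_delete:.1f}, T={threshold:.1f}")</code></pre>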
<!--kg-card-end: markdown--><p>Note, this partitioning needs to be implemented upstream. This way the core algorithm for the priority queue consumer does not need to be changed and we maintain separation of concerns overall.</p><h3 id="results-dramatic-performance-improvement">Results: Dramatic Performance Improvement</h3><p>The impact of this seemingly simple change was profound.</p><p>After implementing message partitioning:</p><ol><li>Processing became more consistent across all message types.</li><li>The hourly latency spikes disappeared entirely.</li><li>System throughput improved by approximately 30%.</li></ol><p>Most importantly, batches sent to downstream systems now contained a more balanced mix of operations, even during peak periods. The system maintained its priority guarantees while eliminating the processing bottlenecks caused by message size variance.</p><p>Anecdotally, delete messages were now split into 3+ smaller messages, while add messages were rarely split. This normalization of message sizes ensured that the queue consumer was working with much more uniform units of work.</p><h2 id="broader-applications-for-large-scale-ingestion-systems">Broader Applications for Large-Scale Ingestion Systems</h2><p>While this specific solution addressed a particular issue in a search indexing system, the principles apply broadly to large-scale ingestion systems:</p><ol><li><strong>Question assumptions about uniformity</strong>: Many system designs assume uniform processing characteristics that don't hold at scale. Identifying and challenging these assumptions is crucial.</li><li><strong>Look for normalization opportunities</strong>: Normalizing work units (whether message size, processing time, or resource consumption) can dramatically improve predictability and throughput.</li><li><strong>Use data to guide partitioning</strong>: The specific thresholds for our partitioning logic came from actual production data. This data-driven approach ensured we were optimizing for real-world conditions, not theoretical scenarios.</li><li><strong>Solve problems at the right layer</strong>: By implementing partitioning upstream from the queue consumer, we avoided complicating the core processing logic.</li><li><strong>Think in terms of work units</strong>: Rather than treating messages as atomic units, conceptualize in terms of discrete work units. This mental model opens up opportunities for optimization.</li></ol><h2 id="monitoring-and-metrics">Monitoring and Metrics</h2><p>To even have data to come up with above hypothesis, I recommend tracking these key metrics:</p><ol><li><strong>Message size distribution</strong>: Understand the variance in your message sizes across different message types.</li><li><strong>Processing time per message</strong>: Identify correlations between message size and processing time.</li><li><strong>Queue depth over time</strong>: Detect patterns in how queues build up and drain.</li><li><strong>End-to-end latency</strong>: The ultimate indicator of system health.</li></ol><p>With these metrics in place, you can identify size-related bottlenecks and determine appropriate thresholds for message partitioning.</p><h2 id="conclusion">Conclusion</h2><p>Building high-performance, large-scale ingestion systems requires moving beyond textbook approaches and adapting to real-world complexities. 
The message partitioning solution I've described exemplifies how seemingly small optimizations can have outsized impacts on system performance.</p><p>What makes this approach particularly powerful is its simplicity and broad applicability. You don't need complex algorithms or expensive resources to implement message partitioning-just a clear understanding of your workload characteristics and a willingness to challenge assumptions.</p><p>If you're facing similar challenges with uneven processing in your high-volume queue systems, I encourage you to consider whether message partitioning might be the right solution for you.</p>]]></content:encoded></item><item><title><![CDATA[Vibe Coding: A Misunderstood Approach to Software Development]]></title><description><![CDATA[<p>Vibe coding has received unwarranted criticism recently, with many dismissing it as a gimmicky approach to software creation. In reality, it's an effective method to accelerate development while maintaining good practices throughout the process.</p><h3 id="parallelism">Parallelism</h3><p>Run 4 instances of Claude Code or Amazon Q chat simultaneously, assigning each to work</p>]]></description><link>https://srirammv.dev/vibe-coding-is/</link><guid isPermaLink="false">68199098a2e3d819cc899c87</guid><category><![CDATA[ai]]></category><category><![CDATA[agents]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Tue, 06 May 2025 04:52:43 GMT</pubDate><content:encoded><![CDATA[<p>Vibe coding has received unwarranted criticism recently, with many dismissing it as a gimmicky approach to software creation. In reality, it's an effective method to accelerate development while maintaining good practices throughout the process.</p><h3 id="parallelism">Parallelism</h3><p>Run 4 instances of Claude Code or Amazon Q chat simultaneously, assigning each to work on specific features based on a well-designed specification. Leverage AI to help enhance your specifications with detailed requirements. Remember that the quality of input directly determines the quality of output. Even advanced models like Claude 3.7 with Thinking mode have limitations - AI complements but doesn't replace original human thought.</p><h3 id="defined-feedback-loops-through-tests">Defined Feedback Loops through tests</h3><p>After defining specifications, write tests upfront to codify how modularity is enforced. These unit tests establish the foundation of your defined user experience. Once you've set up the tests and configured how to invoke them, you can allow AI to complete the implementation. Well-defined feedback loops substantially increase your chances of success. </p><p><strong>Important: </strong>Explicitly instruct AI systems not to modify tests and to focus solely on creating source code.</p><h3 id="system-prompts-and-guardrails">System Prompts and Guardrails</h3><p>Develop a comprehensive contributing guide for your repository, including a README and guidelines on how software should be written for your team. This increases the likelihood that AI-generated code will stylistically match<br>your existing codebase. Enhance this with system prompts and clear guardrails specifying what practices to avoid.</p><h3 id="spot-bad-patterns">Spot Bad Patterns</h3><p>Incorporate a reflection step into your AI workflow by maintaining a troubleshooting guide in your repository. This helps the AI break out of problematic patterns or loops where it might otherwise get stuck.</p><h3 id="interrupt">Interrupt</h3><p>Timely interruption is crucial. 
Treat your AI assistant as a collaborator you're brainstorming with, and provide immediate guidance when you notice it heading in an undesirable direction.</p><h3 id="security-first">Security-First</h3><p>Include security requirements in your initial specifications and prompt AI to highlight potential security concerns in its implementations. Have AI explicitly document security considerations for each feature it develops. Regularly scan AI-generated code with security tools and conduct threat modeling sessions to identify vulnerabilities that might have been introduced.</p><h3 id="version-control">Version Control</h3><p>If you are to let AI loose on making changes for features, remember to use version control and specifically commit changes that are for a given feature. This allows you "the orchestrator" to retain control and find exact cause of issues in the future. ala <code>git bisect</code></p><!--kg-card-begin: html--><div class="tenor-gif-embed" data-postid="8220656" data-share-method="host" data-aspect-ratio="1.29675" data-width="100%"><a href="https://tenor.com/view/tom-and-jerry-orchestra-gif-8220656">Tom And Jerry Orchestra GIF</a>from <a href="https://tenor.com/search/tom+and+jerry-gifs">Tom And Jerry GIFs</a></div> <script type="text/javascript" async src="https://tenor.com/embed.js"></script><!--kg-card-end: html--><p></p><p>Happy Orchestrating!</p>]]></content:encoded></item><item><title><![CDATA[Threading the Needle: A Quirky Guide to Python's Concurrent Programming Tools]]></title><description><![CDATA[<p>I was trying to see what were the threading primitives that python had to offer and thought it would be an interesting blog post to use them all.</p><p>Let's write a quirky story for each.</p><h2 id="thread-the-parallel-pizza-party">Thread: The Parallel Pizza Party</h2><p>Imagine you're hosting a pizza party, but you only have</p>]]></description><link>https://srirammv.dev/threading-the-needle/</link><guid isPermaLink="false">5ef7cc9cdb404c11c98e80a4</guid><category><![CDATA[python]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Sun, 27 Apr 2025 05:25:28 GMT</pubDate><content:encoded><![CDATA[<p>I was trying to see what were the threading primitives that python had to offer and thought it would be an interesting blog post to use them all.</p><p>Let's write a quirky story for each.</p><h2 id="thread-the-parallel-pizza-party">Thread: The Parallel Pizza Party</h2><p>Imagine you're hosting a pizza party, but you only have one phone to call for delivery. In the old days, you'd call one pizza place, wait for the delivery, then call the next place. Your guests would be waiting forever!</p><p>With threads, it's like having multiple phones. You can call "Pizza Palace" for pepperoni and "Cheesy Delights" for cheese pizza simultaneously. Both orders are processed at the same time, and your pizzas arrive much faster.</p><p>In our code, each thread is like a separate phone call, downloading a different file simultaneously. The <code>.join()</code> method is like waiting for all pizza deliveries to arrive before starting the party.</p><pre><code class="language-python">import threading
import time

def download_file(file_name):
    print(f"Starting download of {file_name}...")
    # Simulate file download
    time.sleep(2)
    print(f"Download of {file_name} complete!")

# Create threads for downloading multiple files simultaneously
thread1 = threading.Thread(target=download_file, args=("data1.csv",))
thread2 = threading.Thread(target=download_file, args=("data2.csv",))

# Start the threads
thread1.start()
thread2.start()

# Wait for both downloads to complete
thread1.join()
thread2.join()

print("All downloads completed!")</code></pre><h2 id="lock-the-single-bathroom-dilemma">Lock: The Single Bathroom Dilemma</h2><p>Picture a house party with only one bathroom. Without any system, chaos ensues as people try to use it simultaneously (awkward!). The solution? A simple lock on the door.</p><p>When someone needs the bathroom, they check if it's available. If it is, they lock the door, do their business, and then unlock it when they leave. If it's occupied, they wait until it's free.</p><pre><code class="language-python">import threading
import time

# Shared resource - bank account
account_balance = 1000
lock = threading.Lock()
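
# (Added usage sketch, not part of the original post) Two concurrent withdrawals;
# without the lock inside make_withdrawal below, both threads could pass the
# balance check before either deducts. This would run after the definition, e.g.:
#   t1 = threading.Thread(target=make_withdrawal, args=(600,))
#   t2 = threading.Thread(target=make_withdrawal, args=(600,))
#   t1.start(); t2.start(); t1.join(); t2.join()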

def make_withdrawal(amount):
    global account_balance
    
    # Acquire lock before accessing shared resource
    lock.acquire()
    try:
        if account_balance &gt;= amount:
            # Simulate processing time
            time.sleep(0.1)
            account_balance -= amount
            print(f"Withdrew ${amount}. Remaining balance: ${account_balance}")
        else:
            print(f"Failed to withdraw ${amount}. Insufficient funds.")
    finally:
        # Always release the lock
        lock.release()</code></pre><h2 id="rlock-the-nested-meeting-room">RLock: The Nested Meeting Room</h2><p>Imagine you're a manager who books a conference room for a team meeting. During that meeting, you realize you need a private conversation with one team member, so you book the same room for a one-on-one right after. With a regular lock, you'd have to end the team meeting, release the room, and then re-book it. That's inefficient!</p><p>An RLock (Reentrant Lock) is like having special booking privileges - you can "book" the same room multiple times without releasing it first, as long as you're the one who booked it originally.</p><pre><code class="language-python">import threading

class FileManager:
    def __init__(self):
        self.lock = threading.RLock()  # Reentrant lock
        self.file_data = {}
    
    def update_file(self, file_name, content):
        with self.lock:
            print(f"Updating {file_name}")
            # This method calls another method that also acquires the lock
            self._write_to_file(file_name, content)
    
    def _write_to_file(self, file_name, content):
        # With RLock, this can acquire the lock again without deadlock
        with self.lock:
            self.file_data[file_name] = content
            print(f"Written to {file_name}: {content}")</code></pre><p>In our code, the <code>update_file</code> method acquires the lock, then calls <code>_write_to_file</code>, which also tries to acquire the lock. With an <code>RLock</code>, this works fine because the same thread can acquire the lock multiple times.</p><h2 id="condition-the-coffee-shop-conundrum">Condition: The Coffee Shop Conundrum</h2><p>Picture a busy coffee shop with baristas (producers) making coffee and customers (consumers) waiting for their orders. When there are no coffees ready, customers wait. When a coffee is ready, the barista calls out "Order up!" and a waiting customer grabs it. If all the pickup counter is full, baristas wait until there's space before making more coffees.</p><p>A Condition variable is like this coffee shop system - it allows threads to wait until a specific condition is met, and then be notified when it is.</p><pre><code class="language-python">import threading
import time
import random

# Simulate a producer-consumer pattern for a message queue
message_queue = []
MAX_QUEUE_SIZE = 5
condition = threading.Condition()
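
# (Added sketch, not in the original post) A matching consumer for the producer
# defined below; the structure is an assumption about how it would be written.
def consumer():
    for _ in range(10):
        with condition:
            # Wait while there is nothing to pick up
            while not message_queue:
                print("Queue empty, consumer waiting...")
                condition.wait()

            message = message_queue.pop(0)
            print(f"Consumed: {message}")

            # Let the producer know there is room in the queue again
            condition.notify()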

def producer():
    global message_queue
    for i in range(10):
        # Acquire the condition lock
        with condition:
            # Wait if the queue is full
            while len(message_queue) &gt;= MAX_QUEUE_SIZE:
                print("Queue full, producer waiting...")
                condition.wait()
            
            # Add a message to the queue
            message = f"Message-{i}"
            message_queue.append(message)
            print(f"Produced: {message}")
            
            # Notify consumers that a new message is available
            condition.notify()</code></pre><p>In our code, the producer waits when the queue is full and notifies consumers when a new message is available. Consumers wait when the queue is empty and notify producers when they've consumed a message, creating a coordinated dance of production and consumption.</p><h2 id="semaphore-the-limited-pool-passes">Semaphore: The Limited Pool Passes</h2><p><br>Imagine a community pool with only three swimming lanes. The lifeguard gives out exactly three passes. When you want to swim, you must get a pass. If all passes are taken, you wait until someone finishes swimming and returns their pass.</p><p>A Semaphore is like this pass system - it allows a specific number of threads to access a resource simultaneously.</p><pre><code class="language-python">import threading
import time
import random

# Simulate a connection pool with limited connections
class DatabaseConnectionPool:
    def __init__(self, max_connections=3):
        # BoundedSemaphore ensures we never release more than we acquire
        self.connection_semaphore = threading.BoundedSemaphore(max_connections)
        self.connections = [f"Connection-{i}" for i in range(max_connections)]
        self.lock = threading.Lock()</code></pre><p>In our code, the <code>BoundedSemaphore</code> ensures that only three database connections can be used at once. If all connections are in use, new requests wait until a connection becomes available. The "bounded" part ensures we never accidentally create more than three connections, just like the lifeguard would never hand out a fourth swimming pass.</p><h2 id="event-the-grand-opening">Event: The Grand Opening</h2><p><br>Picture a new store's grand opening. Customers line up outside, waiting for the doors to open. The store manager (the main thread) is inside preparing everything. When everything is ready, the manager flips the "OPEN" sign (sets the event), and all the waiting customers can enter at once.</p><p>An Event is like this "OPEN" sign - it allows multiple threads to wait until a specific event occurs, then all proceed once it does.</p><pre><code class="language-python">import threading
import time
import random

# Simulate a system startup with dependent services
system_ready = threading.Event()
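
# (Added sketch, not shown in the original post) A client that waits at the door
# until the "OPEN" sign is flipped; the function and client_id are illustrative.
def client_request(client_id):
    print(f"Client {client_id} waiting for the system to be ready...")
    system_ready.wait()  # Block until the event is set
    print(f"Client {client_id} is now sending requests!")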

def service_startup(service_name, startup_time):
    print(f"{service_name} is starting up...")
    time.sleep(startup_time)  # Simulate startup time
    print(f"{service_name} has started successfully!")
    
    if service_name == "Database":
        # Database is the critical service, signal when it's ready
        print("Critical service is online, system can now process requests")
        system_ready.set()  # Set the event flag to True</code></pre><p>In our code, client threads wait for the system to be ready (the "OPEN" sign). Once the database service is up, it sets the event, allowing all waiting clients to proceed with their requests simultaneously.</p><h2 id="timer-the-absent-minded-professor">Timer: The Absent-Minded Professor</h2><p>Meet Professor Forgetful, who always gets so absorbed in his research that he forgets to save his work. His clever assistant set up an automatic reminder that pops up every 5 minutes saying, "Professor, save your work!"</p><p>A Timer is like this automatic reminder - it executes a function after a specified delay, without blocking the main program.</p><pre><code class="language-python">import threading
import time

def auto_save(document_name):
    print(f"Auto-saving document: {document_name}")
    # Schedule the next auto-save in 5 seconds
    timer = threading.Timer(5.0, auto_save, args=(document_name,))
    timer.daemon = True  # Allow the program to exit even if timer is alive
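    # (Added note, not from the original post) timer.start() only arms the *next*
    # save; the very first call has to come from the main program, e.g.
    # auto_save("draft.txt"), where the document name is purely illustrative.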
    timer.start()</code></pre><p>In our code, the auto-save function runs, then schedules itself to run again in 5 seconds, creating a recurring reminder that saves the document while the user continues working.</p><h2 id="barrier-the-synchronized-swimmers">Barrier: The Synchronized Swimmers</h2><p>Imagine a team of synchronized swimmers. Each swimmer performs their individual routine, but at certain points, they all need to meet in the center of the pool before starting the next sequence together.</p><p>A Barrier is like this synchronization point - it ensures that all threads reach a certain point before any of them proceed to the next step.</p><pre><code class="language-python">import threading
import time
import random
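
# (Added setup sketch, not in the original post) The driver this example assumes:
# a few workers sharing one Barrier and a results dict; NUM_WORKERS is illustrative.
NUM_WORKERS = 3
barrier = threading.Barrier(NUM_WORKERS)
results = {}
# The worker threads would be started after the function below is defined, e.g.:
#   workers = [threading.Thread(target=simulate_distributed_calculation,
#                               args=(i, barrier, results)) for i in range(NUM_WORKERS)]
#   for w in workers: w.start()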

def simulate_distributed_calculation(worker_id, barrier, results):
    print(f"Worker {worker_id} starting phase 1 calculation...")
    # Simulate phase 1 calculation
    time.sleep(random.uniform(1, 3))
    results[worker_id] = random.randint(1, 100)
    print(f"Worker {worker_id} finished phase 1 with result: {results[worker_id]}")
    
    # Wait for all workers to complete phase 1
    barrier.wait()</code></pre><p>In our code, each worker thread performs its phase 1 calculation at its own pace. The barrier ensures that all workers complete phase 1 before any of them move on to phase 2, which needs the combined results from all workers in phase 1.</p><p>We're at the end of the quirky stories! Hope you enjoyed it.</p>]]></content:encoded></item><item><title><![CDATA[Log Structured Merge Trees]]></title><description><![CDATA[<p>LSMTs are primarily used in databases where the write load is much heavier than the read load.</p><p>There are 4 primary concepts</p><ul><li>In-memory memtables and WAL</li><li>SSTables (Sorted String tables) on Disk</li><li>Compaction</li><li>Bloom Filters</li></ul><p>LSMTs based databases are usually based on logs, writes just involve writing to a log</p>]]></description><link>https://srirammv.dev/log-structured-merge-trees/</link><guid isPermaLink="false">60a1cd4adb404c11c98e83df</guid><category><![CDATA[databases]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Mon, 17 May 2021 02:30:06 GMT</pubDate><content:encoded><![CDATA[<p>LSMTs are primarily used in databases where the write load is much heavier than the read load.</p><p>There are 4 primary concepts</p><ul><li>In-memory memtables and WAL</li><li>SSTables (Sorted String tables) on Disk</li><li>Compaction</li><li>Bloom Filters</li></ul><p>LSMTs based databases are usually based on logs, writes just involve writing to a log in append manner. However just appending in this manner makes it so that reading is much harder, one would have to go through the entire logs to read, which becomes <em>O(n)</em> operation. Not really scalable.</p><p>What if the logs were sorted? Does'nt the read operation now to go <em>O(logn)</em>. That is the underlying concept for LSMT based Databases.</p><h3 id="memtable">Memtable</h3><p>The in-memory memtable has a finite amount of memory allocated to it. All writes directly append to a memtable. (Note: the entries into memtable are sorted directly). The mem-table data structure could use something like AVL Trees or Red-Black Trees to maintain the sorting order and still be <em>O(logn) </em>for writes.</p><p>As the size of the memtable reaches the max memory allocated for it. The memtable is flushed to disk.</p><p>Writes are always written to WAL (Write Ahead Log), since the memtable is entirely in memory, which means if there is a database crash. The WAL allows for a smooth recovery.</p><h3 id="sorted-string-tables">Sorted String Tables</h3><p>As writes are flushed to disk from memtable, they are written as sorted string tables. When reads come through, the database decides to search within <em>'x'</em> Sorted String Tables. Since the string tables are sorted, the read performance becomes <em>O(xlogn)</em></p><p>This can be made better by storing an SSTable index which could tell us which SSTables to search for based on input event.</p><h3 id="compaction">Compaction</h3><p>What if there is a separate process that runs in the background, which merges the <em>'x'</em> Sorted String Tables (SST) into smaller number of SSTs? This would dramatically reduce the number of searches to make across SSTs. Lets say the compaction reduces the number of SSTs from <em>'x'</em> to <em>'y'.</em></p><p><em>Note: Compactor also removes stale entries and updates to latest write for an entry on merge.</em></p><p>Then read performance is now <em>O(ylogn) where y &lt;&lt; x.</em></p><h3 id="bloom-filters">Bloom Filters</h3><p>Can we do even Better? 
Yes! As the number of SSTables increase, the read performance will start to take a beating. </p><p><em>Bloom filters to the rescue</em>. Bloom filters are a probablisitic data structure which tell us if an entry is present in a list which high accuracy. If compacted SSTables have bloom filters attached to them, we would now get O(1) operation on identifying if a SSTable holds an entry.</p><p>This reduces the read performance back to <em>y * O(1) + O(logn)</em></p><p>Woohoo! Another blog post done. </p><hr>]]></content:encoded></item><item><title><![CDATA[Dynamo: Amazon's Highly Available Key-Value Store]]></title><description><![CDATA[Taking a look at the core concepts behind "Dynamo".]]></description><link>https://srirammv.dev/dynamo-amazons-highly-available-key-value-store/</link><guid isPermaLink="false">5ef91af1db404c11c98e80cd</guid><category><![CDATA[research papers]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Mon, 29 Jun 2020 02:32:01 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Paper link</a></p><h3 id="introduction">Introduction</h3><ul><li>Reliability and Scalability of a system is dependent on how application state is managed.</li><li>Initially started with just a focus on applications which require just <em>"Primary Key"</em> access.</li><li>Data is partitioned and replicated using consistent hashing.</li><li>Replica consistency maintained by a <em>quorom</em> like technique and decentralized replica synchronization protocol.</li></ul><h3 id="why-dynamo">Why Dynamo?</h3><ul><li>Most production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS.</li></ul><h3 id="dynamo-assumptions">Dynamo Assumptions</h3><ul><li>Simple key value store with no relational schema.</li><li>Store relatively small objects.</li><li>Lower consistency requirements.</li></ul><h3 id="design-considerations-tenets">Design Considerations / Tenets</h3><ul><li>Always <em>write-able.</em></li><li>Latency Sensitive.</li><li>Conflict resolutions for multiple copies of data only happens during a <em>read</em>. Either the data-store or the application can manage these conflict resolutions. Application can choose to "<em>merge</em>" different versions whereas the data-store can use a simpler scheme such as <em>last write</em> wins.</li><li>Incremental stability: Addition of new storage nodes should be pain free.</li><li>Symmetry: All nodes share the same responsibilities.</li><li>Decentralization</li><li>Heterogeneity</li></ul><p><em>Opinion: Symmetry and Heterogeneity  as tenets do not seem to go well together and contradict each other. 
More on this later.</em></p><h3 id="discussion">Discussion</h3><ul><li>Dynamo can be characterized as a zero-hop <a href="https://en.wikipedia.org/wiki/Distributed_hash_table">DHT</a>, where each node maintains enough routing information locally to route a request to the appropriate node directly.</li></ul><h3 id="dynamo-api-interface">Dynamo API Interface</h3><ul><li><strong><em>get(key)</em></strong> : Locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context.</li><li><strong><em>put(key, context, object) : </em></strong>Determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.</li></ul><p><em>Opinion: If context is opaque to the caller, how is version information computed, does every write operation necessitate a read? Edit: Further in the paper, the authors do clarify that that is the case.</em></p><h3 id="partitioning-algorithm">Partitioning Algorithm</h3><p>A variant of the <a href="https://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a> algorithm is used. In the "<em>normal</em>" consistent hashing algorithm each storage node is only responsible for one point (portion) of the ring.</p><ul><li>Each storage node gets assigned to multiple points in the ring. To address this, "virtual nodes" are used. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node.</li></ul><h3 id="replication">Replication</h3><ul><li>Each item is replicated to "N" hosts which is configured per Dynamo instance.</li></ul><p><em>Note: The paper makes a reference to a "Co-ordinator" node without being clear on the nomenclature. Consistent hashing makes use of a ring as the output range of underlying hash function. Nodes are assigned positions on the ring. On hashing of an item, it reveals that the position on the ring. A node is responsible for the region of the ring between itself and its predecessor. Therefore the item would need to land on the first position which is larger than the hashed items' position. The node where an item lands on is called a "Co-ordinator" Node.</em></p><ul><li>Co-ordinator node also replicates the key that it needs to be in charge of, to <em>"N-1" </em>clockwise successor nodes.</li><li>Each key has list of nodes that are responsible for it. This list of nodes is called <br><em>"preference list". </em>Because of the presence of virtual nodes (co-located on the same host), the preference list usually only contains <strong>distinct</strong> physical nodes.</li></ul><h3 id="data-versioning">Data Versioning</h3><p>Dynamo uses <a href="https://en.wikipedia.org/wiki/Vector_clock">Vector clocks</a> in order to capture causality between different versions of the same object.</p><ul><li>Each object version has its own vector clock. Each vector lock is list of (node, counter) pairs. 
If first and second versions of objects have vector clocks such that counters present on the first object are higher than the counters present on the second object, it means the versions are in <strong><em>conflict</em></strong> are <strong><em>not ancestors</em></strong>!</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://srirammv.dev/content/images/2020/06/Screen-Shot-2020-06-28-at-5.36.55-PM.png" class="kg-image" alt><figcaption>Version Evolution of object over time.</figcaption></figure><p>What happens if the number of object versions and the size of the vector clocks grows too large?</p><ul><li>Theoretically will not happen because writes are handled by one of top "<em>N</em>" nodes in the <em>preference list</em>. </li><li>Vector clock sizes are limited to a fixed size and older entries are removed.</li></ul><h3 id="consistency-protocol">Consistency Protocol</h3><p><strong>R</strong> is the minimum number of nodes that must participate in a successful read operation. <strong>W </strong>is the minimum number of nodes that must participate in a successful write operation. Setting <strong>R</strong> and <strong>W</strong> such that <em><strong>R</strong> + <strong>W</strong> &gt; <strong>N</strong></em> yields a quorum-like system.</p><p>Tuning these parameters allow Dynamo to be a high performance read engine if required. eg: <strong>R=1, W=N</strong></p><p>Put Request</p><ul><li>Coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version (along with the new vector clock) to the <strong>N</strong> highest-ranked reachable nodes. (Usually Top <strong>N</strong> nodes from the key's <em>preference list</em>) If at least <strong>W-1</strong> nodes respond, then the write is considered successful.</li></ul><p>Get Request</p><ul><li>Coordinator requests all existing versions of data for that key from the <strong>N</strong> highest-ranked reachable nodes in the preference list for that key, and then waits for <strong>R</strong> responses before returning the result to the client.</li></ul><p>The catch is that Dynamo does not operate a true quorum, but a "<em>sloppy quorom</em>". i.e all read and write operations are performed on the first <strong><em>N healthy nodes</em></strong> from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.</p><h3 id="temporary-failure-handling">Temporary Failure Handling</h3><ul><li>If a node <strong>'A'</strong> that is supposed to be responsible for a write of the key is down. Another node <strong>'B'</strong> that is within the <em>Nth successor range</em> receives the write and stores it as a <em>"hinted replica".</em> Once <strong>'A'</strong> is back up, <strong>'B'</strong> delivers the hint to <strong>'A' </strong>and deletes it from its local store.</li></ul><h3 id="permanent-failure-handling">Permanent Failure Handling</h3><p>What if the node holding all the "hinted replica" dies?</p><ul><li>Dynamo uses an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.</li></ul><p>How?</p><ul><li><a href="https://en.wikipedia.org/wiki/Merkle_tree">Merkle hash trees</a> - They are an efficient data structure for comparing large amounts of data.</li><li>Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. 
In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. If the root hashes are different, the hashes of the children are exchanged, till its determined at what level the hashes are different and from there gather the list of keys that are "<em>out of sync</em>".</li></ul><h3 id="membership">Membership</h3><ul><li>Command line based tool triggers addition/removal of a node from the ring. Other nodes are notified through a gossip protocol that a node joined/left the ring.</li><li>When a node starts for the first time, it it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets.</li></ul><p><em>Opinion: How does it choose the portion of the consistent hash space that it is responsible for? without that information a node could choose a portion of the consistent hash space that is receiving zero traffic. The performance on the entire Dynamo system may not change at all because of an addition of a new node. Edit: Further in the paper, authors mention that there were multiple strategies attempted.</em></p><h3 id="external-discovery">External Discovery</h3><ul><li>Nodes have separate mechanism to discover certain special <em>"seed" </em>nodes. This helps with avoiding partitions in the system, Eventually all nodes will get to know about new members in the system through the gossip protocol.</li></ul><p><em>Opinion: This goes against one of the tenets mentioned in the earlier part of the post of symmetry. It's perhaps to be interpreted as symmetry of some smaller subset of operations, since there are special actors across the nodes.</em></p><h3 id="node-failure">Node Failure</h3><ul><li>There is no global view of the status of a node "A". It is only determined by other nodes that wish to communicate with "A". Explicit node arrival/departures from the system are determined as sufficient.</li></ul><h3 id="common-dynamo-configurations">Common Dynamo Configurations</h3><ul><li>The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2). These values are chosen to meet the necessary levels of performance, durability, consistency, and availability SLAs.</li></ul><h3 id="optimizations">Optimizations</h3><ul><li>Nodes maintain an object buffer in-memory. Every write operation is written to that buffer. Separate writer thread flushes the writes to disk periodically. </li><li>Whenever a node gets a read operation, the nodes check the in-memory buffer before accessing disk. If present, the object is returned from the in-memory buffer. Otherwise, the object is retrieved from disk.</li></ul><p>Durability is a trade-off. The hosts with the in-memory buffer can crash. To overcome this, the Co-ordinator node instructs one of the <strong>"N"</strong> nodes responsible for the key to perform a <em>"durable"</em> write. This does not effect performance since the writes only wait for <strong>W </strong>responses. 
(This works because: W &lt; N).</p><h3 id="partition-strategies">Partition Strategies</h3><ul><li>Strategy 1:  T random tokens per node and partition by token value</li><li>Strategy 2:  T random tokens per node and equal sized partitions </li><li>Strategy 3:  Q/S tokens per node, equal-sized partitions</li></ul><figure class="kg-card kg-image-card"><img src="https://srirammv.dev/content/images/2020/06/Screen-Shot-2020-06-28-at-7.10.20-PM-1.png" class="kg-image" alt srcset="https://srirammv.dev/content/images/size/w600/2020/06/Screen-Shot-2020-06-28-at-7.10.20-PM-1.png 600w, https://srirammv.dev/content/images/size/w1000/2020/06/Screen-Shot-2020-06-28-at-7.10.20-PM-1.png 1000w, https://srirammv.dev/content/images/size/w1251/2020/06/Screen-Shot-2020-06-28-at-7.10.20-PM-1.png 1251w"></figure><p>Strategy 1</p><ul><li>Unequal partition ranges, because the partition ranges are just defined by laying the tokens in the hash space in order. Every two continuous tokens define a range.</li><li>Goes back to earlier point of not being able to add more nodes to handle extra request load, since the performance is non-deterministic.</li></ul><p>Strategy 2</p><ul><li>Q equally sized partitions.</li><li>Q is usually set such that Q &gt;&gt; N and Q &gt;&gt; S*T, where S is the number of nodes in the system. </li><li>Partition scheme does not depend on the tokens.</li></ul><p>Strategy 3</p><ul><li>Each node is assigned Q/S tokens where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved. Similarly, when a node joins the system it "steals" tokens from nodes in the system in a way that preserves these properties.</li></ul><p>Strategy 3 comes out to be the best.</p><ul><li>Size of membership information per node is the least. </li><li>Faster bootstrapping/recovery.</li><li>Ease of Archival.</li></ul><p><em>Opinion:</em></p><p><em>The above section requires some more in-depth analysis to showcase why Strategy 3 is the best. Expect an update.</em></p><h3 id="divergent-object-versions">Divergent Object Versions</h3><ul><li>Network Partitions</li><li>Excessive concurrent writers</li></ul><h3 id="client-or-server-driven-co-ordination">Client or Server driven Co-ordination</h3><ul><li>Load-Balancer forwards a request it receives uniformly to any node in the ring. Write requests require a Co-ordinator node (top node from the preference list for the key). Read requests do not.</li><li>Clients can poll a dynamo node to figure out which nodes are in the preference list for their key of choice. The requests can then be routed directly to that node, this avoids a network hop and a need for a load balancer.</li></ul><h3 id="background-vs-foreground">Background vs Foreground</h3><ul><li>Only grant and allow background tasks in time slices depending on the measurement of how much load foreground tasks are currently causing (latencies for disk operations, failed database accesses due to lock-contention and transaction timeouts, and request queue wait times). 
The "<em>admission controller</em>" manages this process.</li></ul><h3 id="conclusion">Conclusion</h3><ul><li>Primary advantage of Dynamo is that it provides the necessary knobs using the three parameters of (N,R,W) to tune their instance based on their needs.</li><li>For new applications that want to use Dynamo, some analysis is required during the initial stages of the development to pick the right conflict resolution mechanisms that meet the business case appropriately</li><li>Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system.</li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Back to Blogging]]></title><description><![CDATA[<p>It seems almost a lifetime ago that I used to have a blog. Kick starting this blog to have an outlet for my thoughts where the platform for sharing my thoughts is exclusively owned by me. Meet you at the next post!</p>]]></description><link>https://srirammv.dev/back-to-blogging/</link><guid isPermaLink="false">5eefb570db404c11c98e807f</guid><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Sun, 21 Jun 2020 19:33:37 GMT</pubDate><content:encoded><![CDATA[<p>It seems almost a lifetime ago that I used to have a blog. Kick starting this blog to have an outlet for my thoughts where the platform for sharing my thoughts is exclusively owned by me. Meet you at the next post!</p>]]></content:encoded></item><item><title><![CDATA[Can you Tic Tac Toe?]]></title><description><![CDATA[<p>Well that was a question that I asked myself.</p><p>I was inspired by this blogpost: <a href="http://neverstopbuilding.com/minimax">http://neverstopbuilding.com/minimax</a></p><p>Without much further ado, Here's the algorithm. (I have strived to explain, what's going at every step in the form of comments)</p><p>First of all, we need to establish a function,</p>]]></description><link>https://srirammv.dev/can-you-tic-tac-toe/</link><guid isPermaLink="false">696c1904b95680954a4b0f58</guid><category><![CDATA[python]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Thu, 05 Nov 2015 00:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Well that was a question that I asked myself.</p><p>I was inspired by this blogpost: <a href="http://neverstopbuilding.com/minimax">http://neverstopbuilding.com/minimax</a></p><p>Without much further ado, Here's the algorithm. (I have strived to explain, what's going at every step in the form of comments)</p><p>First of all, we need to establish a function, such that when given a board, we determine if either 'X' or 'O' won.</p><p>In my case, I assume any n * n board and all unfilled spots in the board are shown as '-'<br>So an empty 3 * 3 board would look like this:</p><pre><code>[['-', '-', '-'], ['-', '-', '-'], ['-', '-', '-']]
</code></pre><pre><code class="language-python">def win(board):
    # look for horizontal, vertical and diagonal matches
    size = len(board)
    count = 0
    criss = []
    cross = []
    for i in board:
        curr_row = []
        curr_col = []
        for j in range(0, len(i)):
            if board[count][j] != '-':
                # horizontal
                curr_row.append(board[count][j])
            if board[j][count] != '-':
                # vertical
                curr_col.append(board[j][count])
            if count == j:
                # top-left to bottom-right diagonal
                if board[count][j] != '-':
                    criss.append(board[count][j])
            if count + j + 1 == size:
                # top-right to bottom-left diagonal
                if board[count][j] != '-':
                    cross.append(board[count][j])
        count = count + 1
        if len(curr_row) == size:
            if set(curr_row) == {'X'}:
                return 'X'
            elif set(curr_row) == {'O'}:
                return 'O'

        if len(curr_col) == size:
            if set(curr_col) == {'X'}:
                return 'X'
            elif set(curr_col) == {'O'}:
                return 'O'
        if len(criss) == size:
            if set(criss) == {'X'}:
                return 'X'
            elif set(criss) == {'O'}:
                return 'O'

        if len(cross) == size:
            if set(cross) == {'X'}:
                return 'X'
            elif set(cross) == {'O'}:
                return 'O'
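
# Quick sanity check of win() on a board with a completed top row:
# win([['X', 'X', 'X'],
#      ['O', 'O', '-'],
#      ['-', '-', '-']])  # returns 'X'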
</code></pre><p>Now, that was more or less straightforward. On to the interesting part: the recursive move function.</p><pre><code class="language-python">def move(board, player):
    '''
    :param board - a list of lists, such that its a square matrix
    :param player - The current player 'X' or 'O'
    returns a tuple
    if the player is 'X', and is in winning position,
    then we return (1, next_position_to_play)
    eg: (1, (2, 2))
    if the player is 'X', and is in a position to draw at best,
    then we return (0, next_position_to_play)
    eg: (0, (1, 1))
    if the player is 'O', and is in winning position,
    then we return (-1, next_position_to_play)
    eg: (-1, (2, 1))
    if the player is 'O', and is in a position to draw at best,
    then we return (0, next_position_to_play)
    eg: (0, (1, 0))
    if the board is full and there is no winner, we return the game as drawn.
    eg: (0, -1)
    '''
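    # This is plain minimax: X is the maximizing player, O is the minimizing
    # player, and positions are scored +1 (win for X), 0 (draw), -1 (win for O).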
    size = len(board) * len(board)
    # size is total number of spots on the board
    count = 0
    i_count = 0
    for i in board:
        for j in range(0, len(i)):
            if board[i_count][j] == '-':
                count = count + 1
        i_count = i_count + 1
    if count == size:
        # The board is empty, and there are tons of choices to make,
        # but it's probably a safe bet to grab the center spot.
        # (// keeps the indexes as integers on Python 3 too.)
        return 0, (len(board) // 2, len(board) // 2)
    nextplayer = 'O' if player == 'X' else 'X'
    winner = win(board)

    if winner == 'X':
        # if the winner is X, then current player is 'O'
        # But does it lead to a winning situation for 'X'? Yes
        # return +1, -1
        # we are maximizing for X (+1), and -1 since game is over
        return 1, -1
    elif winner == 'O':
        # if the winner is O, then current player is 'X'
        # But does it lead to a winning situation for 'O'? Yes
        # return -1, -1
        # we are minimizing for O (-1), and -1 since game is over
        return -1, -1
    if count == 0:
        # Full board with no winner: the game is drawn.
        # (This check runs after the win() lookup so a win on the
        # very last move is not mistaken for a draw.)
        return 0, -1
    list_indexes = []
    res_list = []
    i_count = 0
    for i in board:
        for j in range(0, len(i)):
            if board[i_count][j] == '-':
                list_indexes.append((i_count,j))
        i_count = i_count + 1
    # list_indexes contains all the indexes where element is '-'
    for position in list_indexes:
        i, j = position
        # Go through every empty position on the board, and
        # assign the current player to it.
        board[i][j] = player
        # recursively call move on the current board,
        # after we filled one more spot.
        # with the player parameter of move being the nextplayer
        ret, _ = move(board, nextplayer)
        # Remember, ret could be -1, 0 or 1.
        res_list.append(ret)
        # Important: go back and mark the position we tried as '-' again.
        board[i][j] = '-'
    if player == 'X':
        # We are maximizing for X (remember note from earlier).
        # We return 1 if X is in a winning position. If there is a
        # winning position, we find the (first occurrence) index of
        # the 1, and use that as the index into list_indexes.
        # If not, most likely all of res_list is just zeros,
        # so we will just pick list_indexes[0] as our next move.
        max_elem = max(res_list)
        return max_elem, list_indexes[res_list.index(max_elem)]

    else:
        # We are minimizing for O (remember note from earlier).
        # We look for -1 in res_list, find its index, and use that
        # index into list_indexes.
        # If not, most likely all of res_list is just zeros,
        # so we will just pick list_indexes[0] as our next move.
        min_elem = min(res_list)
        return min_elem, list_indexes[res_list.index(min_elem)]
</code></pre><p>Phew, that took some time to figure out.</p><p>Now let's test it out.</p><pre><code class="language-python">In [3]: move([['-','-','-'],['-','-','-'],['-','-','-']],'X')
Out[3]: (0, (1, 1))

In [4]: move([['-','-','-'],['-','X','-'],['-','-','-']],'O')
Out[4]: (0, (0, 0))
In [5]: move([['O','-','-'],['-','X','-'],['-','-','-']],'X')
Out[5]: (0, (0, 1))

In [6]: move([['O','X','-'],['-','X','-'],['-','-','-']],'O')
Out[6]: (0, (2, 1))

In [7]: move([['O','X','-'],['-','X','-'],['-','O','-']],'X')
Out[7]: (0, (1, 0))

In [8]: move([['O','X','-'],['X','X','-'],['-','O','-']],'O')
Out[8]: (0, (1, 2))

In [9]: move([['O','X','-'],['X','X','O'],['-','O','-']],'X')
Out[9]: (0, (0, 2))
In [10]: move([['O','X','X'],['X','X','O'],['-','O','-']],'O')
Out[10]: (0, (2, 0))

In [11]: move([['O','X','X'],['X','X','O'],['O','O','-']],'X')
Out[11]: (0, (2, 2))

In [12]: move([['O','X','X'],['X','X','O'],['O','O','X']],'O')
Out[12]: (0, -1)
</code></pre><p>And it ends in a draw!</p><p>Hopefully, that was interesting.</p><p>I'm documenting the algorithms I work on here: <a href="https://github.com/sriram-mv/108" rel="noopener">https://github.com/sriram-mv/108</a></p>]]></content:encoded></item><item><title><![CDATA[Introducing Smoothie]]></title><description><![CDATA[<p>Who knew that drinking a smoothie could spark a simple idea for a weekend project? :)</p><p>The idea is pretty simple: a way to allow decorated Python functions to call back to a certain handler based on the exception thrown by the wrapped function.</p><p>A decorator can establish that. Combine that with</p>]]></description><link>https://srirammv.dev/introducing-smoothie/</link><guid isPermaLink="false">696c1854b95680954a4b0f4d</guid><category><![CDATA[python]]></category><dc:creator><![CDATA[Sriram Madapusi Vasudevan]]></dc:creator><pubDate>Thu, 21 May 2015 00:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Who knew that drinking a smoothie could spark a simple idea for a weekend project? :)</p><p>The idea is pretty simple: a way to allow decorated Python functions to call back to a certain handler based on the exception thrown by the wrapped function.</p><p>A decorator can establish that. Combine that with a class and you can now save the original function before it was decorated, and call that if need be. (A rough sketch of this pattern appears after the example session below.)</p><pre><code class="language-python">In [1]: from smoothie.king import Dispenser
In [2]: def err_callback(*args, **kwargs):
   ...:         print("Error handled")
   ...:

In [3]: juice = Dispenser()

In [4]: @juice.attach(exception=IndexError,
   ...:               callback=err_callback)
   ...: def vending_machine():
   ...:         drinks = ['Tea','Coffee', 'Water']
   ...:         return drinks[4]
   ...:

In [5]: vending_machine()
Error handled
In [6]: juice.original('vending_machine')
Out[6]: &lt;function __main__.vending_machine&gt;

In [8]: juice.original('vending_machine')()
-------------------------------------------------------------
IndexError                  Traceback (most recent call last)
&lt;ipython-input-8-94b31cd25051&gt; in &lt;module&gt;()
----&gt; 1 juice.original('vending_machine')()
&lt;ipython-input-4-10a9cb6127cc&gt; in vending_machine()
      3 def vending_machine():
      4         drinks = ['Tea','Coffee', 'Water']
----&gt; 5         return drinks[4]
      6

IndexError: list index out of range
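
# For intuition, here is a rough, hypothetical sketch of the pattern described
# above (the name SketchDispenser is made up for illustration; this is not the
# actual smoothie implementation): a class-based decorator factory that keeps
# the undecorated function around and routes a chosen exception to a callback.
import functools

class SketchDispenser:
    def __init__(self):
        # Map of function name -> the original, undecorated function.
        self._originals = {}

    def attach(self, exception, callback):
        def decorator(func):
            # Remember the pristine function before wrapping it.
            self._originals[func.__name__] = func

            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except exception:
                    # Hand the failure off to the registered callback.
                    return callback(*args, **kwargs)

            return wrapper
        return decorator

    def original(self, name):
        # Return the function as it was before decoration.
        return self._originals[name]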
</code></pre><p>Extremely simple to use.</p><p>Code: <a href="https://github.com/sriram-mv/smoothie">https://github.com/sriram-mv/smoothie</a></p><p>Pull requests are welcome!</p><p>P.S. Travis CI runs on every pull request as well.</p>]]></content:encoded></item></channel></rss>