Saturday, November 17, 2007

An old talk by Werner Vogels on Distributed Systems

Long ago, Amazon Bangalore held an event where Werner Vogels had been invited to speak on designing distributed systems on a planetary scale. I had taken notes, which surfaced while I was cleaning up the house. I decided to put them up online:

1. Use scalable primitives (RPC breakable easily)
2. Cache near the edges
3. Hierarchies and functional partitioning
4. Use aggregation, data fusion
5. Do not conceal heterogeneity
6. Be strict in what you emit, liberal in what you accept.
7. Avoid strong consistency properties
-Never expect your system to be stable
-Assume that nodes are leaving, joining, failing
Control:
For control to work, the system needs to be deterministic (hard)
-Apply a top down approach to controlling
-Cannot use force to put them into a model.
-"Real life in essence is probabilistic". Let go of Control

Self organizing systems:
-Positive feedback, or negative feedback

Robustness in Biological systems
--------------------------------

-Redundancy
-Feedback
-Modularity
-Loose coupling
-purging
-Apoptosis (programmed cell death: 50-70 billion cells commit suicide every day in an adult human)
-Spatial compartmentalization
-Extended Phenotype
Scaling the organization
------------------------
-Organization needs to be bottom up.
-Functional units need to behave like organisms, can take care of themselves.
-Nodes recycle all the time
-Reboot becomes a tool
-Stability of the organism is key, even if cells die

Continuous introspection:
Nodes responsible for themselves, not outside monitoring

The power of Epidemics
----------------------
Probabilistic model: eventual consistency
Asynchronous communication pattern
Autonomous and decentralized actions
Robust with respect to message loss/node failure
Rigorous mathematical underpinnings

Epidemic algorithms and protocols:

Choose a random subset of operations
2 phases:
Phase I:
1-> N/2
Initial rate of growth factor of 2
Half way factor of 1.4
Near end factor of 1
Phase II:
nearly all nodes infected
O(log n) rounds needed to infect the entire population
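
The two phases can be seen in a small simulation. This is only a sketch under my own assumptions, not something shown in the talk: a simple "push" model where, in every round, each infected node sends the update to one peer picked uniformly at random. The function name simulate_push_gossip and the parameters n and seed are made up for illustration.

```python
import math
import random


def simulate_push_gossip(n, seed=0):
    """Return the infected-node count after each round of push gossip.

    Assumed model: every infected node pushes the update to one
    uniformly random node per round (it may pick an already
    infected node, or even itself).
    """
    rng = random.Random(seed)
    infected = {0}                 # node 0 starts with the update
    history = [len(infected)]
    while len(infected) < n:
        newly = set()
        for _ in infected:
            newly.add(rng.randrange(n))
        infected |= newly
        history.append(len(infected))
    return history


if __name__ == "__main__":
    n = 1024
    history = simulate_push_gossip(n)
    for rnd in range(1, len(history)):
        growth = history[rnd] / history[rnd - 1]
        print(f"round {rnd:2d}: {history[rnd]:5d} infected, growth x{growth:.2f}")
    print("rounds used:", len(history) - 1, " log2(n):", math.log2(n))
```

The growth factor should start close to 2 while almost every push hits an uninfected node, and decay towards 1 as the population saturates. The total number of rounds stays a small multiple of log n.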

Failure detection Service:
Look for the last update to a node's state
If the timestamp is not updated, you know of a disconnection
Probabilistic, reliable
-buffer received messages
-garbage collection suffers from scalability problem
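
A rough sketch of the timestamp idea, under my own assumptions rather than anything from the talk: each node keeps a table of heartbeat counters, merges tables with peers it gossips with, and suspects any node whose counter has not increased for a while. The class HeartbeatTable, its methods and the fail_after threshold are hypothetical names; time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatTable:
    """Per-node view of everyone's heartbeat (illustrative sketch)."""

    def __init__(self, fail_after):
        self.fail_after = fail_after
        # node_id -> (heartbeat counter, local time the counter last increased)
        self.entries = {}

    def beat(self, node_id, now):
        """Record a fresh heartbeat for node_id at local time `now`."""
        counter, _ = self.entries.get(node_id, (0, now))
        self.entries[node_id] = (counter + 1, now)

    def merge(self, remote, now):
        """Adopt any higher counters seen in a peer's table."""
        for node_id, (counter, _) in remote.entries.items():
            local = self.entries.get(node_id)
            if local is None or counter > local[0]:
                self.entries[node_id] = (counter, now)

    def suspected(self, now):
        """Nodes whose counter has not moved for more than fail_after."""
        return sorted(
            node_id
            for node_id, (_, seen) in self.entries.items()
            if now - seen > self.fail_after
        )


if __name__ == "__main__":
    a, b = HeartbeatTable(fail_after=5), HeartbeatTable(fail_after=5)
    a.beat("a", now=0)
    b.beat("b", now=0)
    a.merge(b, now=1)           # a learns about b through gossip
    a.beat("a", now=10)
    print(a.suspected(now=10))  # b has been silent since time 1
```

The verdict is only probabilistic: a slow gossip path looks the same as a dead node, so fail_after trades detection speed against false suspicions.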

Distributed State Maintenance:
-------------------------------
State Engine: a distributed database table
Leaves are like rows
-lives on net
Randomized Rumour spreading
use:
-autonomous , asynchronous behaviour
-Let go of control, deterministic techniques
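
To make the "distributed table kept in sync by rumour spreading" idea concrete, here is a minimal sketch. It is my own illustration, not the system described in the talk: every node owns one row, rows carry a version number, and in each round every node exchanges tables with one random peer and keeps the newer version of each row. The names merge_rows and gossip_until_converged are hypothetical.

```python
import random


def merge_rows(local, remote):
    """Copy into `local` every row that `remote` has in a newer version."""
    for key, (version, value) in remote.items():
        if key not in local or version > local[key][0]:
            local[key] = (version, value)


def gossip_until_converged(tables, seed=0, max_rounds=1000):
    """Run push-pull gossip rounds; return rounds needed, or None."""
    rng = random.Random(seed)
    for rounds in range(1, max_rounds + 1):
        for i in range(len(tables)):
            j = rng.randrange(len(tables))   # random peer, no coordinator
            merge_rows(tables[i], tables[j])
            merge_rows(tables[j], tables[i])
        if all(t == tables[0] for t in tables):
            return rounds
    return None


if __name__ == "__main__":
    # Each node starts out knowing only its own row: key -> (version, value).
    tables = [{f"node{i}": (1, f"load={i}")} for i in range(16)]
    rounds = gossip_until_converged(tables)
    print("converged after", rounds, "rounds;", len(tables[0]), "rows per node")
```

No node is in charge and no exchange is guaranteed to happen at any particular time, yet every replica ends up with the same table. That is the eventual consistency the notes refer to.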