Wednesday, March 4, 2015

P2P Systems

Why study P2P systems?
* First distributed systems that seriously focused on scalability with respect to the number of nodes
* Techniques from P2P systems are widely used in cloud computing systems (e.g., key-value stores)

Widely-deployed P2P systems (Industrial P2P)
1. Napster
2. Gnutella
3. Fasttrack
4. Bittorrent

Academic P2P systems
1. Chord
2. Pastry
3. Kelips
and many more from academia

NAPSTER
* Napster clients store the actual files.
* There are many Napster clients, which are called peers
* Napster servers at napster.com store only a directory (filenames with peer pointers), i.e., information about each file and the <ip-address, port#> of the peer where the file actually resides.

When a client/peer comes up:
1. It connects to a napster server and uploads a list of music that it wants to share
2. Server maintains a list of <filename, ip-address, port#> tuples, indexed in a ternary tree (like a binary tree, but with up to three children per node). The server stores no files.

When a client searches for a file:
1. It sends server the keywords to search with
2. Server searches its directory list with the keywords
3. Server returns a list of hosts <ip-address, port#> tuples to the client
4. Client pings each host in the list to find transfer rates
5. Client fetches file from best host
Except for returning the directory list, the server is not involved in the actual file transfer between peers.
All communication uses TCP, a reliable and ordered transport protocol.
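
A minimal sketch of this flow in Python, assuming a plain in-memory dict keyed by filename (the real server used a ternary-tree index); the class and method names below are made up for illustration.

    # Hypothetical sketch of a Napster-style directory server: it stores only
    # <filename, ip-address, port#> tuples, never the files themselves.
    from collections import defaultdict

    class DirectoryServer:
        def __init__(self):
            self.directory = defaultdict(list)   # filename -> [(ip, port), ...]

        def register(self, filenames, ip, port):
            # Called when a peer comes up and uploads the list of files it shares
            for name in filenames:
                self.directory[name].append((ip, port))

        def search(self, keyword):
            # Return <ip-address, port#> tuples whose filename contains the keyword
            return [loc
                    for name, locs in self.directory.items()
                    if keyword.lower() in name.lower()
                    for loc in locs]

    # The client then pings each returned host and fetches the file from the
    # fastest one; the server sees none of that traffic.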

Joining a P2P system
1. Send an HTTP request to a well-known URL for that P2P service
2. Request is routed (after DNS lookup) to the introducer, a well-known node that keeps track of recently joined nodes in the P2P system
3. Introducer initializes the new peer's neighbor table

Problems:
1. Centralized server is a source of congestion
2. Centralized server is a single point of failure
3. No security: plaintext messages and passwords
4. napster.com was held responsible for its users' copyright infringement

GNUTELLA
* Eliminate servers
* Client machines search and retrieve amongst themselves
* Clients act as servers too, and are called servents
* All peers/servents are connected in an overlay graph (called an overlay because each edge is an implicit Internet path, possibly spanning several underlying hops)
* Each peer stores their own files and also peer pointers

Search:
Gnutella routes the following message types within the overlay graph:
a. Query (search)
b. QueryHit (response to query)
c. Ping (to probe network for other peers)
d. Pong (reply to ping, contains address of another peer)
e. Push (for initiating file transfer)

Message Header Format:
1. Descriptor ID (16 bytes)
2. Payload descriptor (1 byte)
3. TTL (1 byte) : Decremented at each hop; the message is dropped when TTL reaches 0. This prevents messages from circulating indefinitely in the overlay graph.
4. Hops (1 byte) : Incremented at each hop
5. Payload length (4 bytes)
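
For concreteness, a small sketch of packing/unpacking this 23-byte header with Python's struct module; the little-endian encoding of the payload length is an assumption here, based on common descriptions of the v0.4 protocol.

    # Sketch: the five header fields above, packed into 23 bytes.
    import struct, uuid

    HEADER_FMT = "<16sBBBI"   # DescriptorID, PayloadDescriptor, TTL, Hops, PayloadLength

    def pack_header(payload_descriptor, ttl, hops, payload_len):
        descriptor_id = uuid.uuid4().bytes           # 16-byte unique message id
        return struct.pack(HEADER_FMT, descriptor_id,
                           payload_descriptor, ttl, hops, payload_len)

    def unpack_header(data):
        return struct.unpack(HEADER_FMT, data[:23])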

Avoiding excessive traffic:
1. To avoid duplicate transmissions, each peer maintains a list of recently received messages
2. Query forwarded to all neighbors except the peer from which it was received
3. Each query (identified by Descriptor ID), forwarded only once
4. QueryHit routed back only to peer from which Query received with same DescriptorID
5. For flooded messages, duplicates with same DescriptorID and PayloadDescriptor are dropped
6. QueryHit with DescriptorID for which Query not seen is dropped
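
These rules amount to flooding with duplicate suppression plus reverse-path routing for QueryHits. A rough Python sketch follows, with hypothetical message dicts and peer objects (the Query descriptor value is taken from the v0.4 spec):

    # Sketch of the forwarding rules above. Messages are modelled as dicts and
    # neighbors as objects with a send() method; both interfaces are hypothetical.
    QUERY = 0x80          # Query payload descriptor value (Gnutella v0.4)

    seen = {}             # (DescriptorID, PayloadDescriptor) -> peer the message came from

    def handle_query(msg, from_peer, neighbors):
        key = (msg["descriptor_id"], QUERY)
        if key in seen:                          # rules 3/5: forward each Query only once
            return
        seen[key] = from_peer                    # remember the reverse path for the QueryHit
        if msg["ttl"] > 1:                       # decrement TTL, drop at zero
            fwd = dict(msg, ttl=msg["ttl"] - 1, hops=msg["hops"] + 1)
            for peer in neighbors:
                if peer is not from_peer:        # rule 2: everyone except the sender
                    peer.send(fwd)

    def handle_queryhit(msg):
        key = (msg["descriptor_id"], QUERY)      # QueryHit carries the Query's DescriptorID
        back = seen.get(key)
        if back is not None:                     # rule 6: drop QueryHit for an unseen Query
            back.send(msg)                       # rule 4: route back along the reverse path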

After receiving QueryHit message:
* Requestor chooses the best QueryHit responder and initiates an HTTP GET request to the responder's ip-address:port. The GET request has a Range field for partial file transfers.
* Responder then replies with the file packets. However, the responder may be behind a firewall that blocks the direct GET request from the requestor. Gnutella handles this with "Push" messages when the direct HTTP GET fails: the Push message is routed along the reverse path of the QueryHit, and the responder then opens a connection back to the requestor and sends the file packets. Note that the firewall blocks only incoming connections; it has no issue with outgoing ones.
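
The direct fetch is ordinary HTTP; a tiny Python sketch of the GET with a Range header is below (the responder address, path, and byte range are made-up example values, not the actual Gnutella download URL format).

    # Sketch: fetch the first 256 KB of a file directly from the responder.
    import urllib.request

    req = urllib.request.Request(
        "http://203.0.113.7:6346/song.mp3",        # ip-address:port from the QueryHit (example)
        headers={"Range": "bytes=0-262143"},       # partial transfer: first 256 KB only
    )
    with urllib.request.urlopen(req) as resp:
        data = resp.read()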

Problems:
1. Ping/Pong messages constituted about 50% of traffic. The solution was to multiplex (when multiple ping/pong messages are pending, send only one out), cache, and reduce the frequency of ping/pong messages
2. Repeated searches with the same keywords. Solution: cache Query and QueryHit messages
3. Modem-connected hosts do not have enough bandwidth to pass Gnutella traffic. Solution: use a central server as a proxy for such peers
4. Large number of freeloaders (peers that download files but never upload)

FASTTRACK
* Hybrid between Gnutella and Napster
* Takes advantage of "healthier" participants in the system
* Underlying technology in Kazaa, KazaaLite, Grokster
* Proprietary protocol, but some details available
* Like Gnutella, but with some peers designated as super nodes. A super node stores a directory listing <file-name, peer-pointer> for nearby peers, similar to a Napster server
* Supernode membership changes over time
* Any peer can become and stay a super node, provided it has earned enough reputation/contribution
* Reputation scheme : P2PEcon
* A peer searches by contacting a nearby super node, instead of flooding Query traffic throughout the overlay graph

BITTORRENT
* Similar to other P2P networks, but provides incentives for not being freeloaders.
* Tracker per file (receives heartbeats, joins and leaves from peers)
* When a peer connects, it contacts the tracker, which returns a list of the peers that have the file.
* It then connects to those peers. A peer can be a seed (has the full file) or a leecher (has only some blocks)
* File split into blocks with block size between 32KB - 256 KB
* Download uses a Local Rarest First block policy: prefer early download of blocks that are least replicated among neighbors
* Incentives are provided by a Tit-for-Tat bandwidth usage scheme: provide blocks to the neighbors that provided the best download rates
* Choking: limits the number of neighbors to which concurrent uploads go. Typically only the <=5 best neighbors are unchoked; everyone else is choked. Periodically (~10 s) update the best-neighbor list, and periodically (~30 s) optimistically unchoke a random neighbor to keep the unchoked list fresh
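
A small sketch of the Local Rarest First choice, assuming we already know which blocks each neighbor advertises; the function and variable names are illustrative, not BitTorrent's actual API.

    # Sketch: among blocks we still need, pick the one held by the fewest neighbors
    # (ties broken at random).
    import random
    from collections import Counter

    def pick_next_block(needed_blocks, neighbor_bitmaps):
        # neighbor_bitmaps: list of sets, each the block indices one neighbor holds
        counts = Counter()
        for bitmap in neighbor_bitmaps:
            counts.update(b for b in bitmap if b in needed_blocks)
        if not counts:
            return None                          # no neighbor has anything we need
        rarest = min(counts.values())
        candidates = [b for b, c in counts.items() if c == rarest]
        return random.choice(candidates)

    # Example: block 7 is held by only one neighbor, so it is downloaded first.
    print(pick_next_block({5, 7}, [{5, 7}, {5}, {5}]))    # -> 7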

CHORD
* DHT (Distributed Hash Table)
- A hash table allows you to insert, look up, and delete objects by key. The HT stores objects in buckets; operations take O(1) time on average
- A DHT allows you to do the same in a distributed setting, where objects = files and each file has a unique key (e.g., its filename). Instead of storing objects in buckets, objects are stored at nodes/machines in a cluster
- Performance
1. Load balancing : just like a regular HT, all buckets/nodes should be roughly equally loaded
2. Fault-tolerance : when nodes join and leave the system, we don't want to lose objects
3. Efficiency of lookups and inserts
4. Locality

- Napster, Gnutella, Fasttrack and Bittorrent are DHT-like, but their lookups and inserts are not efficient
- Memory, lookup latency, and #messages per lookup:
             Memory                           Lookup latency   #Messages per lookup
  Napster    O(1) at client / O(N) at server  O(1)             O(1)
  Gnutella   O(N)                             O(N)             O(N)
  Chord      O(log N)                         O(log N)         O(log N)

* Intelligent choice of neighbors to reduce latency and message cost of routing (lookups/inserts)
* Uses consistent hashing on the node's (peer's) address
- SHA-1 (ip-address, port) -> 160-bit string
- Truncated to m bits, which is called the peer id. Ids are not guaranteed unique, but conflicts are unlikely, especially for large m
- Peers can then be mapped to one of the 2^m logical points on a circle (the ring)
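
A sketch of the id computation, assuming the peer's address is hashed as the string "ip:port" (the exact encoding is an implementation detail) and m = 7, giving a 128-point ring:

    import hashlib

    M = 7                                     # example: a ring with 2^7 = 128 logical points

    def peer_id(ip, port, m=M):
        digest = hashlib.sha1(f"{ip}:{port}".encode()).digest()    # 160-bit string
        return int.from_bytes(digest, "big") % (2 ** m)            # keep only m bits -> peer id

    print(peer_id("10.0.0.1", 6000))          # some point in [0, 127]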

* Peer pointers : Each peer/node maintains the following information/pointers
- Immediate successor.
- Finger tables. The finger table has m entries; the i-th entry (a successor peer/node id on the ring) is the first peer whose id satisfies
id >= (n + 2^i) mod 2^m,   for i = 0, 1, ..., m-1, where n is this node's id
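
A sketch of what that rule produces, using a hypothetical 2^7-point ring populated by peers N16, N32, N45, N80, N96, N112:

    # Sketch: build a finger table from the rule above.
    def successor(ids, point):
        ring = sorted(ids)
        for pid in ring:
            if pid >= point:
                return pid
        return ring[0]                        # wrap around past the top of the ring

    def finger_table(n, ids, m):
        return [successor(ids, (n + 2 ** i) % 2 ** m) for i in range(m)]

    peers = [16, 32, 45, 80, 96, 112]         # hypothetical peers on a 2^7-point ring
    print(finger_table(80, peers, 7))         # -> [96, 96, 96, 96, 96, 112, 16]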

* How are files stored?
- Filenames also mapped using same consistent hash function
-- SHA-1 (filename) -> 160 bit string
-- File is stored at the first peer with id greater than or equal to its key (mod 2^m)
- For example, if cnn.com/index.html maps to key K42, it is stored at the first peer/node on the ring with id >= 42
- P2P systems can be used to store any kind of object: HTML files for cooperative web caching, shared files (including mp3s) for file-sharing applications

- With consistent hashing and K keys and N peers, each peer stores O(K/N) keys.

* How does search work?
- At node n, send the query for key k to the largest successor or finger table entry that is <= k (in ring order). If none exists, send the query to successor(n). This resolves a lookup in O(log N) hops with high probability
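
A sketch of that forwarding rule on the same hypothetical 128-point ring as above; "largest entry <= k" is interpreted in clockwise ring order:

    def next_hop(n, key, fingers, succ, m):
        ring = 2 ** m
        def dist(a, b):
            return (b - a) % ring             # clockwise distance on the ring
        best = None
        for f in set(fingers) | {succ}:
            # candidate must not pass the key when walking clockwise from n
            if f != n and dist(n, f) <= dist(n, key):
                if best is None or dist(n, f) > dist(n, best):
                    best = f
        return best if best is not None else succ

    # With N80's fingers [96, 96, 96, 96, 96, 112, 16] and successor 96, a query
    # for key 42 is forwarded to 16, the closest preceding peer to the key.
    print(next_hop(80, 42, [96, 96, 96, 96, 96, 112, 16], 96, 7))   # -> 16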

* Failures in Chord
- happen when peers fail. Finger-table routing cannot reach a peer when that peer's successor has failed. One solution is to maintain r multiple successor entries and, in case of failure, switch to an alternate successor entry.
r = 2 log N suffices to maintain correct lookups with high probability
- happen when the node storing the file itself fails. The solution is to replicate the file (e.g., at the node's successors).
- Churn : Nodes join and leave/failure
-- When a new peer joins (example: N40 joins a ring where N32's current successor is N45)
a. Introducer directs N40 to N45 (and N32)
b. N32 updates its successor to N40
c. N40 initializes its successor to N45, and initializes its fingers from N45's finger table
d. N40 periodically talks to its neighbors to update its finger table (a stabilization protocol runs at each node to refresh successor and finger table entries from its successors; eventually other nodes may even update their finger table entries to point at the newly joined node)
e. Keys in the range (32, 40] that were stored at N45 need to be copied over to N40
-- # of messages for a peer join = O(log N) * O(log N) = O(log^2 N)
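
A hedged sketch of the periodic stabilization step in (d), following the Chord paper's stabilize/notify idea rather than anything spelled out in these notes; the node objects and their attributes are hypothetical.

    # Each node periodically checks whether a newly joined peer (e.g. N40) now sits
    # between it and its current successor, and repairs its successor pointer.
    RING = 2 ** 7

    def between(x, a, b, ring=RING):
        # True if x lies strictly between a and b, walking clockwise around the ring
        return 0 < (x - a) % ring < (b - a) % ring

    def stabilize(node):
        p = node.successor.predecessor           # ask the successor who precedes it
        if p is not None and between(p.id, node.id, node.successor.id):
            node.successor = p                   # e.g. N32 discovers N40 and adopts it
        node.successor.notify(node)              # successor may update its predecessor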

* Virtual nodes
- Hashing can be non-uniform, leading to bad load balancing
- Treat each physical node as multiple virtual nodes behaving independently
- Each virtual node joins the ring independently
- Reduces variance of load imbalance
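
A small sketch, assuming each physical node hashes itself onto the ring v times by appending an index to its address (v = 4 and the address are example values):

    import hashlib

    V = 4                                     # virtual nodes per physical node (example value)

    def virtual_ids(ip, port, m=7, v=V):
        ids = []
        for i in range(v):
            h = hashlib.sha1(f"{ip}:{port}#{i}".encode()).digest()
            ids.append(int.from_bytes(h, "big") % 2 ** m)   # truncate to m bits, as before
        return ids

    print(virtual_ids("10.0.0.1", 6000))      # four independent positions on the ring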

* Virtual ring and consistent hashing are used in Cassandra, Riak, Voldemort, DynamoDB, and other key-value stores
* For more information : www.pdos.mit.edu/chord/

PASTRY
* Assigns ids to nodes, just like Chord (on a virtual ring)
* Leaf set - each node knows its predecessor(s) and successor(s)
* Routing tables are based on prefix matching (think of a hypercube); routing takes O(log N) hops, and hops are short in the underlying network
* Routing matches the id prefix bit by bit: routing table entries point to peers matching the target's prefix by 1 bit, 2 bits, ..., up to m bits, and each hop forwards to a peer with a longer matching prefix. This is unlike Chord's finger-table routing
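
A simplified, bit-level sketch of prefix routing (real Pastry routes on digits of base 2^b and also keeps a leaf set; the ids here are short example bit strings):

    # Sketch: forward to the known peer whose id shares the longest bit prefix
    # with the key, if that improves on our own match.
    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def pastry_next_hop(self_id, key, known_peers):
        best = max(known_peers, key=lambda p: common_prefix_len(p, key))
        if common_prefix_len(best, key) > common_prefix_len(self_id, key):
            return best
        return self_id                        # no peer improves the prefix match

    print(pastry_next_hop("0110", "1011", ["1000", "1010", "0111"]))   # -> "1010"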

Pastry Locality
* For each prefix length, among all potential neighbors with a matching prefix, the neighbor with the shortest round-trip time is selected
* Since shorter prefixes have many more candidates (spread out throughout the Internet), the neighbors for shorter prefixes are likely to be closer than the neighbors for longer prefixes
* Thus in prefix routing, early hops are short and later hops are longer
* Yet overall stretch, compared to direct Internet path, stays short

KELIPS
* A DHT with constant (O(1)) lookup cost
* Concept of affinity groups: k affinity groups, where k = sqrt(N)
* Each node hashed to a group (hash mod k), where hash can be SHA-1
* Node's neighbors
- All other nodes in its own affinity group (about sqrt(N) on average)
- One contact node per foreign affinity group (k-1 such foreign affinity groups)
- Total = 2 sqrt(N) - 1
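
A sketch of the group assignment and the resulting neighbor count, with an example system size N = 100 (so k = 10 groups) and SHA-1 as the hash:

    import hashlib, math

    N = 100                                   # example system size
    K = round(math.sqrt(N))                   # k = sqrt(N) affinity groups (here k = 10)

    def affinity_group(ip, port, k=K):
        h = int.from_bytes(hashlib.sha1(f"{ip}:{port}".encode()).digest(), "big")
        return h % k                          # hash mod k picks the node's group

    print(affinity_group("10.0.0.1", 6000))   # a group id in [0, k-1]
    # Neighbors: all ~sqrt(N) nodes of this group plus one contact per other group,
    # i.e. roughly 2*sqrt(N) - 1 neighbors in total, as listed above.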

Kelips Files and Metadata
* Files can be stored at any node
* Decouple file replication/location (outside kelips) from file querying (inside kelips)
* Each filename hashed to a group
-- All nodes in the group replicate pointer information i.e. <filename, file location>
-- Nodes in the affinity group store only these pointers, not the files themselves

Kelips lookups
* Lookup
-- Find file affinity group by hashing the filename
-- From the querying node's neighbor list, go to its contact for the file's affinity group (just one hop)
-- Failing that, try another neighbor to find a contact for that group (a maximum of two hops)
-- Note that lookups take a constant number of hops, O(1)
-- Memory cost is O(sqrt(N)) : about 1.93 MB for 100K nodes and 10M files
-- Fits in RAM of most workstations/laptops today (COTS machine)
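
A rough sketch of the lookup path, where `contacts` maps each foreign affinity group to our single contact node in it; the contact/query interfaces below are hypothetical.

    # Sketch: hash the filename to its affinity group, then ask our contact there
    # for the <filename, file-location> pointer (normally a single hop).
    import hashlib

    def file_group(filename, k):
        return int.from_bytes(hashlib.sha1(filename.encode()).digest(), "big") % k

    def lookup(filename, k, contacts, own_group_members):
        g = file_group(filename, k)
        contact = contacts.get(g)
        if contact is not None:
            return contact.query(filename)        # hop 1: contact returns the file location
        # failing that, ask a member of our own group for a contact in group g
        helper = own_group_members[0]
        return helper.find_contact(g).query(filename)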

Kelips Soft State
* Membership lists
-- Gossip-based membership, within each affinity group and also across affinity groups
-- O(log N) dissemination time
-- Unlike Chord and Pastry, which must fall back to other successors/predecessors when the immediate successor fails, a Kelips lookup has many more options: the querying node can ask any other node in its own affinity group for a contact in the required affinity group, or it can ask a node in a different affinity group

* File metadata
-- needs to be periodically refreshed from the source node in gossip style; otherwise the file metadata times out and is eventually removed from the system. The advantage is that when a file is deleted, its metadata doesn't need to be explicitly removed from all the nodes in the affinity group

* Slightly higher memory and background bandwidth cost, but lookup cost is O(1)