What is MapReduce?
The terms Map and Reduce are borrowed from functional languages like Lisp
Map is a metafunction which applies a function to a list and produces another list as output
Map processes each record sequentially and independently
Reduce is a metafunction which applies a function to a group of records in batches
Eg: Sum of squares
Map applies square function while reduce applies add function
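The sum-of-squares example can be sketched directly with Python's built-in `map()` and `functools.reduce()` (the input list here is illustrative):

```python
from functools import reduce

nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))   # map: square each record independently
total = reduce(lambda a, b: a + b, squares)  # reduce: add up the mapped results
print(total)  # 1 + 4 + 9 + 16 = 30
```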
Eg: Wordcount
Map processes individual records to generate intermediate key/value pairs
input: <filename, file text>
output: <key, value> pairs
i/p dataset can be split/sharded across multiple Map tasks, so a large number of individual records are processed in parallel to generate intermediate key/value pairs
Reduce processes and merges all intermediate values associated per key
input: <key, value> pairs. Since multiple Map tasks process independently, there can be multiple records with the same key
output: <key, value> pairs with unique key
Reduce can be parallelized with multiple reduce tasks: the keys are partitioned, each key is assigned to exactly one reduce task, and the tasks process and merge the intermediate values in parallel
Popular Partitioning of keys for reduce tasks is Hash partitioning
i.e. key is assigned to reduce task # = hash(key) % number of reduce servers
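The WordCount flow above, including hash partitioning of keys across reduce tasks, can be sketched in a single process (the file names, inputs, and function names here are hypothetical):

```python
from collections import defaultdict

def map_task(filename, text):
    # emit <word, 1> for every word in the input split
    return [(word, 1) for word in text.split()]

def partition(key, num_reducers):
    # hash partitioning: reduce task # = hash(key) % number of reduce tasks
    return hash(key) % num_reducers

def reduce_task(key, values):
    # merge all intermediate values associated with one key
    return (key, sum(values))

NUM_REDUCERS = 2
inputs = [("f1.txt", "the cat sat"), ("f2.txt", "the dog sat")]

# Map phase: each input split is processed independently
intermediate = [kv for fname, text in inputs for kv in map_task(fname, text)]

# Shuffle: group values by key within each partition
partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for key, value in intermediate:
    partitions[partition(key, NUM_REDUCERS)][key].append(value)

# Reduce phase: one output record per unique key
counts = dict(reduce_task(k, vs) for p in partitions for k, vs in p.items())
print(counts)
```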
Map Reduce was invented by Google, but it was not open sourced.
Yahoo open-sourced implementation of Map Reduce and that became Apache Hadoop.
Hortonworks was spun off from Yahoo and does a lot of the code development
MapReduce Examples
1. Distributed Grep
i/p : large set of files
o/p : lines that match pattern
map : emits a line if it matches the supplied pattern
reduce : copies the intermediate data to o/p
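A minimal sketch of distributed grep, with an identity reduce that just copies the intermediate data to the output (the sample log lines and function names are assumptions):

```python
import re

def map_grep(filename, lines, pattern):
    # emit a line if it matches the supplied pattern
    return [(filename, line) for line in lines if re.search(pattern, line)]

def reduce_grep(key, values):
    # identity reduce: copy the intermediate data to the output
    return (key, values)

lines = ["error: disk full", "ok", "error: timeout"]
matches = map_grep("log.txt", lines, r"error")
result = reduce_grep("log.txt", [line for _, line in matches])
print(result)  # ('log.txt', ['error: disk full', 'error: timeout'])
```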
2. Reverse Web-link graph
i/p : Web graph in the form of tuples (a,b) where page a -> page b
o/p : for each page b, the list of pages that link to it i.e. all pages which point to page b
map : process web log and for each i/p <source, target>, it outputs <target, source>
reduce : emits <target, list(source)>
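The edge inversion above can be sketched as follows (the tiny example graph is an assumption):

```python
from collections import defaultdict

def map_links(source, targets):
    # for each edge <source, target>, emit <target, source>
    return [(target, source) for target in targets]

def reduce_links(target, sources):
    # emit <target, list(source)>: all pages pointing to target
    return (target, sorted(sources))

web_graph = {"a": ["b", "c"], "b": ["c"]}
intermediate = [kv for src, tgts in web_graph.items() for kv in map_links(src, tgts)]

# Shuffle: group sources by target page
grouped = defaultdict(list)
for target, source in intermediate:
    grouped[target].append(source)

inlinks = dict(reduce_links(t, s) for t, s in grouped.items())
print(inlinks)  # {'b': ['a'], 'c': ['a', 'b']}
```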
3. Count of URL access frequency
i/p : log of accessed URLs eg: from a proxy server
o/p : for each URL, % of total accesses for that URL
map : process web log and outputs <URL,1>
reduce : (multiple) emits <URL, URL_count>
The above is similar to WordCount, but we need % of total accesses for each URL. Hence we need to chain another MapReduce job
map : processes <URL, URL_count> (o/p of above MapReduce job) and outputs <1, (<URL, URL_count>)>
The reason for setting the key as 1 is so that there is only one Reduce task
reduce : (only one) sums up URL_count's to calculate overall_count in one pass and emits multiple <URL, URL_count/overall_count> in a second pass
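The second, chained job can be sketched as below: because every record is emitted with key 1, a single reduce task sees all <URL, URL_count> pairs and can compute the fractions (the sample counts are assumptions):

```python
def map_job2(url, count):
    # key everything by 1 so only one reduce task receives all records
    return (1, (url, count))

def reduce_job2(key, values):
    # first pass: sum counts to get overall_count
    overall = sum(count for _, count in values)
    # second pass: emit <URL, URL_count/overall_count>
    return [(url, count / overall) for url, count in values]

url_counts = [("/a", 3), ("/b", 1)]        # o/p of the first (WordCount-like) job
intermediate = [map_job2(u, c) for u, c in url_counts]
values = [v for _, v in intermediate]       # all records land on the single reducer
print(reduce_job2(1, values))               # [('/a', 0.75), ('/b', 0.25)]
```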
4. Sort
In the implementation of MapReduce, by default
a map task's o/p is sorted (eg: quicksort)
a reduce task's i/p is sorted (eg: mergesort)
i/p : series of (key, value) pairs
o/p : sorted <value>s
map : (key, value) -> (value, _) : Sorted by default engine
reduce : (key, value) -> (key, value) : Sorted by default engine
However, the partitioning function should be a range function instead of a hash function, so that each contiguous key range is assigned to one reduce task and concatenating the reduce outputs yields a globally sorted result
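A range partitioner can be sketched as below; the split points here are assumed (real systems typically pick them by sampling the keys):

```python
SPLIT_POINTS = [10, 20]  # assumed boundaries for 3 reduce tasks

def range_partition(key, split_points):
    # reduce task i gets all keys below split_points[i];
    # the last task gets everything else, so task outputs
    # concatenate into a globally sorted sequence
    for i, bound in enumerate(split_points):
        if key < bound:
            return i
    return len(split_points)

print(range_partition(5, SPLIT_POINTS))   # 0
print(range_partition(15, SPLIT_POINTS))  # 1
print(range_partition(99, SPLIT_POINTS))  # 2
```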
MapReduce Scheduling
1. Parallelize map - determine # map tasks and how the dataset is split or sharded
2. Transfer data from map to reduce - partitioning function
3. Parallelize reduce - schedule reduce tasks
4. Implement storage for map input, map output, reduce input and reduce output
Ensure a "barrier" between the Map phase and the Reduce phase i.e. ensure that no Reduce starts before all Maps are finished
YARN Scheduler:
Used in Hadoop 2.x+
Yet Another Resource Negotiator
Treats each server as a collection of containers (Container = some CPU + some memory)
Has 3 main components
1. Global Resource Manager (RM) for scheduling the containers to application tasks/jobs
2. Per-server Node Manager (NM) - daemon and server-specific functions
3. Per-application (job) Application Master (AM) - negotiates containers with the RM and NMs; detects task failures of that job
MapReduce Fault-tolerance
1. Server failure
a. NM heartbeats to RM. If a server fails, RM notifies all affected AMs and the AMs take action
b. NM keeps track of each task running at its server. If a task fails while in progress, it is marked as idle and restarted
c. AM heartbeats to RM. On failure, RM restarts AM which then syncs up with its running tasks
2. RM failure : Use old checkpoints and bring up a secondary RM
3. Heartbeats are also used to piggyback container requests, thus avoiding extra messages
Slow Servers or stragglers
Stragglers can be due to a bad disk, network bandwidth, CPU or memory
This is handled by tracking the progress of each task belonging to the same job. If a task is detected to be a straggler, it is replicated on another server, with the speculation that the replica might get lucky and finish faster. The task is considered done when the first replica completes; the other replica is then killed.
Locality
A single cloud has a hierarchical topology
GFS/HDFS stores 3 replicas of each chunk (eg: 64 MB in size) - 2 on one rack, 1 on a different rack
MapReduce attempts to schedule a map task in the following order
a. a machine that contains a replica of corresponding input data
b. on the same rack as a machine containing a replica of the input data
c. anywhere
Attributes of cloud computing - scale
1. Servers
2. SLAs, latency
3. Performance
Open Source
Contributed by Yahoo : Hadoop, ATS (Apache Traffic Server)
Technologies brought in from outside that Yahoo uses and contributes to : Storm, HBase, Hive