What is MapReduce?
The terms Map and Reduce are borrowed from functional languages like Lisp
Map is a metafunction which applies a function to a list and produces another list as output
Map processes each record sequentially and independently
Reduce is a metafunction which applies a function to a group of records in batches
Eg: Sum of squares
Map applies square function while reduce applies add function
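The sum-of-squares example can be sketched directly with Python's built-in `map()` and `functools.reduce()` (the input list here is illustrative):

```python
from functools import reduce

nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))   # map: square each record independently
total = reduce(lambda a, b: a + b, squares)  # reduce: add up the mapped results
print(total)  # 1 + 4 + 9 + 16 = 30
```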
Eg: Wordcount
Map processes individual records to generate intermediate key/value pairs
input: <filename, file text>
output: <key, value> pairs
i/p dataset can be split/sharded across multiple Map tasks, so a large number of individual records are processed in parallel to generate intermediate key/value pairs
Reduce processes and merges all intermediate values associated per key
input: <key, value> pairs. Since multiple Map tasks process independently, there can be multiple records with the same key
output: <key, value> pairs with unique key
Reduce can be parallelized with multiple reduce tasks: the keys are partitioned, each key is assigned to exactly one reduce task, and the tasks process and merge the intermediate values in parallel
Popular Partitioning of keys for reduce tasks is Hash partitioning
i.e. key is assigned to reduce task # = hash(key) % number of reduce servers
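The WordCount flow above, including hash partitioning of keys across reduce tasks, can be sketched in a single process (the file names, inputs, and function names here are hypothetical):

```python
from collections import defaultdict

def map_task(filename, text):
    # emit <word, 1> for every word in the input split
    return [(word, 1) for word in text.split()]

def partition(key, num_reducers):
    # hash partitioning: reduce task # = hash(key) % number of reduce tasks
    return hash(key) % num_reducers

def reduce_task(key, values):
    # merge all intermediate values associated with one key
    return (key, sum(values))

NUM_REDUCERS = 2
inputs = [("f1.txt", "the cat sat"), ("f2.txt", "the dog sat")]

# Map phase: each input split is processed independently
intermediate = [kv for fname, text in inputs for kv in map_task(fname, text)]

# Shuffle: group values by key within each partition
partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for key, value in intermediate:
    partitions[partition(key, NUM_REDUCERS)][key].append(value)

# Reduce phase: one output record per unique key
counts = dict(reduce_task(k, vs) for p in partitions for k, vs in p.items())
print(counts)
```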
Map Reduce was invented by Google, but it was not open sourced.
Yahoo open-sourced implementation of Map Reduce and that became Apache Hadoop.
Hortonworks was spun off from Yahoo and does a lot of the code development
MapReduce Examples
1. Distributed Grep
i/p : large set of files
o/p : lines that match pattern
map : emits a line if it matches the supplied pattern
reduce : copies the intermediate data to o/p
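A minimal sketch of distributed grep, with an identity reduce that just copies the intermediate data to the output (the sample log lines and function names are assumptions):

```python
import re

def map_grep(filename, lines, pattern):
    # emit a line if it matches the supplied pattern
    return [(filename, line) for line in lines if re.search(pattern, line)]

def reduce_grep(key, values):
    # identity reduce: copy the intermediate data to the output
    return (key, values)

lines = ["error: disk full", "ok", "error: timeout"]
matches = map_grep("log.txt", lines, r"error")
result = reduce_grep("log.txt", [line for _, line in matches])
print(result)  # ('log.txt', ['error: disk full', 'error: timeout'])
```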
2. Reverse Web-link graph
i/p : Web graph in the form of tuples (a,b) where page a -> page b
o/p : for each page b, the list of pages that link to it i.e. all pages which point to page b
map : process web log and for each i/p <source, target>, it outputs <target, source>
reduce : emits <target, list(source)>
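The edge inversion above can be sketched as follows (the tiny example graph is an assumption):

```python
from collections import defaultdict

def map_links(source, targets):
    # for each edge <source, target>, emit <target, source>
    return [(target, source) for target in targets]

def reduce_links(target, sources):
    # emit <target, list(source)>: all pages pointing to target
    return (target, sorted(sources))

web_graph = {"a": ["b", "c"], "b": ["c"]}
intermediate = [kv for src, tgts in web_graph.items() for kv in map_links(src, tgts)]

# Shuffle: group sources by target page
grouped = defaultdict(list)
for target, source in intermediate:
    grouped[target].append(source)

inlinks = dict(reduce_links(t, s) for t, s in grouped.items())
print(inlinks)  # {'b': ['a'], 'c': ['a', 'b']}
```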
3. Count of URL access frequency
i/p : log of accessed URLs eg: from a proxy server
o/p : for each URL, % of total accesses for that URL
map : process web log and outputs <URL,1>
reduce : (multiple) emits <URL, URL_count>
The above is similar to WordCount, but we need % of total accesses for each URL. Hence we need to chain another MapReduce job
map : processes <URL, URL_count> (o/p of above MapReduce job) and outputs <1, (<URL, URL_count>)>
The reason for setting the key as 1 is so that there is only one Reduce task
reduce : (only one) sums up URL_count's to calculate overall_count in one pass and emits multiple <URL, URL_count/overall_count> in a second pass
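The second, chained job can be sketched as below: because every record is emitted with key 1, a single reduce task sees all <URL, URL_count> pairs and can compute the fractions (the sample counts are assumptions):

```python
def map_job2(url, count):
    # key everything by 1 so only one reduce task receives all records
    return (1, (url, count))

def reduce_job2(key, values):
    # first pass: sum counts to get overall_count
    overall = sum(count for _, count in values)
    # second pass: emit <URL, URL_count/overall_count>
    return [(url, count / overall) for url, count in values]

url_counts = [("/a", 3), ("/b", 1)]        # o/p of the first (WordCount-like) job
intermediate = [map_job2(u, c) for u, c in url_counts]
values = [v for _, v in intermediate]       # all records land on the single reducer
print(reduce_job2(1, values))               # [('/a', 0.75), ('/b', 0.25)]
```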
4. Sort
In the implementation of MapReduce, by default
a map task's o/p is sorted (eg: quicksort)
a reduce task's i/p is sorted (eg: mergesort)
i/p : series of (key, value) pairs
o/p : sorted <value>s
map : (key, value) -> (value, _) : Sorted by default engine
reduce : (key, value) -> (key, value) : Sorted by default engine
However, the partitioning function should be a range function instead of a hash function, so that each contiguous key range is assigned to one reduce task and concatenating the reduce outputs yields a globally sorted result
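A range partitioner can be sketched as below; the split points here are assumed (real systems typically pick them by sampling the keys):

```python
SPLIT_POINTS = [10, 20]  # assumed boundaries for 3 reduce tasks

def range_partition(key, split_points):
    # reduce task i gets all keys below split_points[i];
    # the last task gets everything else, so task outputs
    # concatenate into a globally sorted sequence
    for i, bound in enumerate(split_points):
        if key < bound:
            return i
    return len(split_points)

print(range_partition(5, SPLIT_POINTS))   # 0
print(range_partition(15, SPLIT_POINTS))  # 1
print(range_partition(99, SPLIT_POINTS))  # 2
```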
MapReduce Scheduling
1. Parallelize map - determine # map tasks and how the dataset is split or sharded
2. Transfer data from map to reduce - partitioning function
3. Parallelize reduce - schedule reduce tasks
4. Implement storage for map input, map output, reduce input and reduce output
Ensure a "barrier" between the Map phase and the Reduce phase i.e. ensure that no Reduce starts before all Maps are finished
YARN Scheduler:
Used in Hadoop 2.x+
Yet Another Resource Negotiator
Treats each server as a collection of containers (Container = some CPU + some memory)
Has 3 main components
1. Global Resource Manager (RM) for scheduling the containers to application tasks/jobs
2. Per-server Node Manager (NM) - daemon and server-specific functions
3. Per-application (job) Application Master (AM) - negotiates containers with the RM and NMs; detects task failures of that job
MapReduce Fault-tolerance
1. Server failure
a. NM heartbeats to RM. If a server fails, RM notifies all affected AMs and the AMs take action
b. NM keeps track of each task running at its server. If a task fails while in progress, it is marked as idle and restarted
c. AM heartbeats to RM. On failure, RM restarts AM which then syncs up with its running tasks
2. RM failure : Use old checkpoints and bring up a secondary RM
3. Heartbeats are also used to piggyback container requests, thus avoiding extra messages
Slow Servers or stragglers
Stragglers can be due to a bad disk, network bandwidth, CPU or memory
This is handled by tracking the progress of each task belonging to the same job. If a task is detected to be a straggler, it is replicated on another server, with the speculation that the replica might get lucky and finish faster. The task is considered done when the first replica completes; the other replica is then killed.
Locality
A single cloud has a hierarchical topology
GFS/HDFS stores 3 replicas of each chunk (eg: 64 MB in size) - 2 on one rack, 1 on a different rack
MapReduce attempts to schedule a map task in the following order
a. a machine that contains a replica of corresponding input data
b. on the same rack as a machine containing a replica of the input data
c. anywhere
Attributes of cloud computing - scale
1. Servers
2. SLAs, latency
3. Performance
Open Source
Contributed by Yahoo : Hadoop, ATS (Apache Traffic Server)
Technologies brought in from outside that Yahoo uses and contributes to : Storm, HBase, Hive