Question 6 - Big Data Processing¶
- Explain the Map Reduce paradigm and programming model
- Explain the system architecture
- Explain a concrete example application and how it is executed
- Explain how to optimize the performance and how worker failures can be handled
- Describe Spark and Pregel, and how they differ from Hadoop
Explain the Map Reduce paradigm and programming model¶
MapReduce consists of two phases, each realized by a user-defined function:
- Map
    - given input key-value pairs, produce intermediate key-value pairs
    - the framework groups intermediate values with the same key and sends each group to a reducer
- Reduce
    - further compresses the set of values for each key, typically into a single result value
Created for the processing of large data sets.
Inspired by the map and fold functions from functional programming.
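To make the model concrete, here is a minimal single-process sketch in Python, using word count as the classic example. The grouping loop stands in for the shuffle that the real framework performs between the two phases; all names and the sample documents are invented for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit one intermediate (word, 1) pair per occurrence."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: collapse all values of one key into a single total."""
    yield (word, sum(counts))

# Single-process stand-in for the framework: run the maps, group the
# intermediate pairs by key (the "shuffle"), then run the reduces.
docs = {1: "to be or not to be", 2: "to do is to be"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for key, value in map_fn(doc_id, text):
        groups[key].append(value)
result = dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))
print(result)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```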
Explain the system architecture¶
The system consists of one master and many workers, which execute the map and reduce tasks (mappers and reducers).
The master
- assigns map and reduce tasks to the workers
- stores the state of each task (idle, in progress, completed)
- stores the locations of the tasks' output files
Execution:
- First, the input is split into pieces
- The master assigns map and reduce tasks to idle workers
- Mappers write intermediate output to their local disks; reducers get the paths of this map output from the master
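A much-simplified sketch of the master's bookkeeping could look as follows. The task states and the idea of tracking output locations are from the MapReduce paper; the `Master`/`Task` classes and method names are hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # idle -> in-progress -> completed
    worker: Optional[str] = None   # worker currently executing the task
    output: list = field(default_factory=list)  # output file locations

class Master:
    """Assigns tasks and tracks their state and output locations."""

    def __init__(self, n_splits, n_reducers):
        self.tasks = ([Task("map") for _ in range(n_splits)]
                      + [Task("reduce") for _ in range(n_reducers)])

    def assign(self, worker):
        """Hand the next idle task to a worker asking for work."""
        for task in self.tasks:
            if task.state == "idle":
                task.state, task.worker = "in-progress", worker
                return task
        return None

    def complete(self, task, output_files):
        """Record where a finished task wrote its output; reducers
        later ask the master for the map outputs' locations."""
        task.state, task.output = "completed", output_files
```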
Explain a concrete example application and how it is executed¶
Google Music Example¶
- Chunk-servers store the play logs (analytics data) for music played
- We want to find the most played songs
The master tells each chunk-server to count how many times each song has been played in its local chunk (map phase); the per-chunk counts for the same song are then merged by the reducers, which yields the overall play counts (reduce phase).
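A toy simulation of this execution, with chunk-servers modeled as in-memory lists (the song names and counts are invented):

```python
from collections import Counter

# Each chunk-server holds part of the play log (one song name per play).
chunk_servers = [
    ["songA", "songB", "songA"],
    ["songB", "songB", "songC"],
]

# Map phase: every chunk-server counts plays in its own chunk.
partial_counts = [Counter(chunk) for chunk in chunk_servers]

# Shuffle + reduce phase: partial counts for the same song are merged,
# then the songs are ranked by total plays.
total = Counter()
for partial in partial_counts:
    total.update(partial)
print(total.most_common(1))  # [('songB', 3)]
```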
Explain how to optimize the performance and how worker failures can be handled¶
Performance¶
- Locality
    - a worker should be close to the GFS replica storing its input data
- Stragglers
    - slow workers; nearly always some worker is slow
    - when the program is nearly finished, in-progress tasks are additionally scheduled on backup workers
    - a task is done when either the backup or the original finishes (see the sketch below)
- Barrier synchronization / pipelining
    - whether reducing can already start while mapping is still running
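The backup-task idea can be sketched with plain Python threads: run the original and a backup copy of the same task and accept whichever finishes first. The function name and the simulated timings are made up for illustration.

```python
import concurrent.futures
import random
import time

def run_task(copy):
    """Simulate a task whose worker might be a straggler."""
    time.sleep(random.uniform(0.1, 1.0))
    return copy

# Near the end of the job, schedule a backup execution for each
# still-in-progress task; the task counts as done as soon as
# either copy finishes.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_task, "original"),
               pool.submit(run_task, "backup")]
    first = next(concurrent.futures.as_completed(futures))
    print("task completed first by:", first.result())
```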
Fault Tolerance¶
- The master periodically pings every worker
- If a worker does not respond, it is marked as failed
    - its map tasks (in progress or completed) are reset to idle and rescheduled, because map output lives on the failed worker's local disk
    - for its reduce tasks, two things can happen
        - if the task is in progress, it is rescheduled
        - if the task is done, its output was already written to global storage, so nothing needs to be redone
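Continuing the hypothetical `Master`/`Task` sketch from above, the failure handling could look roughly like this:

```python
def handle_worker_failure(master, failed_worker):
    """Reschedule the tasks of a worker that stopped answering pings.
    (Reuses the hypothetical Master/Task sketch from above.)"""
    for task in master.tasks:
        if task.worker != failed_worker:
            continue
        if task.kind == "map":
            # Map output lives on the failed worker's local disk,
            # so even completed map tasks must be re-executed.
            task.state, task.worker = "idle", None
        elif task.state == "in-progress":
            # Reduce output goes to the global file system, so only
            # in-progress reduce tasks need rescheduling.
            task.state, task.worker = "idle", None
```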
Describe Spark and Pregel, and how they differ from Hadoop¶
Spark¶
MapReduce is inefficient for applications that reuse intermediate results across multiple computations, e.g. iterative algorithms such as PageRank or k-means.
Spark is built around resilient distributed datasets (RDDs)
- immutable, partitioned collections of records (normally kept in RAM)
- fault-tolerant, parallel data structures
- enable efficient data reuse: users can explicitly persist intermediate results in memory
An RDD does not have to be materialized at all times:
- it stores its "lineage": information about how it was derived from other datasets (the operations applied to other RDDs)
- lost partitions can therefore be re-computed on demand
- (in the usual lineage-graph figures, partitions that are already in memory are drawn as black rectangles)
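This is essentially the log-mining example from the RDD paper, expressed in PySpark; the input path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-reuse")

# Transformations only record lineage; nothing is computed yet.
lines = sc.textFile("hdfs:///logs/app.log")   # placeholder path
errors = lines.filter(lambda line: "ERROR" in line)

# Explicitly persist the intermediate RDD in memory, so that the
# two actions below reuse it instead of re-reading the input.
errors.cache()

print(errors.count())                                 # 1st action: materializes
print(errors.filter(lambda l: "disk" in l).count())   # 2nd action: reuses cache
```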
Pregel¶
Pregel is tailored to graph computations
- scales to billions of vertices and trillions of edges
Like Spark, Pregel keeps intermediate results in memory across iterations (super-steps)
Algorithm termination is based on every vertex voting to halt
- In super-step 0, every vertex is in the active state
- A vertex deactivates itself by voting to halt
- A message may re-activate a vertex
Master-worker architecture
- the master monitors the workers and partitions the vertices across them
- workers execute the vertex program at each super-step
- workers report the number of active vertices to the master at the end of each step
Uses GFS or BigTable for persistent data
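A toy single-process rendition of the vote-to-halt protocol, using maximum-value propagation (a standard Pregel example); the graph and the initial values are invented.

```python
# Each vertex repeatedly adopts the largest value it has heard of and
# votes to halt once an incoming message no longer changes its value.
edges = {1: [2], 2: [1, 3], 3: [2]}   # toy graph: neighbour lists
value = {1: 3, 2: 6, 3: 2}            # initial vertex values
active = set(edges)                   # super-step 0: all vertices active
inbox = {v: [] for v in edges}

while active:                         # terminate when all voted to halt
    outbox = {v: [] for v in edges}
    for v in active:
        new = max([value[v]] + inbox[v])
        if inbox[v] and new == value[v]:
            continue                  # vote to halt: send no messages
        value[v] = new
        for nbr in edges[v]:
            outbox[nbr].append(new)   # messages re-activate neighbours
    inbox = outbox
    active = {v for v, msgs in inbox.items() if msgs}

print(value)  # every vertex converges to the maximum: {1: 6, 2: 6, 3: 6}
```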