MapReduce is a programming framework, popularized by Google, that simplifies data processing across massive data sets. As people rapidly increase their online activity and digital footprint, organizations are finding it vital to quickly analyze the huge amounts of data their customers and audiences generate in order to better understand and serve them. MapReduce is one of the tools helping those organizations do exactly that.
Most enterprises deal with multiple types of data (plain text, rich text, relational, graph, etc.) and need to process all of it quickly and efficiently to derive meaningful insights that bring business value to the organization.
With MapReduce, computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
There are two fundamental pieces of a MapReduce query:
"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may repeat this in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them to produce the output - the answer to the problem it was originally trying to solve.
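The two steps above can be sketched in a few lines of Python. This is an illustration of the split/distribute/combine flow, not any real framework's API; the chunking scheme and the partial-sum map function are arbitrary choices for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def map_step(chunk):
    # Worker node: solve the smaller sub-problem (here, a partial sum).
    return sum(chunk)

def reduce_step(partial_results):
    # Master node: combine the sub-answers into the final answer.
    return sum(partial_results)

def map_reduce(data, n_workers=4):
    # Master node: chop the input into smaller sub-problems...
    chunk_size = max(1, len(data) // n_workers)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # ...distribute them to workers (a thread pool stands in for a cluster)...
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_step, chunks))
    # ...then combine the partial answers.
    return reduce_step(partials)

print(map_reduce(list(range(1, 101))))  # 5050, same as summing directly
```

A real framework adds what this sketch omits: moving the code to where the data lives, retrying failed workers, and shuffling intermediate results between the two phases.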
The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(word, AsString(result));
Here, each document is split into words, and each word is initially counted with a "1" value by the Map function, using the word as the result key. The framework groups all the pairs that share the same key and feeds them to the same call to Reduce, so that function simply sums all of its input values to find the total number of appearances of that word.
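The same word-count example can be made runnable in Python. The `shuffle` step below stands in for the grouping the framework performs between Map and Reduce; all function names are ours, not part of any library:

```python
from collections import defaultdict

def map_fn(name, document):
    # name: document name; document: document contents.
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # The framework's job: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, partial_counts):
    # Sum the partial counts to get the word's total appearances.
    return word, sum(partial_counts)

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
pairs = [p for name, text in docs.items() for p in map_fn(name, text)]
counts = dict(reduce_fn(w, pcs) for w, pcs in shuffle(pairs).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because each Map call touches only one document and each Reduce call only one word's counts, both phases parallelize trivially across machines.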
An interesting implementation of MapReduce can be found in Hadoop, which uses HDFS, its distributed file system, to store unstructured data and leverages the power of MapReduce to parallelize the processing of that data. Hadoop is ideally suited for non-time-sensitive batch jobs involving large-scale datasets.
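One way to run the word count on Hadoop is Hadoop Streaming, which pipes lines of text through a mapper and a reducer over stdin/stdout; the reducer's input arrives sorted by key, so equal words are adjacent. A sketch of both scripts, written as testable functions, might look like this (the local `sorted()` call only simulates the sort/shuffle that Hadoop performs between the phases):

```python
import sys

def mapper(lines):
    # Streaming mapper: emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming reducer: keys arrive sorted, so accumulate a running
    # total and emit it whenever the key changes.
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Locally simulate the job; in a real run, Hadoop sorts the mapper
    # output and feeds it to the reducer on separate machines.
    for out in reducer(sorted(mapper(sys.stdin))):
        print(out)
```

In a real deployment the two phases would be split into separate mapper and reducer scripts and submitted with the Hadoop Streaming jar; the exact invocation depends on the cluster setup.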
Another very practical implementation called SQL-MapReduce (SQL-MR), created by Aster Data, allows developers to write powerful and highly expressive SQL-MR functions in languages such as Java, C#, Python, C++, and R and push them into the database. These functions can then be invoked using standard SQL through Aster Data's nCluster data-application server to enable ultra-fast, deep analysis of massive data sets.
SQL-MapReduce functions are simple to write and are seamlessly integrated within SQL statements. (For more details, see Aster Data's tutorial on writing with SQL-MapReduce.) They rely on SQL queries to manipulate the underlying data and provide input. The functions can procedurally manipulate such input data and provide outputs that can be further consumed by SQL queries or written into tables within the database.