apache spark intro notes
20 Nov 2017
Apache Spark is a general-purpose cluster computing platform.
It is one of the most active and fastest-growing distributed data processing (a.k.a. big data) frameworks (as of 2017). There are a bunch of reasons for this, the most important being:
- A much more efficient programming model, a great improvement on/replacement for MapReduce: Spark evaluates lazily, you can cache intermediate datasets, and intermediate datasets are kept in memory. This gets rid of one of the frequent complaints about MapReduce's sub-optimal performance: the dataset at every intermediate stage is spilled to disk. (See the lazy-evaluation sketch after this list.)
- Simpler APIs: Spark uses abstractions like RDDs, DataFrames, and Datasets to hide a lot of the complexity of the underlying data processing. This makes programming much simpler. As a database developer, I feel this is the equivalent of using SQL compared to working with lists of tuples in Python. The Spark syntax is much more declarative because of these abstractions, and it gets better with each release.
- APIs in multiple languages: Spark has APIs in Python, Scala, and Java, so different groups can work with the framework, from data engineers, who typically use Java and are increasingly using Scala, to data scientists and analysts, who are increasingly using Python.
- A general engine that supports multiple applications/types of computation: Spark supports multiple clients/applications, from MapReduce-style computations for batch processing to SQL queries that are translated into distributed workloads, Spark Streaming for streaming and near-real-time applications, and Spark MLlib for running machine learning algorithms in a distributed fashion.
  - Batch processing
  - Interactive queries (SQL)
  - Streaming using Spark Streaming
  - MLlib (iterative machine learning on a cluster)
- Because everything is built on a common core (for example, RDDs underpin all of the above components), all use cases benefit from improvements to the core. With projects like Project Tungsten, these improvements reach every application with each new release.
{quote} For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one.
Karau, Holden; Konwinski, Andy; Wendell, Patrick; Zaharia, Matei. Learning Spark: Lightning-Fast Big Data Analysis (Kindle Locations 157-159). O’Reilly Media. Kindle Edition. {quote}
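To make the lazy-evaluation and caching point concrete, here is a minimal PySpark sketch; the log file name and the "ERROR" filter are invented for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lazy-eval-demo")

# Transformations are lazy: nothing is read or computed yet,
# Spark only records the lineage of operations.
lines = sc.textFile("app.log")                        # hypothetical input file
errors = lines.filter(lambda line: "ERROR" in line)

# Ask Spark to keep this intermediate dataset in memory once it is computed,
# so later stages don't have to rebuild it from disk.
errors.cache()

# Actions trigger execution; the second action reuses the cached RDD
# instead of re-reading and re-filtering the file.
print(errors.count())
print(errors.take(5))

sc.stop()
```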
It has something for everyone:
- Data engineers: rich APIs in Scala, Python, and Java; build pipelines faster.
- Data scientists: apply the same DataFrame concepts (common in R and Pandas) in a distributed fashion. One of the common complaints about R is performance with really large datasets.
- Machine learning: use the libraries you are familiar with, but run them on a large cluster (Spark does the heavy lifting).
- One significant productivity gain: Spark's various advantages make it easier to write iterative algorithms, and because all the components share a similar API, you can combine different programming models as well (a small DataFrame/SQL sketch follows).
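As a rough illustration of how declarative the DataFrame and SQL APIs feel, and how the two models mix in one program, here is a small sketch; the data and column names are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A tiny in-memory DataFrame; the columns are made up for the example.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# DataFrame API: declare what you want, Spark plans the distributed execution.
people.filter(people["age"] > 30).select("name").show()

# The same query expressed as SQL against the same data.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```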
The BDAS stack at https://amplab.cs.berkeley.edu/software/ is also pretty interesting (see the final picture there):
- Level 1: the different applications (SQL, MLlib, GraphX, streaming)
- Spark Core: RDDs, the programming model, and more
- Underlying support systems: cluster manager, scheduler, and more
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
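A minimal sketch of that idea: distribute a local collection as an RDD and operate on its partitions in parallel (the numbers and partition count are arbitrary choices for the example).

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

# Distribute a local Python collection across the cluster as an RDD,
# split into 8 partitions (arbitrary for this example).
numbers = sc.parallelize(range(1, 1001), 8)

# map runs on each partition in parallel; reduce combines the partial results.
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

sc.stop()
```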
Spark enables data scientists to tackle problems with larger data sizes than they could before with tools like R or Pandas.
Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs.
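For example, the same reader API covers local files, HDFS, and S3. The paths below are placeholders, not real locations, and S3 access additionally needs the hadoop-aws connector and credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# The same reader works against any Hadoop-compatible storage system;
# these paths are placeholders.
local_df = spark.read.csv("file:///tmp/data.csv", header=True)
hdfs_df = spark.read.parquet("hdfs://namenode:8020/warehouse/events")
s3_df = spark.read.json("s3a://my-bucket/logs/2017/11/")  # needs hadoop-aws configured

print(local_df.count(), hdfs_df.count(), s3_df.count())

spark.stop()
```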