PolarSPARC

Apache Spark 2.x Quick Notes :: Part - 1


Bhaskar S (Updated: 09/29/2019)


Overview

Apache Spark is a general-purpose, high-performance, open-source, unified analytics engine for large-scale distributed processing of data across a cluster of commodity computers (also referred to as nodes).

The Spark stack provides support for batch processing, interactive querying using SQL, streaming, machine learning, and graph processing.

Spark can run on a variety of cluster managers such as the built-in standalone cluster manager, Hadoop YARN, Apache Mesos, or Kubernetes, or in a cloud environment such as AWS, Azure, or Google Cloud.

Spark can access data from a variety of sources such as the local filesystem, Hadoop HDFS, Apache Hive, Apache HBase, Apache Cassandra, etc.

Spark is accessible either through its APIs (Java, Scala, Python, and R) or through the provided interactive shells (Scala and Python).

The following diagram illustrates the components of the Spark stack:

[Figure: Spark Stack Components]

The Spark stack consists of the following components:

- Spark Core :: the base engine that provides distributed task scheduling, memory management, fault recovery, and the core data abstraction (the RDD)
- Spark SQL :: interactive querying of structured data using SQL
- Spark Streaming :: processing of live streams of data
- MLlib :: scalable machine learning library
- GraphX :: graph processing engine


Installation and Setup (Single Node)

The installation will be performed on an Ubuntu 18.04 based desktop.

Download a stable version of Spark (2.4.4 at the time of this article) from the project site at spark.apache.org.

We chose to download Spark version 2.4.4 (pre-built for Hadoop 2.7) for this setup.

The following diagram illustrates the download from the Apache Spark site spark.apache.org:

[Figure: Spark 2.4.4 Download]

Following are the steps to install and set up Spark on a single node cluster:
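The sequence can be sketched as follows (the paths are illustrative assumptions, not from the original article; adjust them to where the tarball was actually downloaded):

```shell
# Illustrative path to the downloaded tarball; adjust as needed
TARBALL=~/Downloads/spark-2.4.4-bin-hadoop2.7.tgz

# 1. Extract the downloaded tarball under the home directory
if [ -f "$TARBALL" ]; then
    tar -xzf "$TARBALL" -C "$HOME"
fi

# 2. Point SPARK_HOME at the extracted directory and prepend its bin
#    directory to the PATH (add these two lines to ~/.bashrc to persist)
export SPARK_HOME="$HOME/spark-2.4.4-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"

# 3. Verify the setup by checking the version reported by the Scala shell:
#      spark-shell --version
```

Note that Spark 2.4.x runs on Java 8, so a Java 8 runtime must be installed on the node beforehand.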

We have successfully completed the installation and the necessary setup on our single node Spark cluster.



© PolarSPARC