Spark Training | BigData training in Chennai

Spark
Unified Analytics Engine for Big Data

About Spark

Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

  • Run workloads up to 100x faster than Hadoop MapReduce for certain in-memory workloads.
  • Write applications quickly in Java, Scala, Python, R, and SQL.
  • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

Spark Topics

The following topics are covered under Spark.

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

Security

Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.

Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory.
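For example, from the top-level directory of a standard Spark distribution (SparkPi is one of the bundled example programs; the commands below are a sketch and assume a local install):

```shell
# Run the bundled SparkPi example; the trailing argument
# is the number of partitions to use.
./bin/run-example SparkPi 10

# Start an interactive Scala shell with 2 local worker threads.
./bin/spark-shell --master "local[2]"

# The Python equivalent of the interactive shell.
./bin/pyspark --master "local[2]"
```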

Launching on a Cluster

The Spark cluster mode overview explains the key concepts in running on a cluster. Spark can run by itself or on several existing cluster managers.
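A typical submission to a standalone cluster looks like the sketch below; the master host, memory setting, and jar version are placeholders to adapt to your own install:

```shell
# Submit the bundled SparkPi example to a standalone cluster.
# spark://master-host:7077 and the 2.12-3.5.0 jar version are
# placeholders; substitute the values from your installation.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```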

Configuration

Spark can be customized through its configuration system.
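As a sketch, cluster-wide defaults can be set in conf/spark-defaults.conf; the property names below are standard Spark configuration keys, but the values are illustrative only:

```
# conf/spark-defaults.conf -- example values only; tune for your cluster.
spark.master                     spark://master-host:7077
spark.executor.memory            4g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled           true
```

The same properties can also be set programmatically through SparkConf, or per-job with spark-submit's --conf flag.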

Structured Streaming

Structured Streaming processes structured data streams with relational queries, using Datasets and DataFrames; it is a newer API than the DStream-based Spark Streaming.
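Both of Spark's streaming APIs process data as a sequence of small micro-batches (Structured Streaming's default trigger, and DStreams always). A toy pure-Python sketch of that model, illustrative only and not Spark code (the function names here are invented):

```python
# Toy illustration of micro-batch stream processing (NOT Spark code):
# events arrive continuously; the engine groups them into small batches
# and runs the same relational-style query on each batch.

def micro_batches(events, batch_size):
    """Split an event stream into fixed-size micro-batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def count_by_key(batch):
    """The per-batch 'query': count events per key."""
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts

stream = ["error", "info", "info", "error", "warn", "info"]
results = [count_by_key(b) for b in micro_batches(stream, batch_size=3)]
# Each micro-batch produces its own result table:
# [{'error': 1, 'info': 2}, {'error': 1, 'warn': 1, 'info': 1}]
```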

Course Contents

The following are the course contents offered for Spark.

              • Starting an HDP 3.x Cluster
              • Introduction to the Hadoop Distributed File System (HDFS)
              • Demonstration: Understanding Block Storage
              • Using HDFS Commands
              • Big Data
              • Big Data and HDP
              • Environment Setup
               • Installing HDP
               • Managing Ambari Users and Groups
               • Managing Cluster Nodes
               • Adding, Decommissioning, and Recommissioning Worker Nodes
              • Components of Spark
              • Downloading and setup (with Hands-On Exercise)
              • Core Spark - Driver Program & SparkContext, worker nodes, Executor, tasks
              • Spark standalone application (with Hands-On Exercise)
               • Spark vs. Hadoop
              • Scala API
              • Python API
              • Scala Introduction
              • Scala Programming
              • Scala Intro
              • Why Scala
               • Installation (with Hands-On Exercise)
              • Sample program
              • Scala execution workflow
              • Data types
              • First Scala program
              • Values and variables
              • Singleton Object
              • Functions
              • Classes & Objects
              • Constructors
              • Access modifiers
              • Control structures
              • If else
              • Loops
               • Arrays (with Hands-On Exercise)
              • File I/O
              • Database connectivity using JDBC
              • Use case 1
              • Maven
               • SBT (with Hands-On Exercise)
               • Intro to Spark (with Hands-On Exercise)
              • Installation of Spark
              • Hardware requirements
              • Software requirements
              • Configuring and running the Spark cluster
              • Your first Spark program
              • Coding Spark jobs in Scala
              • Tools and utilities for administrators/developers
              • Scaling out the cluster
              • Batch versus real-time data processing
              • Batch processing
              • Real-time data processing
              • Architecture of Spark
              • Architecture of Spark Streaming
              • Cluster components
              • Memory configuration & management
              • Intro to RDD
              • Partitions
              • Immutability & Lineage
              • Types of RDD
              • Operations on RDD
              • DataFrame and SparkSQL operations
              • RDD intro
              • creating RDDs (with Hands-On Exercise)
               • RDD operations (with Hands-On Exercise)
              • Data types
              • Transformations and functions (with Hands-On Exercise)
              • Caching (with Hands-On Exercise)
              • Loading and saving your data (with Hands-On Exercise)
              • Input sources
              • Output operations (with Hands-On Exercise)
              • loading from S3 (with Hands-On Exercise)
              • loading from HDFS
               • Hadoop Configuration for Spark
               • Spark SQL
              • Aggregations
              • Databases - HBASE
               • Hands-On Exercise
              • Spark packaging structure and client APIs
              • Spark Core
              • SparkContext and Spark Config
              • RDD – APIs
              • Other Spark Core packages
              • Spark libraries and extensions
              • Spark Streaming
              • Spark MLlib
              • Spark SQL
              • Spark GraphX
              • Resilient distributed datasets and discretized streams
              • Resilient distributed datasets
              • Motivation behind RDD
              • Fault tolerance
              • Transformations and actions
              • RDD storage
              • RDD persistence
              • Shuffling in RDD
              • Discretized streams
              • Data loading from distributed and varied sources
              • Accumulators
              • Broadcast variables
               • Numeric RDD operations
              • Spark runtime architecture
              • Deploying applications
              • Packaging code with dependencies
              • Scheduling
              • Cluster managers
               • Hands-On Exercise
              • Setting Up Spark Cluster
              • Configuring Spark with SparkConf
               • Components of execution - Jobs
              • Finding information
              • Understanding the Structure of Data and the Need of Spark SQL
              • Anatomy of Spark SQL
              • DataFrame Programming
              • Understanding Aggregations and Multi-Datasource Joining with SparkSQL
              • Introducing Datasets and Understanding Data Catalogs
              • Getting Started with the SparkSession (or HiveContext or SQLContext)
              • Spark SQL Dependencies
              • Basics of Schemas
              • DataFrame API
              • Transformations
              • Multi-DataFrame Transformations
              • Plain Old SQL Queries and Interacting with Hive Data
              • Data Representation in DataFrames and Datasets
              • Data Loading and Saving Functions
              • DataFrameWriter and DataFrameReader
              • Formats
              • Save Modes
              • Partitions (Discovery and Writing)
              • Datasets
               • Extending with User-Defined Functions and Aggregate Functions (UDFs and UDAFs)
              • Query Optimizer
              • Debugging Spark SQL Queries
              • JDBC/ODBC Server
              • Spark Stream Processing
              • Data Stream Processing and Micro Batch Data Processing
              • A Log Event Processor
              • Windowed Data Processing and More Processing Options
              • Kafka Stream Processing
              • Spark Streaming Jobs in Production
              • Metrics and Debugging
              • Spark WebUI
              • Monitoring Spark jobs
               • Evaluating Spark jobs
               • Memory consumption and resource allocation
               • Job metrics
               • Monitoring tools for Spark
               • Debugging & troubleshooting Spark jobs
              • Understanding Machine Learning and the Need of Spark for it
              • Wine Quality Prediction and Model Persistence
              • Wine Classification
              • Spam Filtering
              • Feature Algorithms and Finding Synonyms
              • The Need for Spark and the Basics of the R Language
              • DataFrames in R and Spark
              • Spark DataFrame Programming with R
               • Understanding Aggregations and Multi-Datasource Joins in SparkR
              • Charting and Plotting Libraries and Setting Up a Dataset
              • Charts
              • Bar Chart and Pie Chart
              • Scatter Plot and Line Graph
              • Designing Spark Applications
              • Lambda Architecture
              • Estimating cluster resource requirements
               • Hands-On Use Cases / PoCs on the Spark Stack
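Several topics above (lazy transformations, lineage, transformations vs. actions) can be sketched without a cluster. The toy pure-Python analogy below is for teaching only; map, filter, and collect are real RDD method names, but the ToyRDD class and its internals are invented and are not how Spark is implemented:

```python
# Toy sketch of the lazy-evaluation and lineage ideas behind RDDs
# (a pure-Python teaching analogy -- NOT Spark's implementation).

class ToyRDD:
    def __init__(self, data=None, parent=None, op=None, op_name="source"):
        self.data = data          # only the source RDD holds real data
        self.parent = parent      # lineage: pointer to the parent RDD
        self.op = op              # the deferred transformation
        self.op_name = op_name

    def map(self, f):
        # Transformations are lazy: nothing is computed here,
        # only a new node in the lineage graph is recorded.
        return ToyRDD(parent=self, op=lambda xs: [f(x) for x in xs],
                      op_name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda xs: [x for x in xs if pred(x)],
                      op_name="filter")

    def lineage(self):
        # Walk back to the source, collecting the chain of operations.
        chain, node = [], self
        while node is not None:
            chain.append(node.op_name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        # Actions trigger computation by replaying the lineage,
        # which is also how lost partitions can be recomputed.
        if self.parent is None:
            return list(self.data)
        return self.op(self.parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20)
print(result.lineage())   # ['source', 'map', 'filter']
print(result.collect())   # [30, 40, 50]
```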

Download

Download Spark course plan
