Spark Training | BigData training in Chennai

Spark
Unified Analytics Engine for Big Data

About Spark

Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

  • Run workloads up to 100x faster than Hadoop MapReduce for certain in-memory workloads.
  • Write applications quickly in Java, Scala, Python, R, and SQL.
  • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

Spark Topics

The following topics are covered under Spark.

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

Security

Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.

Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory.
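For example, from the top-level directory of a standard Spark distribution (SparkPi is one of the bundled example programs; the commands below are a sketch and assume a local install):

```shell
# Run the bundled SparkPi example; the trailing argument
# is the number of partitions to use.
./bin/run-example SparkPi 10

# Start an interactive Scala shell with 2 local worker threads.
./bin/spark-shell --master "local[2]"

# The Python equivalent of the interactive shell.
./bin/pyspark --master "local[2]"
```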

Launching on a Cluster

The Spark cluster mode overview explains the key concepts in running on a cluster. Spark can run by itself or on several existing cluster managers.
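A typical submission to a standalone cluster looks like the sketch below; the master host, memory setting, and jar version are placeholders to adapt to your own install:

```shell
# Submit the bundled SparkPi example to a standalone cluster.
# spark://master-host:7077 and the 2.12-3.5.0 jar version are
# placeholders; substitute the values from your installation.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```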

Configuration

Spark can be customized through its configuration system.
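As a sketch, cluster-wide defaults can be set in conf/spark-defaults.conf; the property names below are standard Spark configuration keys, but the values are illustrative only:

```
# conf/spark-defaults.conf -- example values only; tune for your cluster.
spark.master                     spark://master-host:7077
spark.executor.memory            4g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled           true
```

The same properties can also be set programmatically through SparkConf, or per-job with spark-submit's --conf flag.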

Structured Streaming

Structured Streaming processes structured data streams with relational queries, using Datasets and DataFrames; it is a newer API than the DStream-based Spark Streaming.
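Both of Spark's streaming APIs process data as a sequence of small micro-batches (Structured Streaming's default trigger, and DStreams always). A toy pure-Python sketch of that model, illustrative only and not Spark code (the function names here are invented):

```python
# Toy illustration of micro-batch stream processing (NOT Spark code):
# events arrive continuously; the engine groups them into small batches
# and runs the same relational-style query on each batch.

def micro_batches(events, batch_size):
    """Split an event stream into fixed-size micro-batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def count_by_key(batch):
    """The per-batch 'query': count events per key."""
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts

stream = ["error", "info", "info", "error", "warn", "info"]
results = [count_by_key(b) for b in micro_batches(stream, batch_size=3)]
# Each micro-batch produces its own result table:
# [{'error': 1, 'info': 2}, {'error': 1, 'warn': 1, 'info': 1}]
```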

Course Contents

The following are the course contents offered for Spark.

              • Starting an HDP 3.x Cluster
              • Introduction to the Hadoop Distributed File System (HDFS)
              • Demonstration: Understanding Block Storage
              • Using HDFS Commands
              • Big Data
              • Big Data and HDP
              • Environment Setup
               • Installing HDP
               • Managing Ambari Users and Groups
               • Managing Cluster Nodes
               • Adding, Decommissioning, and Recommissioning Worker Nodes
              • Components of Spark
              • Downloading and setup (with Hands-On Exercise)
              • Core Spark - Driver Program & SparkContext, worker nodes, Executor, tasks
              • Spark standalone application (with Hands-On Exercise)
               • Spark vs. Hadoop
              • Scala API
              • Python API
              • Scala Introduction
              • Scala Programming
              • Scala Intro
              • Why Scala
               • Installation (with Hands-On Exercise)
              • Sample program
              • Scala execution workflow
              • Data types
              • First Scala program
              • Values and variables
              • Singleton Object
              • Functions
              • Classes & Objects
              • Constructors
              • Access modifiers
              • Control structures
              • If else
              • Loops
               • Arrays (with Hands-On Exercise)
              • File I/O
              • Database connectivity using JDBC
              • Use case 1
              • Maven
               • SBT (with Hands-On Exercise)
               • Intro to Spark (with Hands-On Exercise)
              • Installation of Spark
              • Hardware requirements
              • Software requirements
              • Configuring and running the Spark cluster
              • Your first Spark program
              • Coding Spark jobs in Scala
              • Tools and utilities for administrators/developers
              • Scaling out the cluster
              • Batch versus real-time data processing
              • Batch processing
              • Real-time data processing
              • Architecture of Spark
              • Architecture of Spark Streaming
              • Cluster components
              • Memory configuration & management
              • Intro to RDD
              • Partitions
              • Immutability & Lineage
              • Types of RDD
              • Operations on RDD
              • DataFrame and SparkSQL operations
              • RDD intro
              • creating RDDs (with Hands-On Exercise)
               • RDD operations (with Hands-On Exercise)
              • Data types
              • Transformations and functions (with Hands-On Exercise)
              • Caching (with Hands-On Exercise)
              • Loading and saving your data (with Hands-On Exercise)
              • Input sources
              • Output operations (with Hands-On Exercise)
              • loading from S3 (with Hands-On Exercise)
              • loading from HDFS
               • Hadoop Configuration for Spark
               • Spark SQL
              • Aggregations
              • Databases - HBASE
               • Hands-On Exercise
              • Spark packaging structure and client APIs
              • Spark Core
              • SparkContext and Spark Config
              • RDD – APIs
              • Other Spark Core packages
              • Spark libraries and extensions
              • Spark Streaming
              • Spark MLlib
              • Spark SQL
              • Spark GraphX
              • Resilient distributed datasets and discretized streams
              • Resilient distributed datasets
              • Motivation behind RDD
              • Fault tolerance
              • Transformations and actions
              • RDD storage
              • RDD persistence
              • Shuffling in RDD
              • Discretized streams
              • Data loading from distributed and varied sources
              • Accumulators
              • Broadcast variables
               • Numeric RDD operations
              • Spark runtime architecture
              • Deploying applications
              • Packaging code with dependencies
              • Scheduling
              • Cluster managers
               • Hands-On Exercise
              • Setting Up Spark Cluster
              • Configuring Spark with SparkConf
               • Components of execution - Jobs
              • Finding information
              • Understanding the Structure of Data and the Need of Spark SQL
              • Anatomy of Spark SQL
              • DataFrame Programming
              • Understanding Aggregations and Multi-Datasource Joining with SparkSQL
              • Introducing Datasets and Understanding Data Catalogs
              • Getting Started with the SparkSession (or HiveContext or SQLContext)
              • Spark SQL Dependencies
              • Basics of Schemas
              • DataFrame API
              • Transformations
              • Multi-DataFrame Transformations
              • Plain Old SQL Queries and Interacting with Hive Data
              • Data Representation in DataFrames and Datasets
              • Data Loading and Saving Functions
              • DataFrameWriter and DataFrameReader
              • Formats
              • Save Modes
              • Partitions (Discovery and Writing)
              • Datasets
               • Extending with User-Defined Functions and Aggregate Functions (UDFs and UDAFs)
              • Query Optimizer
              • Debugging Spark SQL Queries
              • JDBC/ODBC Server
              • Spark Stream Processing
              • Data Stream Processing and Micro Batch Data Processing
              • A Log Event Processor
              • Windowed Data Processing and More Processing Options
              • Kafka Stream Processing
              • Spark Streaming Jobs in Production
              • Metrics and Debugging
              • Spark WebUI
              • Monitoring Spark jobs
               • Evaluating Spark jobs
               • Memory consumption and resource allocation
               • Job metrics
               • Monitoring tools for Spark
               • Debugging & troubleshooting Spark jobs
              • Understanding Machine Learning and the Need of Spark for it
              • Wine Quality Prediction and Model Persistence
              • Wine Classification
              • Spam Filtering
              • Feature Algorithms and Finding Synonyms
              • The Need for Spark and the Basics of the R Language
              • DataFrames in R and Spark
              • Spark DataFrame Programming with R
               • Understanding Aggregations and Multi-Datasource Joins in SparkR
              • Charting and Plotting Libraries and Setting Up a Dataset
              • Charts
              • Bar Chart and Pie Chart
              • Scatter Plot and Line Graph
              • Designing Spark Applications
              • Lambda Architecture
              • Estimating cluster resource requirements
               • Hands-On Use Cases / PoCs on the Spark Stack
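Several topics above (lazy transformations, lineage, transformations vs. actions) can be sketched without a cluster. The toy pure-Python analogy below is for teaching only; map, filter, and collect are real RDD method names, but the ToyRDD class and its internals are invented and are not how Spark is implemented:

```python
# Toy sketch of the lazy-evaluation and lineage ideas behind RDDs
# (a pure-Python teaching analogy -- NOT Spark's implementation).

class ToyRDD:
    def __init__(self, data=None, parent=None, op=None, op_name="source"):
        self.data = data          # only the source RDD holds real data
        self.parent = parent      # lineage: pointer to the parent RDD
        self.op = op              # the deferred transformation
        self.op_name = op_name

    def map(self, f):
        # Transformations are lazy: nothing is computed here,
        # only a new node in the lineage graph is recorded.
        return ToyRDD(parent=self, op=lambda xs: [f(x) for x in xs],
                      op_name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda xs: [x for x in xs if pred(x)],
                      op_name="filter")

    def lineage(self):
        # Walk back to the source, collecting the chain of operations.
        chain, node = [], self
        while node is not None:
            chain.append(node.op_name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        # Actions trigger computation by replaying the lineage,
        # which is also how lost partitions can be recomputed.
        if self.parent is None:
            return list(self.data)
        return self.op(self.parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20)
print(result.lineage())   # ['source', 'map', 'filter']
print(result.collect())   # [30, 40, 50]
```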

Download

Download Spark course plan
