Overview
This comprehensive course is designed to equip you with the essential skills and knowledge to use Databricks effectively for data engineering tasks. The course is structured into eight modules, each focusing on a key aspect of data engineering within the Databricks environment.
This course is ideal for data engineers, data scientists, and anyone looking to enhance their data engineering skills using Databricks. By the end of the course, you will have a solid understanding of how to leverage Databricks for data engineering tasks ranging from data ingestion and transformation to performance tuning and streaming data processing.
Audience profile
This comprehensive course is meticulously designed for a diverse group of professionals, including:
- Data engineers who want to deepen their understanding of the Databricks platform and efficiently process, manage, and analyze large datasets.
- Data scientists who want to leverage the power of Apache Spark for complex data analytics and machine learning tasks.
- Data analysts who aim to transform data into actionable insights using Databricks and Delta Lake.
- Software developers with an interest in data processing and analytics who want to integrate Databricks into their software solutions.
- IT professionals who manage big data infrastructure and are considering Databricks as a solution for their organization.
Course Syllabus
Module 1: Explore Azure Databricks
Provision an Azure Databricks workspace
Use the Azure Databricks portal
Azure Databricks workloads
Key concepts
Get to know Apache Spark
Create a Spark cluster
Use Spark in notebooks
Use Spark to work with data files and visualize data
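To give a flavor of the notebook experience this module covers, here is a minimal sketch of reading and visualizing a data file. It assumes a Databricks notebook, where `spark` (a SparkSession) and `display()` are provided by the runtime; the sample-dataset path may vary by workspace.

```python
# Read a CSV file with a header row, inferring column types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

df.printSchema()
display(df.limit(10))  # display() renders an interactive table/chart in the notebook
```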
Module 2: Data Ingestion and Extraction
Reading CSV data
Reading JSON data
Reading Parquet data
Parsing XML data
Managing complex data structures
Text manipulation and processing text data
Data persistence operations
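As a sketch of the ingestion patterns listed above (the file paths and the nested JSON structure are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# CSV with a header row and schema inference
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/orders.csv"))

# Newline-delimited JSON
json_df = spark.read.json("/mnt/raw/events.json")

# Parquet (the schema travels with the files)
parquet_df = spark.read.parquet("/mnt/raw/customers.parquet")

# XML requires an extra library, e.g. the spark-xml package:
# xml_df = spark.read.format("xml").option("rowTag", "record").load("/mnt/raw/feed.xml")

# Flatten a nested structure, then persist the result
flat_df = json_df.select("id", F.col("payload.user.name").alias("user_name"))
flat_df.write.mode("overwrite").parquet("/mnt/bronze/events_flat")
```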
Module 3: Data Transformation and Manipulation
Fundamental data transformations
Data filtering operations
Join operations
Aggregation operations
Window operations
User-Defined Functions (UDFs)
Managing and handling missing data
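A minimal sketch tying the module's transformations together, using small in-memory DataFrames with hypothetical columns:

```python
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", "US", 120.0), (2, "bob", "DE", None), (3, "alice", "US", 80.0)],
    ["order_id", "customer", "country", "amount"],
)
customers = spark.createDataFrame(
    [("alice", "Alice A."), ("bob", "Bob B.")], ["customer", "full_name"]
)

# Handle missing data, then filter
clean = orders.fillna({"amount": 0.0}).filter(F.col("country") == "US")

# Join and aggregate
joined = clean.join(customers, on="customer", how="left")
totals = joined.groupBy("customer").agg(F.sum("amount").alias("total_amount"))

# Window operation: rank each customer's orders by amount
w = Window.partitionBy("customer").orderBy(F.desc("amount"))
ranked = joined.withColumn("rank", F.row_number().over(w))

# A simple UDF (prefer built-in functions where possible; UDFs bypass Spark's optimizer)
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
ranked.withColumn("customer_uc", upper_udf("customer")).show()
```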
Module 4: Delta Lake Tables
Establishing a Delta Lake table
Retrieving data from a Delta Lake table
Modifying data within a Delta Lake table
Integrating data into Delta tables
Change data capture in Delta Lake
Version control in Delta Lake tables
Managing Delta Lake tables
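The Delta Lake operations above might look like the following sketch. It assumes a Databricks notebook (or a Spark session already configured for Delta Lake), and the storage path is hypothetical:

```python
from delta.tables import DeltaTable

path = "/mnt/silver/customers"  # hypothetical storage path

# Establish (or overwrite) a Delta table
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Retrieve data from the table
spark.read.format("delta").load(path).show()

# Integrate new and changed records with MERGE (upsert)
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Version control / time travel: query the table as of an earlier version
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```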
Module 5: Performance Tuning
Performance Tuning in Delta Lake
Enhancing query performance through table partitioning
Data organization via Z-ordering
Accelerating query speeds by bypassing unnecessary data
Minimizing table size using compression
Performance Tuning in Apache Spark
Monitor Spark jobs
Using broadcast variables
Reduce data shuffle
Mitigate data skew
Caching and data persistence
Partitioning and Repartitioning
Optimizing join operations
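A brief sketch of several of these techniques. It assumes a Databricks notebook; the table paths and column names are hypothetical, and `OPTIMIZE ... ZORDER BY` is a Delta maintenance command available on Databricks:

```python
from pyspark.sql import functions as F

events = spark.read.format("delta").load("/mnt/silver/events")
countries = spark.read.format("delta").load("/mnt/silver/dim_country")

# Broadcast the small dimension table so the join avoids shuffling the large side
joined = events.join(F.broadcast(countries), "country_code")

# Cache a DataFrame that several downstream actions will reuse
joined.cache()
joined.count()  # the first action materializes the cache

# Repartition on the key used downstream to reduce shuffle and spread skewed keys
repartitioned = events.repartition(200, "country_code")

# Delta maintenance: compact small files and Z-order for data skipping
spark.sql("OPTIMIZE delta.`/mnt/silver/events` ZORDER BY (event_date)")
```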
Module 6: Processing Streaming Data
Streaming Overview
Spark Structured Streaming
Checkpoints in Spark Structured Streaming
Triggers in Spark Structured Streaming
Window operations on event time
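A minimal Structured Streaming sketch showing a checkpoint, a trigger, and an event-time window. It uses Spark's built-in `rate` test source (a real job might read from Kafka or cloud files), and the checkpoint path is hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source generates test rows with a `timestamp` column
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Event-time window with a watermark to bound the state the engine keeps
counts = (stream
          .withWatermark("timestamp", "1 minute")
          .groupBy(F.window("timestamp", "30 seconds"))
          .count())

# Write with a checkpoint (for fault tolerance) and a processing-time trigger
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/rate-demo")  # hypothetical path
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination(30)  # run briefly for the demo
query.stop()
```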
Module 7: Databricks Workflows and Pipelines
Data Workflows in Databricks
Constructing workflows in Databricks
Executing and overseeing Databricks workflows
Job clusters vs. all-purpose clusters
Task and job parameters in Databricks workflows
Conditional branching within Databricks workflows
Initiating jobs upon file arrival
Establishing alerts and notifications for workflows
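As a sketch of creating a multi-task workflow programmatically, the following uses the Databricks SDK for Python. It assumes the `databricks-sdk` package is installed and authentication is configured; the notebook paths and cluster ID are hypothetical, and field names may differ slightly across SDK versions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment or .databrickscfg

job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/etl/ingest",  # hypothetical notebook
                base_parameters={"run_date": "{{job.parameters.run_date}}"},
            ),
            existing_cluster_id="0101-000000-abcdef",  # hypothetical cluster ID
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],  # runs after ingest
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/transform"),
            existing_cluster_id="0101-000000-abcdef",
        ),
    ],
)
print(f"Created job {job.job_id}")
```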
Data Pipelines with Delta Live Tables
Delta Live Tables
Delta tables vs. Delta Live Tables
Delta Live Tables datasets
Building Pipelines
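A minimal Delta Live Tables sketch with a bronze and a silver table plus a data-quality expectation. DLT code runs inside a pipeline rather than a regular notebook, and the source path and table names are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders landed from cloud storage")
def orders_bronze():
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # rows failing the expectation are dropped
def orders_silver():
    return dlt.read("orders_bronze").withColumn("ingested_at", F.current_timestamp())
```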
Module 8: Unity Catalog and Git Integration
Enabling and using Unity Catalog
What is Unity Catalog?
Unity Catalog Main Administrative Roles
Unity Catalog Objects Model
Enabling Unity Catalog on Azure Databricks
Enabling and using Git integration
Manage source files using Git folders
Storing code in Git through Databricks Repos
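Unity Catalog organizes data in a three-level namespace (catalog.schema.table). A minimal sketch, assuming a UC-enabled Databricks workspace and sufficient privileges (all names are hypothetical):

```python
# Create the three-level hierarchy, then grant access to a group
spark.sql("CREATE CATALOG IF NOT EXISTS main_demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS main_demo.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main_demo.sales.orders (
        order_id BIGINT,
        amount DOUBLE
    )
""")
spark.sql("GRANT SELECT ON TABLE main_demo.sales.orders TO `data_analysts`")
```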
Ready to learn?
Our experts are ready to deliver this course and help you enhance your capabilities.
Get a quote