Overview
This comprehensive course is designed to equip you with the essential skills and knowledge to use Databricks effectively for data engineering tasks. The course is structured into eight modules, each focusing on a key aspect of data engineering within the Databricks environment.
This course is ideal for data engineers, data scientists, and anyone looking to enhance their data engineering skills using Databricks. By the end of the course, you will have a solid understanding of how to leverage Databricks for data engineering tasks ranging from data ingestion and transformation to performance tuning and streaming data processing.
Audience profile
This comprehensive course is meticulously designed for a diverse group of professionals, including:
- Data engineers who want to deepen their understanding of the Databricks platform and efficiently process, manage, and analyze large datasets.
- Data scientists who want to leverage the power of Apache Spark for complex data analytics and machine learning tasks.
- Data analysts who aim to transform data into actionable insights using Databricks and Delta Lake.
- Software developers with an interest in data processing and analytics who want to integrate Databricks into their software solutions.
- IT professionals who manage big data infrastructure and are considering Databricks as a solution for their organization.
Course Syllabus
Module 1: Explore Azure Databricks
Provision an Azure Databricks workspace
Use the Azure Databricks portal
Azure Databricks workloads
Key concepts
Get to know Apache Spark
Create a Spark cluster
Use Spark in notebooks
Use Spark to work with data files and visualize data
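To give a flavor of the notebook experience this module covers, here is a minimal sketch of reading and visualizing a data file. It assumes a Databricks notebook, where `spark` (a SparkSession) and `display()` are provided by the runtime; the sample-dataset path may vary by workspace.

```python
# Read a CSV file with a header row, inferring column types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

df.printSchema()
display(df.limit(10))  # display() renders an interactive table/chart in the notebook
```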
Module 2: Data Ingestion and Extraction
Reading CSV data
Reading JSON data
Reading Parquet data
Parsing XML data
Managing complex data structures
Text manipulation and processing text data
Data persistence operations
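As a sketch of the ingestion patterns listed above (the file paths and the nested JSON structure are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# CSV with a header row and schema inference
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/orders.csv"))

# Newline-delimited JSON
json_df = spark.read.json("/mnt/raw/events.json")

# Parquet (the schema travels with the files)
parquet_df = spark.read.parquet("/mnt/raw/customers.parquet")

# XML requires an extra library, e.g. the spark-xml package:
# xml_df = spark.read.format("xml").option("rowTag", "record").load("/mnt/raw/feed.xml")

# Flatten a nested structure, then persist the result
flat_df = json_df.select("id", F.col("payload.user.name").alias("user_name"))
flat_df.write.mode("overwrite").parquet("/mnt/bronze/events_flat")
```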
Module 3: Data Transformation and Manipulation
Fundamental data transformations
Data filtering operations
Join operations
Aggregation operations
Window operations
User-Defined Functions (UDFs)
Managing and handling missing data
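A minimal sketch tying the module's transformations together, using small in-memory DataFrames with hypothetical columns:

```python
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", "US", 120.0), (2, "bob", "DE", None), (3, "alice", "US", 80.0)],
    ["order_id", "customer", "country", "amount"],
)
customers = spark.createDataFrame(
    [("alice", "Alice A."), ("bob", "Bob B.")], ["customer", "full_name"]
)

# Handle missing data, then filter
clean = orders.fillna({"amount": 0.0}).filter(F.col("country") == "US")

# Join and aggregate
joined = clean.join(customers, on="customer", how="left")
totals = joined.groupBy("customer").agg(F.sum("amount").alias("total_amount"))

# Window operation: rank each customer's orders by amount
w = Window.partitionBy("customer").orderBy(F.desc("amount"))
ranked = joined.withColumn("rank", F.row_number().over(w))

# A simple UDF (prefer built-in functions where possible; UDFs bypass Spark's optimizer)
upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
ranked.withColumn("customer_uc", upper_udf("customer")).show()
```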
Module 4: Delta Lake Tables
Establishing a Delta Lake table
Retrieving data from a Delta Lake table
Modifying data within a Delta Lake table
Integrating data into Delta tables
Change data capture in Delta Lake
Version control in Delta Lake tables
Managing Delta Lake tables
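The Delta Lake operations above might look like the following sketch. It assumes a Databricks notebook (or a Spark session already configured for Delta Lake), and the storage path is hypothetical:

```python
from delta.tables import DeltaTable

path = "/mnt/silver/customers"  # hypothetical storage path

# Establish (or overwrite) a Delta table
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Retrieve data from the table
spark.read.format("delta").load(path).show()

# Integrate new and changed records with MERGE (upsert)
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Version control / time travel: query the table as of an earlier version
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```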
Module 5: Performance Tuning
Performance Tuning in Delta Lake
Enhancing query performance through table partitioning
Data organization via Z-ordering
Accelerating query speeds by bypassing unnecessary data
Minimizing table size using compression
Performance Tuning in Apache Spark
Monitor Spark jobs
Using broadcast variables
Reduce data shuffle
Mitigate data skew
Caching and data persistence
Partitioning and Repartitioning
Optimizing join operations
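A brief sketch of several of these techniques. It assumes a Databricks notebook; the table paths and column names are hypothetical, and `OPTIMIZE ... ZORDER BY` is a Delta maintenance command available on Databricks:

```python
from pyspark.sql import functions as F

events = spark.read.format("delta").load("/mnt/silver/events")
countries = spark.read.format("delta").load("/mnt/silver/dim_country")

# Broadcast the small dimension table so the join avoids shuffling the large side
joined = events.join(F.broadcast(countries), "country_code")

# Cache a DataFrame that several downstream actions will reuse
joined.cache()
joined.count()  # the first action materializes the cache

# Repartition on the key used downstream to reduce shuffle and spread skewed keys
repartitioned = events.repartition(200, "country_code")

# Delta maintenance: compact small files and Z-order for data skipping
spark.sql("OPTIMIZE delta.`/mnt/silver/events` ZORDER BY (event_date)")
```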
Module 6: Processing Streaming Data
Streaming Overview
Spark Structured Streaming
Checkpoints in Spark Structured Streaming
Triggers in Spark Structured Streaming
Window operations on event time
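A minimal Structured Streaming sketch showing a checkpoint, a trigger, and an event-time window. It uses Spark's built-in `rate` test source (a real job might read from Kafka or cloud files), and the checkpoint path is hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source generates test rows with a `timestamp` column
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Event-time window with a watermark to bound the state the engine keeps
counts = (stream
          .withWatermark("timestamp", "1 minute")
          .groupBy(F.window("timestamp", "30 seconds"))
          .count())

# Write with a checkpoint (for fault tolerance) and a processing-time trigger
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/rate-demo")  # hypothetical path
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination(30)  # run briefly for the demo
query.stop()
```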
Module 7: Databricks Workflows and Pipelines
Data Workflows in Databricks
Constructing workflows in Databricks
Executing and overseeing Databricks workflows
Job clusters vs. all-purpose clusters
Task and job parameters in Databricks workflows
Conditional branching within Databricks workflows
Initiating jobs upon file arrival
Establishing alerts and notifications for workflows
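As a sketch of creating a multi-task workflow programmatically, the following uses the Databricks SDK for Python. It assumes the `databricks-sdk` package is installed and authentication is configured; the notebook paths and cluster ID are hypothetical, and field names may differ slightly across SDK versions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment or .databrickscfg

job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/etl/ingest",  # hypothetical notebook
                base_parameters={"run_date": "{{job.parameters.run_date}}"},
            ),
            existing_cluster_id="0101-000000-abcdef",  # hypothetical cluster ID
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],  # runs after ingest
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/transform"),
            existing_cluster_id="0101-000000-abcdef",
        ),
    ],
)
print(f"Created job {job.job_id}")
```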
Data Pipelines with Delta Live Tables
Delta Live Tables
Delta tables vs. Delta Live Tables
Delta Live Tables datasets
Building Pipelines
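A minimal Delta Live Tables sketch with a bronze and a silver table plus a data-quality expectation. DLT code runs inside a pipeline rather than a regular notebook, and the source path and table names are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders landed from cloud storage")
def orders_bronze():
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # rows failing the expectation are dropped
def orders_silver():
    return dlt.read("orders_bronze").withColumn("ingested_at", F.current_timestamp())
```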
Module 8: Unity Catalog and Git Integration
Enabling and using Unity Catalog
What is Unity Catalog?
Unity Catalog Main Administrative Roles
Unity Catalog Objects Model
Enabling Unity Catalog on Azure Databricks
Enabling and using Git integration
Manage source files using Git folders
Storing code in Git through Databricks Repos
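Unity Catalog organizes data in a three-level namespace (catalog.schema.table). A minimal sketch, assuming a UC-enabled Databricks workspace and sufficient privileges (all names are hypothetical):

```python
# Create the three-level hierarchy, then grant access to a group
spark.sql("CREATE CATALOG IF NOT EXISTS main_demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS main_demo.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main_demo.sales.orders (
        order_id BIGINT,
        amount DOUBLE
    )
""")
spark.sql("GRANT SELECT ON TABLE main_demo.sales.orders TO `data_analysts`")
```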
Ready to learn?
Our experts are ready to deliver this course and help you enhance your capabilities.
Get a quote