Pre-Summer Sale Discount Flat 70% Offer - Ends in 0d 00h 00m 00s - Coupon code: 70diswrap

Databricks Databricks-Certified-Data-Engineer-Associate Dumps

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

Options:

A.

Checkpointing and Write-ahead Logs

B.

Structured Streaming cannot record the offset range of the data being processed in each trigger.

C.

Replayable Sources and Idempotent Sinks

D.

Write-ahead Logs and Idempotent Sinks

E.

Checkpointing and Idempotent Sinks

Question 2

A data engineer is configuring Unity Catalog in Databricks and needs to assign a role to a user who should have the ability to grant and revoke privileges on various data objects within a specific schema but should not have read/write access over the schema or its objects.

Which role should the data engineer assign to this user?

Options:

A.

USE CATALOG / USE SCHEMA privilege on the schema

B.

Catalog Owner

C.

Table Owner

D.

Schema Owner

Question 3

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.

They have the following incomplete code block:

____(f " SELECT customer_id, spend FROM {table_name} " )

Which of the following can be used to fill in the blank to successfully complete the task?

Options:

A.

spark.delta.sql

B.

spark.delta.table

C.

spark.table

D.

dbutils.sql

E.

spark.sql

Question 4

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

Options:

A.

They can use endpoints available in Databricks SQL

B.

They can use jobs clusters instead of all-purpose clusters

C.

They can configure the clusters to be single-node

D.

They can use clusters that are from a cluster pool

E.

They can configure the clusters to autoscale for larger data sizes

Question 5

A data engineer is working in a Python notebook on Databricks to process data, but notices that the output is not as expected. The data engineer wants to investigate the issue by stepping through the code and checking the values of certain variables during execution.

Which tool should the data engineer use to inspect the code execution and variables in real-time?

Options:

A.

Python Notebook Interactive Debugger

B.

Cluster Logs

C.

SQL Analytics

D.

Job Execution Dashboard

Question 6

A data engineer needs to parse only png files in a directory that contains files with different suffixes. Which code should the data engineer use to achieve this task?

A)

as

B)

as

C)

as

D)

as

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question 7

Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?

Options:

A.

Cloud-specific integrations

B.

Simplified governance

C.

Ability to scale storage

D.

Ability to scale workloads

E.

Avoiding vendor lock-in

Question 8

A data engineer works for an organization that must meet a stringent Service Level Agreement (SLA) that demands minimal runtime errors and high availability for its data processing pipelines. The data engineer wants to avoid the operational overhead of managing and tuning clusters.

Which architectural solution will meet the requirements?

Options:

A.

Implement a hybrid approach with scheduled batch jobs on custom cloud VMs.

B.

Use an auto-scaling cluster configured and monitored by the user.

C.

Utilize Databricks serverless compute that automatically optimizes resources and abstracts cluster management.

D.

Deploy a dedicated, manually managed cluster optimized by in-house IT staff.

Question 9

Which Databricks Asset Bundle format is valid?

Options:

A.

resources:

jobs:

hello-job:

name: hello-job

tasks:

- task_key: hello-task

existing_cluster_id: 1234-567890-abcde123

notebook_task:

notebook_path: ./hello.py

B.

{

" resources " : {

" jobs " : {

" name " : " hello-job " ,

" tasks " : {

" task_key " : " hello-task " ,

" existing_cluster_id " : " 1234-567890-abcde123 " ,

" notebook_task " : {

" notebook_path " : " ./hello.py "

}

}

}

}

}

C.

configuration = {

" resources " : {

" jobs " : {

" name " : " hello-job " ,

" tasks " : {

" task_key " : " hello-task " ,

" existing_cluster_id " : " 1234-567890-abcde123 " ,

" notebook_task " : {

" notebook_path " : " ./hello.py "

}

}

}

}

}

D.

resources {

jobs {

name = " hello-job "

tasks {

task_key = " hello-task "

existing_cluster_id = " 1234-567890-abcde123 "

notebook_task {

notebook_path = " ./hello.py "

}

}

}

}

Question 10

An organization plans to share a large dataset stored in a Databricks workspace on AWS with a partner organization whose Databricks workspace is hosted on Azure. The data engineer wants to minimize data transfer costs while ensuring secure and efficient data sharing.

Which strategy will reduce data egress costs associated with cross-cloud data sharing?

Options:

A.

Sharing data via pre-signed URLs without monitoring egress costs

B.

Migrating the dataset to Cloudflare R2 object storage before sharing

C.

Configure VPN connection between AWS and Azure for faster data sharing

D.

Using Delta Sharing without any additional configurations

Question 11

A data engineer is developing a small proof of concept in a notebook. When running the entire notebook, cluster usage spikes. The data engineer wants to keep the development experience and get real-time results.

Which cluster meets these requirements?

Options:

A.

All-Purpose Cluster with a large fixed memory size

B.

All-Purpose Cluster with autoscaling

C.

Job Cluster with autoscaling enabled

D.

Job Cluster with Photon enabled and autoscaling

Question 12

What are the transformations typically included in building the Bronze layer ?

Options:

A.

Perform extensive data cleansing

B.

Aggregate data from multiple sources

C.

Business rules and transformations

D.

Include columns Load date/time, process ID

Question 13

A data engineer that is new to using Python needs to create a Python function to add two integers together and return the sum?

Which of the following code blocks can the data engineer use to complete this task?

A)

as

B)

as

C)

as

D)

as

E)

as

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Question 14

A data engineer is inspecting an ETL pipeline based on a Pyspark job that consistently encounters performance bottlenecks. Based on developer feedback, the data engineer assumes the job is low on compute resources. To pinpoint the issue, the data engineer observes the Spark Ul and finds out the job has a high CPU time vs Task time.

Which course of action should the data engineer take?

Options:

A.

High CPU time vs Task time means an under-utilized cluster. The data engineer may need to repartition data to spread the jobs more evenly throughout the cluster.

B.

High CPU time vs Task time means efficient use of cluster and no change needed

C.

High CPU time vs Task time means over-utilized memory and the need to increase parallelism

D.

High CPU time vs Task time means a CPU over-utilized job. The data engineer may need to consider executor and core tuning or resizing the cluster

Question 15

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

as

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

A.

None of these lines of code are needed to successfully complete the task

B.

USING CSV

C.

FROM CSV

D.

USING DELTA

E.

FROM " path/to/csv "

Question 16

Which of the following describes the storage organization of a Delta table?

Options:

A.

Delta tables are stored in a single file that contains data, history, metadata, and other attributes.

B.

Delta tables store their data in a single file and all metadata in a collection of files in a separate location.

C.

Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.

D.

Delta tables are stored in a collection of files that contain only the data stored within the table.

E.

Delta tables are stored in a single file that contains only the data stored within the table.

Question 17

In which of the following file formats is data from Delta Lake tables primarily stored?

Options:

A.

Delta

B.

CSV

C.

Parquet

D.

JSON

E.

A proprietary, optimized format specific to Databricks

Question 18

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

as

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Options:

A.

Replace predict with a stream-friendly prediction function

B.

Replace schema(schema) with option ( " maxFilesPerTrigger " , 1)

C.

Replace " transactions " with the path to the location of the Delta table

D.

Replace format( " delta " ) with format( " stream " )

E.

Replace spark.read with spark.readStream

Question 19

A data engineer is maintaining an ETL pipeline code with a GitHub repository linked to their Databricks account. The data engineer wants to deploy the ETL pipeline to production as a databricks workflow.

Which approach should the data engineer use?

Options:

A.

Databricks Asset Bundles (DAB) + GitHub Integration

B.

Maintain workflow_config.j son and deploy it using Databricks CLI

C.

Manually create and manage the workflow in Ul

D.

Maintain workflow_conf ig. json and deploy it using Terraform

Question 20

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Options:

A.

Databricks Filesystem

B.

Jobs

C.

Dashboards

D.

Repos

E.

Data Explorer

Question 21

Which of the following data lakehouse features results in improved data quality over a traditional data lake?

Options:

A.

A data lakehouse provides storage solutions for structured and unstructured data.

B.

A data lakehouse supports ACID-compliant transactions.

C.

A data lakehouse allows the use of SQL queries to examine data.

D.

A data lakehouse stores data in open formats.

E.

A data lakehouse enables machine learning and artificial Intelligence workloads.

Question 22

A data engineer is writing a script that is meant to ingest new data from cloud storage. In the event of the Schema change, the ingestion should fail. It should fail until the changes downstream source can be found and verified as intended changes.

Which command will meet the requirements?

Options:

A.

addNewColumns

B.

failOnNewColumns

C.

rescue

D.

none

Question 23

A data engineer is designing an ETL pipeline to process both streaming and batch data from multiple sources The pipeline must ensure data quality, handle schema evolution, and provide easy maintenance. The team is considering using Delta Live Tables (DLT) in Databricks to achieve these goals. They want to understand the key features and benefits of DLT that make it suitable for this use case.

Why is Delta Live Tables (DLT) an appropriate choice?

Options:

A.

Automatic data quality checks, built-in support for schema evolution, and declarative pipeline development

B.

Manual schema enforcement, high operational overhead, and limited scalability

C.

Requires custom code for data quality checks, no support for streaming data, and complex pipeline maintenance

D.

Supports only batch processing, no data versioning, and high infrastructure costs

Question 24

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to a data analytics dashboard for a retail use case. The job has a Databricks SQL query that returns the number of store-level records where sales is equal to zero. The data engineer wants their entire team to be notified via a messaging webhook whenever this value is greater than 0.

Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of stores with $0 in sales is greater than zero?

Options:

A.

They can set up an Alert with a custom template.

B.

They can set up an Alert with a new email alert destination.

C.

They can set up an Alert with one-time notifications.

D.

They can set up an Alert with a new webhook alert destination.

E.

They can set up an Alert without notifications.

Question 25

Which of the following is stored in the Databricks customer ' s cloud account?

Options:

A.

Databricks web application

B.

Cluster management metadata

C.

Repos

D.

Data

E.

Notebooks

Question 26

A data engineer is migrating pipeline tasks to reduce operational toil. The workspace uses Unity Catalog and is in a region that supports serverless. The engineer wants Databricks to auto-select instance types, manage scaling, apply Photon, and handle runtime upgrades automatically for job runs.

How should the data engineer meet this requirement while adhering to Databricks constraints?

Options:

A.

Use a Pro SQL warehouse and schedule Python notebook tasks to execute as pipeline steps.

B.

Use an all-purpose cluster with cluster policies to enforce standard sizes and enable autoscaling.

C.

Create a job with a single-task job cluster and manually set the instance families and minimum/maximum workers.

D.

Run the job on a serverless compute for workflows configuration, ensuring Unity Catalog is enabled and regional support is available.

Question 27

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

Options:

A.

They can schedule the query to refresh every 1 day from the SQL endpoint ' s page in Databricks SQL.

B.

They can schedule the query to refresh every 12 hours from the SQL endpoint ' s page in Databricks SQL.

C.

They can schedule the query to refresh every 1 day from the query ' s page in Databricks SQL.

D.

They can schedule the query to run every 1 day from the Jobs UI.

E.

They can schedule the query to run every 12 hours from the Jobs UI.

Question 28

A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

Options:

A.

pyspark.sql.types.DateType

B.

datetime

C.

pyspark.sql.types.TimestampType

D.

Cron syntax

E.

There is no way to represent and submit this information programmatically

Question 29

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

Options:

A.

They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.

B.

They can set up the dashboard’s SQL endpoint to be serverless.

C.

They can turn on the Auto Stop feature for the SQL endpoint.

D.

They can reduce the cluster size of the SQL endpoint.

E.

They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.

Question 30

A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.

Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

Options:

A.

It is not possible to use SQL in a Python notebook

B.

They can attach the cell to a SQL endpoint rather than a Databricks cluster

C.

They can simply write SQL syntax in the cell

D.

They can add %sql to the first line of the cell

E.

They can change the default language of the notebook to SQL

Question 31

A data engineer is working on a personal laptop and needs to perform complex transformations on data stored in a Delta Lake on cloud storage. The engineer decides to use Databricks Connect to interact with Databricks clusters and work in their local IDE.

How does Databricks Connect enable the engineer to develop, test, and debug code seamlessly on their local machine while interacting with Databricks clusters?

Options:

A.

By allowing direct execution of Spark jobs from the local machine without needing a network connection

B.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using a specific IDE that is required by Databricks

C.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using their preferred ide

D.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code only through Databricks ' own web interface

Question 32

A data engineer is setting up access control in Unity Catalog and needs to ensure that a group of data analysts can query tables but not modify data.

Which permission should the data engineer grant to the data analysts?

Options:

A.

SELECT

B.

INSERT

C.

MODIFY

D.

ALL PRIVILEGES

Question 33

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

Options:

A.

Silver tables contain a less refined, less clean view of data than Bronze data.

B.

Silver tables contain aggregates while Bronze data is unaggregated.

C.

Silver tables contain more data than Bronze tables.

D.

Silver tables contain a more refined and cleaner view of data than Bronze tables.

E.

Silver tables contain less data than Bronze tables.

Question 34

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which of the following approaches can be used to identify the owner of new_table?

Options:

A.

Review the Permissions tab in the table ' s page in Data Explorer

B.

All of these options can be used to identify the owner of the table

C.

Review the Owner field in the table ' s page in Data Explorer

D.

Review the Owner field in the table ' s page in the cloud storage solution

E.

There is no way to identify the owner of the table

Question 35

A data engineer needs to ingest from both streaming and batch sources for a firm that relies on highly accurate data. Occasionally, some of the data picked up by the sensors that provide a streaming input are outside the expected parameters. If this occurs, the data must be dropped, but the stream should not fail.

Which feature of Delta Live Tables meets this requirement?

Options:

A.

Monitoring

B.

Change Data Capture

C.

Expectations

D.

Error Handling

Question 36

Which of the following tools is used by Auto Loader process data incrementally?

Options:

A.

Checkpointing

B.

Spark Structured Streaming

C.

Data Explorer

D.

Unity Catalog

E.

Databricks SQL

Question 37

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.

Which of the following tools can the data engineer use to solve this problem?

Options:

A.

Unity Catalog

B.

Delta Lake

C.

Databricks SQL

D.

Data Explorer

E.

Auto Loader

Question 38

Which of the following commands will return the number of null values in the member_id column?

Options:

A.

SELECT count(member_id) FROM my_table;

B.

SELECT count(member_id) - count_null(member_id) FROM my_table;

C.

SELECT count_if(member_id IS NULL) FROM my_table;

D.

SELECT null(member_id) FROM my_table;

E.

SELECT count_null(member_id) FROM my_table;

Question 39

A data engineer needs to conduct Exploratory Analysis on data residing in a database that is within the company ' s custom-defined network in the cloud. The data engineer is using SQL for this task.

Which type of SQL Warehouse will enable the data engineer to process large numbers of queries quickly and cost-effectively?

Options:

A.

Serverless compute for notebooks

B.

Serverless SQL Warehouse

C.

Classic SQL Warehouse

D.

Pro SQL Warehouse

Question 40

A data engineer wants to delegate day-to-day permission management for the schema main.marketing to the mkt-admins group, without making them workspace admins. They should be able to grant and revoke privileges for other users on objects within that schema.

Which approach aligns with Unity Catalog’s ownership and privilege model?

Options:

A.

Transfer ownership of the schema main.marketing to mkt-admins ; owners can manage privileges on the schema and its contained objects.

B.

Grant MANAGE permissions on the metastore to mkt-admins , which allows managing privileges for all schemas and tables globally.

C.

Grant USE SCHEMA on main.marketing , and MODIFY on all tables to mkt-admins , which enables the management of grants within the schema.

D.

Make mkt-admins a workspace-level admins group, then assign SELECT on main.marketing to allow privilege delegation.

Question 41

A data engineer is getting a partner organization up to speed with Databricks account. Both teams share some business use cases. The data engineer has to share some of your Unity-Catalog managed delta tables and the notebook jobs creating those tables with the partner organization.

How can the data engineer seamlessly share the required information?

Options:

A.

Zip all the code and share via email and allow data ingestion from your data lake

B.

Data and Notebooks can be shared simply using Unity Catalog.

C.

Share access to codebase via Github and allow them to ingest datasets from your Datalake.

D.

Share required datasets and notebooks via Delta Sharing. Manage permissions via Unity Catalog.

Question 42

Which TWO items are characteristics of the Gold Layer?

Choose 2 answers

Options:

A.

Read-optimized

B.

Normalised

C.

Raw Data

D.

Historical lineage

E.

De-normalised

Question 43

A data engineering team is using Kafka to capture event data and then ingest it into Databricks. The team wants to be able to see these historical events. Medallion architecture is already in place. The team wants to be mindful of costs.

Where should this historical event data be stored?

Options:

A.

Gold

B.

Silver

C.

Bronze

D.

Raw layer

Question 44

A data engineer needs access to a table new_uable, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which approach can be used to identify the owner of new_table?

Options:

A.

There is no way to identify the owner of the table

B.

Review the Owner field in the table ' s page in the cloud storage solution

C.

Review the Permissions tab in the table ' s page in Data Explorer

D.

Review the Owner field in the table’s page in Data Explorer

Question 45

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS

SELECT customer_id -

FROM STREAM(LIVE.customers)

WHERE loyalty_level = ' high ' ;

Which of the following describes why the STREAM function is included in the query?

Options:

A.

The STREAM function is not needed and will cause an error.

B.

The table being created is a live table.

C.

The customers table is a streaming live table.

D.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

E.

The data in the customers table has been updated since its last run.

Question 46

A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following commands could the data engineering team use to access sales in PySpark?

Options:

A.

SELECT * FROM sales

B.

There is no way to share data between PySpark and SQL.

C.

spark.sql( " sales " )

D.

spark.delta.table( " sales " )

E.

spark.table( " sales " )

Question 47

A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.

They run the following command:

DROP TABLE IF EXISTS my_table

While the object no longer appears when they run SHOW TABLES, the data files still exist.

Which of the following describes why the data files still exist and the metadata files were deleted?

Options:

A.

The table’s data was larger than 10 GB

B.

The table’s data was smaller than 10 GB

C.

The table was external

D.

The table did not have a location

E.

The table was managed

Question 48

A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which of the following Git operations does the data engineer need to run to accomplish this task?

Options:

A.

Merge

B.

Push

C.

Pull

D.

Commit

E.

Clone

Question 49

Which of the following must be specified when creating a new Delta Live Tables pipeline?

Options:

A.

A key-value pair configuration

B.

The preferred DBU/hour cost

C.

A path to cloud storage location for the written data

D.

A location of a target database for the written data

E.

At least one notebook library to be executed

Question 50

Which query is performing a streaming hop from raw data to a Bronze table?

A)

as

B)

as

C)

as

D)

as

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question 51

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.

Which of the following data entities should the data engineer create?

Options:

A.

Database

B.

Function

C.

View

D.

Temporary view

E.

Table

Question 52

Which two components function in the DB platform architecture’s control plane? (Choose two.)

Options:

A.

Virtual Machines

B.

Compute Orchestration

C.

Serverless Compute

D.

Compute

E.

Unity Catalog

Page: 1 / 18
Total 176 questions