Databricks Real Dumps Practice Exam Questions by Dumpswarp

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

Options:

They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.

They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.

They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

They can schedule the query to run every 1 day from the Jobs UI.

They can schedule the query to run every 12 hours from the Jobs UI.

Question 2

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.

They have the following incomplete code block:

____(f"SELECT customer_id, spend FROM {table_name}")

Which of the following can be used to fill in the blank to successfully complete the task?

Options:

spark.delta.sql

spark.delta.table

spark.table

dbutils.sql

spark.sql
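As a rough illustration of the pattern in Question 2, the sketch below interpolates a Python variable into a SQL string with an f-string and passes the result to a session's `sql` method. The names `build_query`, `run_customer_spend`, `FakeSession`, and the table name `sales.customers` are hypothetical; `FakeSession` only stands in for a real `SparkSession` so the snippet runs without a cluster.

```python
def build_query(table_name: str) -> str:
    """Interpolate the table name into the SQL text with an f-string."""
    return f"SELECT customer_id, spend FROM {table_name}"


def run_customer_spend(spark, table_name: str):
    """Hand the finished SQL string to the session's sql() method.

    `spark` is assumed to be a SparkSession (or anything exposing .sql()).
    """
    return spark.sql(build_query(table_name))


class FakeSession:
    """Hypothetical stand-in that records the SQL it receives."""

    def __init__(self):
        self.queries = []

    def sql(self, query: str):
        self.queries.append(query)
        return query


session = FakeSession()
result = run_customer_spend(session, "sales.customers")
print(result)
```

In a notebook, the predefined `spark` session would be passed in place of `FakeSession()`.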

Question 3

Which SQL keyword can be used to convert a table from a long format to a wide format?

Options:

TRANSFORM

PIVOT

SUM

CONVERT
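To make the long-versus-wide distinction in Question 3 concrete, here is a minimal pure-Python sketch of the reshaping a pivot performs. It is an emulation on made-up sample rows, not Spark SQL itself, and the `pivot` helper is a hypothetical name.

```python
from collections import defaultdict

# Long format: one row per (store, month) pair. Sample values are made up.
long_rows = [
    ("north", "jan", 10),
    ("north", "feb", 12),
    ("south", "jan", 7),
    ("south", "feb", 9),
]


def pivot(rows):
    """Turn (key, column, value) rows into one wide record per key."""
    wide = defaultdict(dict)
    for key, column, value in rows:
        # Sum values that share the same key and column, like a SUM aggregate.
        wide[key][column] = wide[key].get(column, 0) + value
    return dict(wide)


wide_rows = pivot(long_rows)
print(wide_rows)
```

Each distinct month in the long rows becomes its own column in the wide result, with one record per store.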

Question 4

What is the functionality of Auto Loader in Databricks?

Options:

Auto Loader automatically ingests and processes new files from cloud storage, handling batch data with support for schema evolution.

Auto Loader automatically ingests and processes new files from cloud storage, handling only streaming data with no support for schema evolution.

Auto Loader automatically ingests and processes new files from cloud storage, handling batch and streaming data with no support for schema evolution.

Auto Loader automatically ingests and processes new files from cloud storage, handling both batch and streaming data with support for schema evolution.

Question 5

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS

SELECT customer_id

FROM STREAM(LIVE.customers)

WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

Options:

The STREAM function is not needed and will cause an error.

The table being created is a live table.

The customers table is a streaming live table.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

The data in the customers table has been updated since its last run.

Question 6

Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

Options:

None of these

Data lake

Data warehouse

All of these

Data lakehouse

Question 7

A data engineer has created a new database using the following command:

CREATE DATABASE IF NOT EXISTS customer360;

In which of the following locations will the customer360 database be located?

Options:

dbfs:/user/hive/database/customer360

dbfs:/user/hive/warehouse

dbfs:/user/hive/customer360

More information is needed to determine the correct response

Question 8

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

Options:

They can clone the existing task in the existing Job and update it to run the new notebook.

They can create a new task in the existing Job and then add it as a dependency of the original task.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

They can create a new job from scratch and add both tasks to run concurrently.

They can clone the existing task to a new Job and then edit it to run the new notebook.

Question 9

What is the maximum output supported by a job cluster to ensure a notebook does not fail?

Options:

10 MB

25 MB

30 MB

15 MB

Question 10

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group, named team.

Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

Options:

GRANT VIEW ON CATALOG customers TO team;

GRANT CREATE ON DATABASE customers TO team;

GRANT USAGE ON CATALOG team TO customers;

GRANT CREATE ON DATABASE team TO customers;

GRANT USAGE ON DATABASE customers TO team;

Question 11

A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

Options:

pyspark.sql.types.DateType

datetime

pyspark.sql.types.TimestampType

Cron syntax

There is no way to represent and submit this information programmatically
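For Question 11, a schedule can be written once as data and reused across Jobs. The sketch below builds a schedule block using the Quartz cron field names found in Databricks Jobs settings (`quartz_cron_expression`, `timezone_id`, `pause_status`); the cron expression, job names, and the `with_schedule` helper are hypothetical examples.

```python
import json

# Quartz cron fields: seconds minutes hours day-of-month month day-of-week.
# This hypothetical expression fires at 06:30 on weekdays.
schedule = {
    "quartz_cron_expression": "0 30 6 ? * MON-FRI",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}


def with_schedule(job_settings: dict, schedule: dict) -> dict:
    """Return a copy of the job settings carrying the shared schedule."""
    updated = dict(job_settings)
    updated["schedule"] = dict(schedule)
    return updated


# Hypothetical job names; the same schedule is reused for each one.
jobs = [{"name": "ingest_orders"}, {"name": "refresh_reports"}]
scheduled = [with_schedule(job, schedule) for job in jobs]
print(json.dumps(scheduled, indent=2))
```

The resulting settings could then be submitted through whatever deployment tooling the team already uses.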

Question 12

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

Options:

The table was managed

The table's data was smaller than 10 GB

The table's data was larger than 10 GB

The table was external

The table did not have a location

Question 13

A Databricks workflow fails at the last stage due to an error in a notebook. This workflow runs daily. The data engineer fixes the mistake and wants to rerun the pipeline. This workflow is very costly and time-intensive to run.

Which action should the data engineer take in order to minimise downtime and cost?

Options:

Switch to another cluster

Repair run

Re-run the entire workflow

Restart the cluster

Question 14

In which of the following file formats is data from Delta Lake tables primarily stored?

Options:

Delta

CSV

Parquet

JSON

A proprietary, optimized format specific to Databricks

Question 15

A data engineer at a company that uses Databricks with Unity Catalog needs to share a collection of tables with an external partner who also uses a Databricks workspace enabled for Unity Catalog. The data engineer decides to use Delta Sharing to accomplish this.

What is the first piece of information the data engineer should request from the external partner to set up Delta Sharing?

Options:

Their Databricks account password

The name of their Databricks cluster

The IP address of their Databricks workspace

The sharing identifier of their Unity Catalog metastore

Question 16

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

Options:

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

They can turn on the Auto Stop feature for the SQL endpoint.

They can increase the cluster size of the SQL endpoint.

They can turn on the Serverless feature for the SQL endpoint.

They can increase the maximum bound of the SQL endpoint's scaling range

Question 17

Which query is performing a streaming hop from raw data to a Bronze table?

Options:

Option A

Option B

Option C

Option D

Question 18

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.

Which of the following data entities should the data engineer create?

Options:

Database

Function

View

Temporary view

Table

Question 19

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

Options:

processingTime(1)

trigger(availableNow=True)

trigger(parallelBatch=True)

trigger(processingTime="once")

trigger(continuous="once")

Question 20

In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

Options:

When the location of the data needs to be changed

When the target table is an external table

When the source table can be deleted

When the target table cannot contain duplicate records

When the source is not a Delta table

Question 21

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which of the following approaches can be used to identify the owner of new_table?

Options:

Review the Permissions tab in the table's page in Data Explorer

All of these options can be used to identify the owner of the table

Review the Owner field in the table's page in Data Explorer

Review the Owner field in the table's page in the cloud storage solution

There is no way to identify the owner of the table

Question 22

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

Options:

They can set up separate expectations for each table when developing their DLT pipeline.

They cannot determine which table is dropping the records.

They can set up DLT to notify them via email when records are dropped.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Question 23

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

Options:

Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

Records that violate the expectation cause the job to fail.

Question 24

A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

The data engineer runs the following query to join these tables together:

Which of the following will be returned by the above query?

Options:

Option A

Option B

Option C

Option D

Option E

Question 25

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

Options:

INSERT INTO my_table VALUES ('a1', 6, 9.4)

my_table UNION VALUES ('a1', 6, 9.4)

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

UPDATE my_table VALUES ('a1', 6, 9.4)

UPDATE VALUES ('a1', 6, 9.4) my_table
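The append behaviour asked about in Question 25 can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a Delta table (an assumption made purely so the example runs anywhere); the pre-existing row is made up.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a Delta table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_table (id TEXT, "rank" INTEGER, rating REAL)')
conn.execute("INSERT INTO my_table VALUES ('a0', 3, 7.1)")  # pre-existing row

# Append the new record; existing rows are left untouched.
conn.execute("INSERT INTO my_table VALUES ('a1', 6, 9.4)")

rows = conn.execute('SELECT id, "rank", rating FROM my_table ORDER BY id').fetchall()
print(rows)
conn.close()
```

The positional `VALUES` list must match the table's column order, which is why the record is given as id, rank, rating.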

Question 26

A data engineer needs to parse only png files in a directory that contains files with different suffixes. Which code should the data engineer use to achieve this task?

Options:

Option A

Option B

Option C

Option D

Question 27

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company's travel reimbursement details. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

Options:

CONSTRAINT valid_location EXPECT (location = NULL)

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE

CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL

Question 28

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

Which line of code should the data engineer use to fill in the blank if the data engineer only wants the query to execute a micro-batch to process data every 5 seconds?

Options:

trigger("5 seconds")

trigger(continuous="5 seconds")

trigger(once="5 seconds")

trigger(processingTime="5 seconds")
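Since the code block for Question 28 is not reproduced here, the following is only a sketch of what a fixed-interval streaming write can look like in PySpark. The function name, checkpoint path, and target table are hypothetical, and `stream_df` is assumed to be a streaming DataFrame supplied by the caller.

```python
def start_micro_batch_stream(stream_df, checkpoint_path: str, target_table: str):
    """Start a streaming write that fires a micro-batch every 5 seconds.

    `stream_df` is assumed to be a streaming PySpark DataFrame; the
    checkpoint path and table name are placeholders chosen by the caller.
    """
    return (
        stream_df.writeStream
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="5 seconds")  # fixed-interval micro-batches
        .toTable(target_table)
    )
```

Swapping the trigger line for `.trigger(availableNow=True)` would instead process everything currently available, in as many batches as needed, and then stop.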

Question 29

A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which of the following Git operations does the data engineer need to run to accomplish this task?

Options:

Merge

Push

Pull

Commit

Clone

Question 30

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Options:

Databricks Filesystem

Jobs

Dashboards

Repos

Data Explorer

Question 31

A data engineer is maintaining ETL pipeline code in a GitHub repository linked to their Databricks account. The data engineer wants to deploy the ETL pipeline to production as a Databricks workflow.

Which approach should the data engineer use?

Options:

Databricks Asset Bundles (DAB) + GitHub Integration

Maintain workflow_config.json and deploy it using Databricks CLI

Manually create and manage the workflow in UI

Maintain workflow_config.json and deploy it using Terraform

Question 32

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

Options:

DROP

IGNORE

MERGE

APPEND

INSERT
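To show what "avoiding duplicate records" means for a write in Question 32, here is a pure-Python emulation of an insert-only merge, in which a source row is inserted only when no target row has the same key. It is not Delta Lake code; the helper name and sample rows are hypothetical.

```python
def merge_insert_only(target, source, key):
    """Emulate an insert-only merge on lists of dicts.

    Rows from `source` are appended only when their key is absent from
    `target`, so re-running the same load does not create duplicates.
    """
    seen = {row[key] for row in target}
    merged = list(target)
    for row in source:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])
    return merged


target = [{"id": 1, "spend": 20}, {"id": 2, "spend": 35}]
source = [{"id": 2, "spend": 35}, {"id": 3, "spend": 50}]

merged = merge_insert_only(target, source, key="id")
print(merged)
```

A plain append of `source` would have produced two rows with id 2; matching on the key first is what prevents that.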

Question 33

An organization is looking for an optimized storage layer that supports ACID transactions and schema enforcement. Which technology should the organization use?

Options:

Cloud File Storage

Unity Catalog

Data lake

Delta Lake

Question 34

A new data engineering team, team, has been assigned to an ELT project. The new data engineering team will need full privileges on the database customers to fully manage the project.

Which of the following commands can be used to grant full permissions on the database to the new data engineering team?

Options:

GRANT USAGE ON DATABASE customers TO team;

GRANT ALL PRIVILEGES ON DATABASE team TO customers;

GRANT SELECT PRIVILEGES ON DATABASE customers TO teams;

GRANT SELECT CREATE MODIFY USAGE PRIVILEGES ON DATABASE customers TO team;

GRANT ALL PRIVILEGES ON DATABASE customers TO team;

Question 35

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".

Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?

Options:

The format of the files to be copied was not included with the FORMAT_OPTIONS keyword.

The names of the files to be copied were not included with the FILES keyword.

The previous day’s file has already been copied into the table.

The PARQUET file format does not support COPY INTO.

The COPY INTO statement requires the table to be refreshed to view the copied rows.
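COPY INTO is idempotent: files that have already been loaded from the source location are skipped on later runs. The toy model below emulates that bookkeeping in plain Python; the class, file name, and rows are hypothetical, and this is not how Databricks implements it internally.

```python
class IdempotentLoader:
    """Toy model of an idempotent file load: each file is ingested once."""

    def __init__(self):
        self.loaded_files = set()
        self.rows = []

    def copy_into(self, files):
        """Load rows from files not seen before; return the rows added."""
        added = 0
        for name, file_rows in files.items():
            if name in self.loaded_files:
                continue  # already ingested on an earlier run, so skip it
            self.rows.extend(file_rows)
            self.loaded_files.add(name)
            added += len(file_rows)
        return added


# Hypothetical file name and contents.
files = {"2024-03-01.parquet": [("t1", 9.99), ("t2", 4.50)]}
loader = IdempotentLoader()
first_run = loader.copy_into(files)
second_run = loader.copy_into(files)
print(first_run, second_run)
```

The second call adds nothing because the only file in the location was recorded as loaded by the first call, so the row count stays the same.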

Question 36

A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database. They run the following command:

CREATE TABLE jdbc_customer360

USING ____

OPTIONS (

url "jdbc:sqlite:/customers.db", dbtable "customer360"

)

Which line of code fills in the above blank to successfully complete the task?

Options:

autoloader

org.apache.spark.sql.jdbc

sqlite

org.apache.spark.sql.sqlite

Question 37

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

Options:

They can use endpoints available in Databricks SQL

They can use jobs clusters instead of all-purpose clusters

They can configure the clusters to be single-node

They can use clusters that are from a cluster pool

They can configure the clusters to autoscale for larger data sizes

Answer: D

Explanation:

The best action the data engineer can take to improve the start-up time of the clusters used for the Job is to use clusters that are from a cluster pool. A pool is a set of idle, ready-to-use cloud instances that clusters can draw from. When a cluster is attached to a pool, its driver and worker nodes are allocated from those idle instances instead of being requested from the cloud provider, which shortens cluster start time and reduces the latency of the tasks. Pools can also be shared by multiple users and jobs, which helps with resource efficiency.

Option A is not relevant, as endpoints available in Databricks SQL are used for creating and managing SQL analytics workloads, not for improving cluster start up time.

Option B is not correct, as job clusters and all-purpose clusters have similar start-up times. Job clusters are dedicated to running a single job and are terminated when the job completes. All-purpose clusters can be used for multiple purposes, such as interactive sessions, notebooks, or multiple jobs. Both types of clusters can benefit from using a cluster pool.

Option C is not advisable, as configuring the clusters to be single-node will reduce the parallelism and performance of the tasks. Single-node clusters consist of a driver only, with no worker nodes, and are typically used for testing, development, or small workloads. They are not suitable for running production jobs that require high scalability and fault tolerance.

Option E is not helpful, as configuring the clusters to autoscale for larger data sizes will not affect the start up time of the clusters. Autoscaling is a feature that allows clusters to dynamically adjust the number of worker nodes based on the workload. It can help optimize the resource utilization and cost efficiency of the clusters, but it does not speed up the cluster creation process.
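As a configuration sketch for the explanation above, a job cluster can be pointed at a pool by setting `instance_pool_id` in its cluster spec, so its nodes come from the pool's idle instances. The pool ID, runtime version, task names, and helper function below are placeholders, not values from the question.

```python
# Hypothetical pool ID; substitute the ID of a real pool.
POOL_ID = "pool-1234-example"


def pooled_cluster_spec(pool_id: str, num_workers: int) -> dict:
    """Build a job-cluster spec that takes its nodes from an instance pool."""
    return {
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "instance_pool_id": pool_id,  # nodes come from the pool's idle instances
        "num_workers": num_workers,
    }


task_names = ["extract", "transform", "load"]
job_clusters = [
    {"job_cluster_key": f"{name}_cluster", "new_cluster": pooled_cluster_spec(POOL_ID, 2)}
    for name in task_names
]
print(job_clusters[0])
```

No node type is set in the spec because a pooled cluster takes its instance type from the pool.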


Question 38

A data engineer is working on a personal laptop and needs to perform complex transformations on data stored in a Delta Lake on cloud storage. The engineer decides to use Databricks Connect to interact with Databricks clusters and work in their local IDE.

How does Databricks Connect enable the engineer to develop, test, and debug code seamlessly on their local machine while interacting with Databricks clusters?

Options:

By allowing direct execution of Spark jobs from the local machine without needing a network connection

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using a specific IDE that is required by Databricks

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using their preferred IDE

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code only through Databricks' own web interface

Question 39

A data engineer streams customer orders into a Kafka topic (orders_topic) and is currently writing the ingestion script of a DLT pipeline. The data engineer needs to ingest the data from Kafka brokers to DLT using Databricks.

What is the correct code for ingesting the data?

Options:

Option A

Option B

Option C

Option D

Question 40

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?