Databricks Real Dumps Practice Exam Questions by Dumpswarp

Databricks Certified Associate Developer for Apache Spark 3.5-Python Questions and Answers

Question 1

What is the benefit of using Pandas on Spark for data transformations?

Options:

It is available only with Python, thereby reducing the learning curve.

It computes results immediately using eager execution, making it simple to use.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.

It executes queries faster using all the available cores in the cluster as well as provides Pandas’s rich set of features.

Question 2

Given:

python

CopyEdit

spark.sparkContext.setLogLevel("")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

Options:

ALL, DEBUG, FAIL, INFO

ERROR, WARN, TRACE, OFF

WARN, NONE, ERROR, FATAL

FATAL, NONE, INFO, DEBUG

Question 3

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

Options:

result_df = prices_df \

.withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))

result_df = prices_df \

.agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

result_df = prices_df \

.agg(F.min("spot_price"), F.max("spot_price"))

result_df = prices_df \

.agg(F.count("spot_price").alias("spot_price")) \

.filter(F.col("spot_price") > F.lit("min_price"))

Question 4

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

Optimize the data processing logic by repartitioning the DataFrame.

Modify the Spark configuration to disable garbage collection

Increase the memory allocated to the Spark Driver.

Cache large DataFrames to persist them in memory.

Question 5

An engineer has a large ORC file located at/file/test_data.orcand wants to read only specific columns to reduce memory usage.

Which code fragment will select the columns, i.e.,col1,col2, during the reading process?

Options:

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Question 6

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

The Spark engine requires manual intervention to start executing transformations.

Only actions trigger the execution of the transformation pipeline.

Transformations are executed immediately to build the lineage graph.

The Spark engine optimizes the execution plan during the transformations, causing delays.

Transformations are evaluated lazily.

Question 7

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

By configuring the optioncheckpointLocationduringreadStream

By configuring the optionrecoveryLocationduring the SparkSession initialization

By configuring the optionrecoveryLocationduringwriteStream

By configuring the optioncheckpointLocationduringwriteStream

Question 8

A developer runs:

What is the result?

Options:

It stores all data in a single Parquet file.

It throws an error if there are null values in either partition column.

It appends new partitions to an existing Parquet file.

It creates separate directories for each unique combination of color and fruit.

Question 9

A data engineer is streaming data from Kafka and requires:

Minimal latency

Exactly-once processing guarantees

Which trigger mode should be used?

Options:

.trigger(processingTime='1 second')

.trigger(continuous=True)

.trigger(continuous='1 second')

.trigger(availableNow=True)

Question 10

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

Options:

DataFrame.groupBy().agg()

DataFrame.filter()

DataFrame.withColumn()

DataFrame.select()

Question 11

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:

spark.executor.cores

spark.task.maxFailures

spark.driver.cores

spark.executor.memory

Question 12

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

Spark DataFrames, Structured Streaming, and GraphX

Spark SQL, Pandas API on Spark, and Structured Streaming

Spark Streaming, GraphX, and Pandas API on Spark

Spark DataFrames, Spark SQL, and MLlib

Question 13

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set forspark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

Options:

A Cartesian join

A shuffled hash join

A broadcast nested loop join

A sort-merge join

Question 14

What is the behavior for functiondate_sub(start, days)if a negative value is passed into thedaysparameter?

Options:

The same start date will be returned

An error message of an invalid parameter will be returned

The number of days specified will be added to the start date

The number of days specified will be removed from the start date

Question 15

A data engineer is working ona Streaming DataFrame streaming_df with the given streaming data:

Which operation is supported with streamingdf ?

Options:

streaming_df. select (countDistinct ("Name") )

streaming_df.groupby("Id") .count ()

streaming_df.orderBy("timestamp").limit(4)

streaming_df.filter (col("count") < 30).show()

Answer:

Explanation:

Which operation is supported with streaming_df?

A. streaming_df.select(countDistinct("Name"))

B. streaming_df.groupby("Id").count()

C. streaming_df.orderBy("timestamp").limit(4)

D. streaming_df.filter(col("count") < 30).show()

Answer: B

Comprehensive and Detailed Explanation:

In Structured Streaming, only a limited subset of operations is supported due to the nature of unbounded data. Operations like sorting (orderBy) and global aggregation (countDistinct) require a full view of the dataset, which is not possible with streaming data unless specific watermarks or windows are defined.

Review of Each Option:

A. select(countDistinct("Name"))

Not allowed — Global aggregation like countDistinct() requires the full dataset and is not supported directly in streaming without watermark and windowing logic.

[Reference: Databricks Structured Streaming Guide – Unsupported Operations., B. groupby("Id").count()Supported — Streaming aggregations over a key (like groupBy("Id")) are supported. Spark maintains intermediate state for each key.Reference: Databricks Docs → Aggregations in Structured Streaming (https://docs.databricks.com/structured-streaming/aggregation.html), C. orderBy("timestamp").limit(4)Not allowed — Sorting and limiting require a full view of the stream (which is infinite), so this is unsupported in streaming DataFrames.Reference: Spark Structured Streaming – Unsupported Operations (ordering without watermark/window not allowed)., D. filter(col("count") < 30).show()Not allowed — show() is a blocking operation used for debugging batch DataFrames; it's not allowed on streaming DataFrames.Reference: Structured Streaming Programming Guide – Output operations like show() are not supported., , Reference Extract from Official Guide:, “Operations like orderBy, limit, show, and countDistinct are not supported in Structured Streaming because they require the full dataset to compute a result. Use groupBy(...).agg(...) instead for incremental aggregations.”— Databricks Structured Streaming Programming Guide]

Question 16

What is the risk associated with this operation when converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

Options:

The conversion will automatically distribute the data across worker nodes

The operation will fail if the Pandas DataFrame exceeds 1000 rows

Data will be lost during conversion

The operation will load all data into the driver's memory, potentially causing memory overflow

Question 17

A DataFramedfhas columnsname,age, andsalary. The developer needs to sort the DataFrame byagein ascending order andsalaryin descending order.

Which code snippet meets the requirement of the developer?

Options:

df.orderBy(col("age").asc(), col("salary").asc()).show()

df.sort("age", "salary", ascending=[True, True]).show()

df.sort("age", "salary", ascending=[False, True]).show()

df.orderBy("age", "salary", ascending=[True, False]).show()

Question 18

What is the difference betweendf.cache()anddf.persist()in Spark DataFrame?

Options:

Bothcache()andpersist()can be used to set the default storage level (MEMORY_AND_DISK_SER)

Both functions perform the same operation. Thepersist()function provides improved performance asits default storage level isDISK_ONLY.

persist()- Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) andcache()- Can be used to set different storage levels to persist the contents of the DataFrame.

cache()- Persists the DataFrame with the default storage level (MEMORY_AND_DISK) andpersist()- Can be used to set different storage levels to persist the contents of the DataFrame

Question 19

A data engineer wants to process a streaming DataFrame that receives sensor readings every second with columnssensor_id,temperature, andtimestamp. The engineer needs to calculate the average temperature for each sensor over the last 5 minutes while the data is streaming.

Which code implementation achieves the requirement?

Options from the images provided:

Options:

Option A

Option B

Option C

Option D

Question 20

Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

Options:

It provides a way to run Spark applications remotely in any programming language

It can be used to interact with any remote cluster using the REST API

It allows for remote execution of Spark jobs

It is primarily used for data ingestion into Spark from external sources

Question 21

A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")

df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")

Question 22

A data engineer needs to write a Streaming DataFrame as Parquet files.

Given the code:

Which code fragment should be inserted to meet the requirement?

Options:

.format("parquet")

.option("location", "path/to/destination/dir")

CopyEdit

.option("format", "parquet")

.option("destination", "path/to/destination/dir")

.option("format", "parquet")

.option("location", "path/to/destination/dir")

.format("parquet")

.option("path", "path/to/destination/dir")

Question 23

Given a CSV file with the content:

And the following code:

from pyspark.sql.types import *

schema = StructType([

StructField("name", StringType()),

StructField("age", IntegerType())

])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

Options:

[Row(name='bambi'), Row(name='alladin', age=20)]

[Row(name='alladin', age=20)]

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

The code throws an error due to a schema mismatch.

Question 24

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

Use an RDD action like reduce() to compute the maximum time

Use an accumulator to record the maximum time on the driver

Broadcast a variable to share the maximum time among workers

Configure the Spark UI to automatically collect maximum times

Question 25

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns fortransaction_id,account_number,transaction_amount, andtimestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the orders using PySpark?

Options:

df = df.dropDuplicates()

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

df = df.filter(F.col("transaction_id").isNotNull())

df = df.dropDuplicates(["transaction_amount"])

Load More Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions

Summer Limited Time 60% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: wrap60

Dumpswrap Top Menu

breadcrumb

Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Dumps

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Free PDF Questions

Databricks Certified Associate Developer for Apache Spark 3.5-Python Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer: