
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Dumps

Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 1

15 of 55.

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

id  name        count  timestamp
1   Delhi       20     2024-09-19T10:11
1   Delhi       50     2024-09-19T10:12
2   London      50     2024-09-19T10:15
3   Paris       30     2024-09-19T10:18
3   Paris       20     2024-09-19T10:20
4   Washington  10     2024-09-19T10:22

Which operation is supported with streaming_df?

Options:

A.

streaming_df.count()

B.

streaming_df.filter("count < 30")

C.

streaming_df.select(countDistinct("name"))

D.

streaming_df.show()

Question 2

A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data.

Which operation is supported with streaming_df?

Options:

A.

streaming_df.select(countDistinct("Name"))

B.

streaming_df.groupby("Id").count()

C.

streaming_df.orderBy("timestamp").limit(4)

D.

streaming_df.filter(col("count") < 30).show()
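
For context, a minimal sketch (using the built-in rate source purely for demonstration) of why row-wise transformations work on a streaming DataFrame while eager actions do not:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-ops-demo").getOrCreate()

# Streaming DataFrame from the built-in rate source (columns: timestamp, value)
streaming_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Supported: transformations such as filter() simply extend the streaming plan
filtered = streaming_df.filter(col("value") < 30)

# Not supported: eager actions such as streaming_df.show() or streaming_df.count()
# raise AnalysisException ("Queries with streaming sources must be executed with writeStream.start()")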

Question 3

A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

A.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

B.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

C.

df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")

D.

df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
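
A minimal sketch of the drop() approach (column names follow the question; the sample row is made up, and an active SparkSession named spark is assumed):

df_user = spark.createDataFrame(
    [(1, "Ada", "Lovelace", "ada@example.com", "1815-12-10", "UK")],
    ["user_id", "first_name", "last_name", "email", "birthdate", "country"],
)

# drop() returns a new DataFrame without the listed PII columns
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()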

Question 4

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)
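
A short sketch (hypothetical rows, active SparkSession assumed) contrasting how='any' and how='all' in na.drop():

users_raw_df = spark.createDataFrame(
    [(1, "alice", "1990-01-01"), (2, None, "1985-05-05"), (None, None, None)],
    ["user_id", "username", "date_of_birth"],
)

# how='any' drops a row if ANY column is null (removes rows 2 and 3 here)
users_raw_df.na.drop(how="any").show()

# how='all' drops a row only if ALL columns are null (removes only row 3 here)
users_raw_df.na.drop(how="all").show()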

Question 5

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

Options:

A.

final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

B.

final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

C.

final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

D.

final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
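
For illustration only, a sketch of how global sorting interacts with the number of output files (final_df and the table name come from the question; whether a single output file is acceptable depends on the use case):

# A global sort followed by a multi-file write does not guarantee that a reader
# sees one globally ordered sequence across files; coalescing to one partition
# keeps the global order inside a single Parquet file.
final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

# By contrast, sortWithinPartitions("market_time") only orders rows within each
# partition (and therefore within each output file), not across the whole table.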

Question 6

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

A.

The Spark engine requires manual intervention to start executing transformations.

B.

Only actions trigger the execution of the transformation pipeline.

C.

Transformations are executed immediately to build the lineage graph.

D.

The Spark engine optimizes the execution plan during the transformations, causing delays.

E.

Transformations are evaluated lazily.
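
A tiny sketch of lazy evaluation (active SparkSession assumed): the transformations only build a logical plan, and nothing runs until an action is invoked.

from pyspark.sql.functions import col

df = spark.range(1_000_000)                            # transformation: nothing executes yet
filtered = df.filter(col("id") % 2 == 0)               # transformation: still nothing executes
doubled = filtered.withColumn("twice", col("id") * 2)  # transformation

doubled.count()                                        # action: triggers execution of the pipeline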

Question 7

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")

df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.
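
A hedged sketch for checking which strategy Spark picked (the header/inferSchema options are added here as an assumption so that product_id exists as a named column):

# Default broadcast threshold is 10 MB (10485760 bytes), so an ~8 MB table is
# normally small enough to be broadcast automatically.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

df1 = spark.read.csv("sales_data.csv", header=True, inferSchema=True)
df2 = spark.read.csv("product_data.csv", header=True, inferSchema=True)

df_joined = df1.join(df2, df1.product_id == df2.product_id)
df_joined.explain()   # look for BroadcastHashJoin vs SortMergeJoin in the physical plan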

Question 8

An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:

from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')

What is the purpose of using broadcast() in this scenario?

Options:

A.

It filters the id values before performing the join.

B.

It increases the partition size for df1 and df2.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It ensures that the join happens only when the id values are identical.

Question 9

What is the benefit of using Pandas on Spark for data transformations?

Options:

A.

It is available only with Python, thereby reducing the learning curve.

B.

It computes results immediately using eager execution, making it simple to use.

C.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.

D.

It executes queries faster using all the available cores in the cluster as well as provides Pandas’s rich set of features.
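
A small sketch of the pandas API on Spark: familiar pandas syntax, but the work is distributed across the cluster (values are illustrative).

import pyspark.pandas as ps

psdf = ps.DataFrame({"city": ["Delhi", "Paris", "Delhi"], "count": [20, 30, 50]})

# pandas-style groupby that runs as a distributed Spark job under the hood
print(psdf.groupby("city")["count"].sum())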

Question 10

13 of 55.

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id  region_name
10         North
12         East
14         West

The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

A.

regions_dict = dict(regions.take(3))

B.

regions_dict = regions.select("region_id", "region_name").take(3)

C.

regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())

D.

regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())
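
A sketch of the pattern in option D end to end (the Parquet path is hypothetical; an active SparkSession is assumed):

regions = spark.read.parquet("/data/regions")   # hypothetical path to the small table

regions_dict = dict(
    regions.orderBy("region_id")
           .limit(3)
           .rdd.map(lambda row: (row.region_id, row.region_name))
           .collect()
)
# e.g. {10: 'North', 12: 'East', 14: 'West'}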

Question 11

A developer initializes a SparkSession:

spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()

Which statement describes the spark SparkSession?

Options:

A.

The getOrCreate() method explicitly destroys any existing SparkSession and creates a new one.

B.

A SparkSession is unique for each appName, and calling getOrCreate() with the same name will return an existing SparkSession once it has been created.

C.

If a SparkSession already exists, this code will return the existing session instead of creating a new one.

D.

A new SparkSession is created every time the getOrCreate() method is invoked.
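
A quick sketch demonstrating that getOrCreate() reuses an existing session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Analytics Application").getOrCreate()

# A second getOrCreate() returns the already-active session, even with a different appName
spark2 = SparkSession.builder.appName("Some Other Name").getOrCreate()
print(spark is spark2)   # True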

Question 12

22 of 55.

A Spark application needs to read multiple Parquet files from a directory where the files have differing but compatible schemas.

The data engineer wants to create a DataFrame that includes all columns from all files.

Which code should the data engineer use to read the Parquet files and include all columns using Apache Spark?

Options:

A.

spark.read.parquet("/data/parquet/")

B.

spark.read.option("mergeSchema", True).parquet("/data/parquet/")

C.

spark.read.format("parquet").option("inferSchema", "true").load("/data/parquet/")

D.

spark.read.parquet("/data/parquet/").option("mergeAllCols", True)

Question 13

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet could the data engineer use to fulfil this requirement?

Options:

A.

Uses trigger(continuous='5 seconds') – continuous processing mode.

B.

Uses trigger() – default micro-batch trigger without interval.

C.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.

D.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.
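
A sketch of a 5-second micro-batch trigger (rate source and console sink are used purely for demonstration):

query = (
    spark.readStream.format("rate").load()
         .writeStream
         .format("console")
         .trigger(processingTime="5 seconds")   # a micro-batch is started every 5 seconds
         .start()
)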

Question 14

35 of 55.

A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.

How can this be achieved?

Options:

A.

By configuring the option recoveryLocation during SparkSession initialization.

B.

By configuring the option checkpointLocation during readStream.

C.

By configuring the option checkpointLocation during writeStream.

D.

By configuring the option recoveryLocation during writeStream.
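
A sketch showing where checkpointLocation belongs, on the writeStream side (paths and sink format are hypothetical):

query = (
    df.writeStream
      .format("parquet")
      .option("path", "/data/output/events")                 # hypothetical sink path
      .option("checkpointLocation", "/checkpoints/events")   # enables recovery after restarts
      .start()
)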

Question 15

What is a feature of Spark Connect?

Options:

A.

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs

B.

Supports DataFrame, Functions, Column, SparkContext PySpark APIs

C.

It supports only PySpark applications

D.

It has built-in authentication

Question 16

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.driver.cores

D.

spark.executor.memory
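
A sketch of where this setting is usually supplied (the value 4 is illustrative): spark.executor.cores caps how many tasks a single executor can run concurrently.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.executor.cores", "4")   # each executor can run up to 4 tasks in parallel
    .getOrCreate()
)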

Question 17

The following code fragment results in an error. Which code fragment should be used instead?

(The failing code fragment and the four candidate replacements A–D are shown only as images in the original question.)

Question 18

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

complete

B.

append

C.

replace

D.

aggregate

Question 19

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate records based on event_ts, sensor_id, and metric_value. Which approach meets this requirement?

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
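
A sketch of deduplicating on exactly those three columns (the source table name is hypothetical; an active SparkSession is assumed):

deduped_df = (
    spark.table("sensor_events")   # hypothetical source table
         .dropDuplicates(["event_ts", "sensor_id", "metric_value"])
)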

Question 20

11 of 55.

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.executor.memory

D.

spark.sql.shuffle.partitions

Question 21

46 of 55.

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records.

The engineer has written the following code:

inputStream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "15 minutes"))

What happens to data that arrives after the watermark threshold?

Options:

A.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

B.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.
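
For reference, a fuller sketch of the watermarked aggregation (the aggregation and sink are illustrative additions); per the Structured Streaming semantics, events older than the watermark (maximum event time seen so far minus 10 minutes) are considered too late and may be dropped from the aggregation.

from pyspark.sql.functions import window, count

agg = (
    inputStream
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "15 minutes"))
    .agg(count("*").alias("events"))
)

query = (
    agg.writeStream
       .outputMode("update")
       .format("console")
       .start()
)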

Question 22

17 of 55.

A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation could AQE be performing to automatically improve the Spark application's performance?

Options:

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk
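
A sketch of the configuration key involved (a standard Spark SQL setting; AQE is enabled by default from Spark 3.2 onward):

# With AQE on, Spark can re-plan at runtime, e.g. switching a sort-merge join to a
# broadcast join once the true post-shuffle sizes are known, and coalescing small
# shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))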

Question 23

16 of 55.

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

Options:

A.

Transformations are executed immediately to build the lineage graph.

B.

The Spark engine optimizes the execution plan during the transformations, causing delays.

C.

Transformations are evaluated lazily.

D.

The Spark engine requires manual intervention to start executing transformations.

E.

Only actions trigger the execution of the transformation pipeline.

Question 24

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?


Options:

A.

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

B.

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

C.

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

D.

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
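
A sketch of the accuracy parameter on the approximate percentile function (in PySpark the function is percentile_approx; data is illustrative, active SparkSession assumed). A larger accuracy value uses more memory but yields estimates closer to the exact percentiles.

from pyspark.sql.functions import percentile_approx

df = spark.range(100000).withColumnRenamed("id", "value")

# The third argument is accuracy (default 10000); raising it tightens the estimate
df.select(
    percentile_approx("value", [0.25, 0.5, 0.75], 10000).alias("quartiles")
).show(truncate=False)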

Question 25

An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

Options:

A.

The code fails to execute because the column names employee_id and emp_id do not match automatically

B.

The code fails to execute because it must use on='employee_id' to specify the join column explicitly

C.

The code fails to execute because PySpark does not support joining DataFrames with a different structure

D.

The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2
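
A runnable sketch of the join above with made-up rows (active SparkSession assumed); the explicit equality condition is what allows differently named key columns to be joined.

df1 = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["employee_id", "name"])
df2 = spark.createDataFrame([(1, "Engineering"), (3, "Finance")], ["emp_id", "department"])

result = df1.join(df2, df1.employee_id == df2.emp_id, how="inner")
result.show()   # only employee_id 1 has a matching emp_id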

Question 26

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

Options:

A.

Set the number of partitions equal to the total number of CPUs in the cluster

B.

Set the number of partitions to a fixed value, such as 200

C.

Set the number of partitions equal to the number of nodes in the cluster

D.

Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
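
A back-of-the-envelope sketch of the sizing rule described in option D (the 1 TB figure is from the question; the 128 MB target partition size is a common rule of thumb):

dataset_bytes = 1 * 1024**4               # 1 TB
target_partition_bytes = 128 * 1024**2    # 128 MB

num_partitions = dataset_bytes // target_partition_bytes
print(num_partitions)                     # 8192 partitions

# e.g. df = df.repartition(num_partitions)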

Question 27

Which UDF implementation calculates the length of strings in a Spark DataFrame?

Options:

A.

df.withColumn("length", spark.udf("len", StringType()))

B.

df.select(length(col("stringColumn")).alias("length"))

C.

spark.udf.register("stringLength", lambda s: len(s))

D.

df.withColumn("length", udf(lambda s: len(s), StringType()))

Question 28

6 of 55.

Which components of Apache Spark’s Architecture are responsible for carrying out tasks when assigned to them?

Options:

A.

Driver Nodes

B.

Executors

C.

CPU Cores

D.

Worker Nodes

Question 29

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()

Question 30

10 of 55.

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.

Question 31

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")

def mean_func(value: pd.Series) -> float:

return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Question 32

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

Options:

A.

160

B.

64

C.

80

D.

40
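
The arithmetic behind the configuration, as a quick sanity check:

nodes = 10
executors_per_node = 4
cores_per_executor = 4

total_cores = nodes * executors_per_node * cores_per_executor
print(total_cores)   # 160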

Question 33

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Options:

A.

Convert the Pandas UDF to a PySpark UDF

B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C.

Run the in_spanish_inner() function in a mapInPandas() function call

D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF
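
A sketch of the Iterator[pd.Series] -> Iterator[pd.Series] form, which lets the model be loaded once per partition of batches rather than once per batch (get_translation_model is the question's hypothetical helper; sf is the pyspark.sql.functions alias used in the question):

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang="es")   # loaded once, then reused for every batch
    for batch in batches:
        yield batch.apply(model)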

Question 34

What is the behavior for function date_sub(start, days) if a negative value is passed into the days parameter?

Options:

A.

The same start date will be returned

B.

An error message of an invalid parameter will be returned

C.

The number of days specified will be added to the start date

D.

The number of days specified will be removed from the start date
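
A tiny sketch of the negative-days case (the date value is illustrative; active SparkSession assumed):

from pyspark.sql.functions import date_sub, lit, to_date

df = spark.range(1).select(to_date(lit("2024-09-19")).alias("start"))

# date_sub with a negative days value moves the date forward
df.select(date_sub("start", -5).alias("result")).show()   # 2024-09-24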

Question 35

19 of 55.

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function not available in the standard Spark functions library.

The existing UDF code is:

import hashlib
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = udf(shake_256, StringType())

The developer replaces this UDF with a Pandas UDF for better performance:

@pandas_udf(StringType())
def shake_256(raw: str) -> str:
    return hashlib.shake_256(raw.encode()).hexdigest(20)

However, the developer receives this error:

TypeError: Unsupported signature: (raw: str) -> str

What should the signature of the shake_256() function be changed to in order to fix this error?

Options:

A.

def shake_256(raw: str) -> str:

B.

def shake_256(raw: [pd.Series]) -> pd.Series:

C.

def shake_256(raw: pd.Series) -> pd.Series:

D.

def shake_256(raw: [str]) -> [str]:
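
A sketch of the Series -> Series form that the error message points toward; note that the body must then operate element-wise on the pandas Series:

import hashlib
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def shake_256(raw: pd.Series) -> pd.Series:
    return raw.apply(lambda s: hashlib.shake_256(s.encode()).hexdigest(20))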

Question 36

49 of 55.

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE

B.

COMPLETE

C.

REPLACE

D.

APPEND
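
For illustration, the documented behavior of the "complete" output mode, which rewrites the whole result table on every trigger (contrast with "append", which only emits newly finalized rows):

query = (
    aggDF.writeStream
         .format("console")
         .outputMode("complete")   # the entire result table is emitted on each trigger
         .start()
)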

Question 37

54 of 55.

What is the benefit of Adaptive Query Execution (AQE)?

Options:

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Question 38

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

A.

Configure the application to run in cluster mode instead of local mode.

B.

Increase the number of local threads based on the number of CPU cores.

C.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

D.

Set the spark.executor.memory property to a large value.
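
A sketch of sizing local-mode parallelism from the CPU count (local[*] does this automatically; the explicit form is shown for clarity):

import os
from pyspark.sql import SparkSession

cores = os.cpu_count()

# local[N] runs Spark in a single JVM with N worker threads; local[*] uses all cores
spark = (
    SparkSession.builder
    .master(f"local[{cores}]")
    .appName("local-testing")
    .getOrCreate()
)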

Question 39

20 of 55.

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Options:

A.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

B.

persist() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and cache() — Can be used to set different storage levels.

C.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_DESER).

D.

cache() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and persist() — Can be used to set different storage levels to persist the contents of the DataFrame.
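
A sketch of the two calls (StorageLevel comes from the standard pyspark module; active SparkSession assumed):

from pyspark import StorageLevel

df1 = spark.range(1_000_000)
df2 = spark.range(1_000_000)

df1.cache()                          # uses the default storage level
df2.persist(StorageLevel.DISK_ONLY)  # persist() accepts an explicit StorageLevel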

Question 40

9 of 55.

Given the code fragment:

import pyspark.pandas as ps

pdf = ps.DataFrame(data)

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

pdf.to_pandas()

B.

pdf.to_spark()

C.

pdf.to_dataframe()

D.

pdf.spark()
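
A sketch of the conversion (data is illustrative):

import pyspark.pandas as ps

pdf = ps.DataFrame({"region_id": [10, 12, 14], "region_name": ["North", "East", "West"]})

sdf = pdf.to_spark()     # returns a standard pyspark.sql.DataFrame
print(type(sdf))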
