Google Real Dumps Practice Exam Questions by Dumpswarp

Google Professional Data Engineer Exam Questions and Answers

Question 1

You need to compose visualizations for operations teams with the following requirements:

Which approach meets the requirements?

Options:

Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.

Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.

Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.

Question 2

MJTelco is building a custom interface to share data. They have these requirements:

They need to do aggregations over their petabyte-scale datasets.

They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

Cloud Datastore and Cloud Bigtable

Cloud Bigtable and Cloud SQL

BigQuery and Cloud Bigtable

BigQuery and Cloud Storage

Question 3

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

The zone

The number of workers

The disk size per worker

The maximum number of workers

Question 4

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

Ensure all the tables are included in global dataset.

Ensure each table is included in a dataset for a region.

Adjust the settings for each table to allow a related region-based security group view access.

Adjust the settings for each view to allow a related region-based security group view access.

Adjust the settings for each dataset to allow a related region-based security group view access.

Question 5

MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

Options:

Rowkey: date#device_idColumn data: data_point

Rowkey: dateColumn data: device_id, data_point

Rowkey: device_idColumn data: date, data_point

Rowkey: data_pointColumn data: device_id, date

Rowkey: date#data_pointColumn data: device_id

Question 6

You need to compose visualization for operations teams with the following requirements:

Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

The report must not be more than 3 hours delayed from live data.

The actionable report should only show suboptimal links.

Most suboptimal links should be sorted to the top.

Suboptimal links can be grouped and filtered by regional geography.

User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

Look through the current data and compose a series of charts and tables, one for each possiblecombination of criteria.

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

Export the data to a spreadsheet, compose a series of charts and tables, one for each possiblecombination of criteria, and spread them across multiple tabs.

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

Question 7

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

Create a table called tracking_table and include a DATE column.

Create a partitioned table called tracking_table and include a TIMESTAMP column.

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

Create a table called tracking_table with a TIMESTAMP column to represent the day.

Question 8

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Question 9

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Clod Pub/Sub.

Use the NOW () function in BigQuery to record the event’s time.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Question 10

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

Store the common data in BigQuery as partitioned tables.

Store the common data in BigQuery and expose authorized views.

Store the common data encoded as Avro in Google Cloud Storage.

Store he common data in the HDFS storage for a Google Cloud Dataproc cluster.

Question 11

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all thedata in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

Export the data into a Google Sheet for virtualization.

Create an additional table with only the necessary columns.

Create a view on the table to present to the virtualization tool.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Question 12

Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?

Options:

Preemptible workers cannot use persistent disk.

Preemptible workers cannot store data.

If a preemptible worker is reclaimed, then a replacement worker must be added manually.

A Dataproc cluster cannot have only preemptible workers.

Question 13

Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

Options:

Field promotion

Randomization

Salting

Hashing

Question 14

Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)

Options:

The wide model is used for memorization, while the deep model is used for generalization.

A good use for the wide and deep model is a recommender system.

The wide model is used for generalization, while the deep model is used for memorization.

A good use for the wide and deep model is a small-scale linear regression problem.

Question 15

Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?

Options:

An hourly watermark

An event time trigger

The with Allowed Lateness method

A processing time trigger

Question 16

By default, which of the following windowing behavior does Dataflow apply to unbounded data sets?

Options:

Windows at every 100 MB of data

Single, Global Window

Windows at every 1 minute

Windows at every 10 minutes

Question 17

What are all of the BigQuery operations that Google charges for?

Options:

Storage, queries, and streaming inserts

Storage, queries, and loading data from a file

Storage, queries, and exporting data

Queries and streaming inserts

Question 18

Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.

Options:

details

value

null

Question 19

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?

Options:

Dataproc Worker

Dataproc Viewer

Dataproc Runner

Dataproc Editor

Question 20

How can you get a neural network to learn about relationships between categories in a categorical feature?

Options:

Create a multi-hot column

Create a one-hot column

Create a hash bucket

Create an embedding column

Question 21

Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited as 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (choose two.)

Options:

Introduce data compression for each file to increase the rate file of file transfer.

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premices data to the designated storage bucket.

Question 22

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

The user profile: What the user likes and doesn’t like to eat

The user account information: Name, address, preferred meal times

The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

BigQuery

Cloud SQL

Cloud Bigtable

Cloud Datastore

Question 23

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor= ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Options:

Option A

Option B.

Option C

Option D

Question 24

Your company has recently grown rapidly and now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

Rewrite the job in Pig.

Rewrite the job in Apache Spark.

Increase the size of the Hadoop cluster.

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Question 25

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file in processed once per day as inexpensively as possible. What should you do?

Options:

Change the processing job to use Google Cloud Dataproc instead.

Manually start the Cloud Dataflow job each morning when you get into the office.

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Question 26

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

Load the data every 30 minutes into a new partitioned table in BigQuery.

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Question 27

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

Options:

Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.

Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.

Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.

Question 28

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

Options:

The CSV data loaded in BigQuery is not flagged as CSV.

The CSV data has invalid rows that were skipped on import.

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

The CSV data has not gone through an ETL phase before loading into BigQuery.

Question 29

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

Redis

HBase

MySQL

MongoDB

Cassandra

HDFS with Hive

Question 30

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

Disable caching by editing the report settings.

Disable caching in BigQuery by editing table details.

Refresh your browser tab showing the visualizations.

Clear your browser history for the past hour then reload the tab showing the virtualizations.

Question 31

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and server millions of users. How should you design the frontend to respond to a database failure?

Options:

Issue a command to restart the database servers.

Retry the query with exponential backoff, up to a cap of 15 minutes.

Retry the query every second until it comes back online to minimize staleness of data.

Reduce the query frequency to once every hour until the database comes back online.

Question 32

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

Options:

Eliminate features that are highly correlated to the output labels.

Combine highly co-dependent features into one representative feature.

Instead of feeding in each feature individually, average their values in batches of 3.

Remove the features that have null values for more than 50% of the training records.

Question 33

You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. Initially, design the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

Options:

Re-write the application to load accumulated data every 2 minutes.

Convert the streaming insert code to batch load for individual messages.

Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.

Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.

Question 34

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.

The data scientists have written the following code to read the data for a new key features in the logs.

BigQueryIO.Read

.named(“ReadLogData”)

.from(“clouddataflow-readonly:samples.log_data”)

You want to improve the performance of this data read. What should you do?

Options:

Specify the TableReference object in the code.

Use .fromQuery operation to read specific fields from the table.

Use of both the Google BigQuery TableSchema and TableFieldSchema classes.

Call a transform that returns TableRow objects, where each element in the PCollexction represents a single row in the table.

Question 35

Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other’s data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)

Options:

Load data into different partitions.

Load data into a different dataset for each client.

Put each client’s BigQuery dataset into a different table.

Restrict a client’s dataset to approved users.

Only allow a service account to access the datasets.

Use the appropriate identity and access management (IAM) roles for each client’s users.

Question 36

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action of these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

Options:

The message body for the sensor event is too large.

Your custom endpoint has an out-of-date SSL certificate.

The Cloud Pub/Sub topic has too many messages published to it.

Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Question 37

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Options:

Threading

Serialization

Dropout Methods

Dimensionality Reduction

Question 38

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

Add two columns to the table CLICK STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Question 39

You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

Options:

Include ORDER BY DESK on timestamp column and LIMIT to 1.

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Question 40

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristic support this method? (Choose two.)

Options:

There are very few occurrences of mutations relative to normal samples.

There are roughly equal occurrences of both normal and mutated samples in the database.

You expect future mutations to have different features from the mutated samples in the database.

You expect future mutations to have similar features to the mutated samples in the database.

You already have labels for which samples are mutated and which are normal in the database.

Question 41

You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:

Each department should have access only to their data.

Each department will have one or more leads who need to be able to create and update tables and provide them to their team.

Each department has data analysts who need to be able to query but not modify data.

How should you set access to the data in BigQuery?

Options:

Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.

Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.

Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.

Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.

Question 42

You are building a streaming Dataflow pipeline that ingests noise level data from hundreds of sensors placed near construction sites across a city. The sensors measure noise level every ten seconds, and send that data to the pipeline when levels reach above 70 dBA. You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes, but the window ends when no data has been received for 15 minutes What should you do?

Options:

Use session windows with a 30-mmute gap duration.

Use tumbling windows with a 15-mmute window and a fifteen-minute. withAllowedLateness operator.

Use session windows with a 15-minute gap duration.

Use hopping windows with a 15-mmute window, and a thirty-minute period.

Answer:

Explanation:

The key requirements for the windowing strategy are:

A window groups data for a specific sensor.

A window should contain data spanningat least30 minutes ("duration of more than 30 minutes" implies activity for this period).

A window for a sensorendswhen no data has been received from that sensor for 15 minutes (this is a gap).

This scenario perfectly describessession windows.

Session Windows:Session windows group elements (per key, e.g., per sensor ID) that arrive within a certain "gap duration" of each other. A new session starts if data for a key arrives after the gap duration has passed since the last data point for that key.

In this case, if data stops arriving for a sensor for 15 minutes, the current session for that sensor closes. This matches "the window ends when no data has been received for 15 minutes."

The "duration of more than 30 minutes" requirement is a condition you would applyafterthe session window closes. You'd calculate the duration of the data within the closed session window and only compute the average if that session's duration (span of event times within it) exceeds 30 minutes. Session windows themselves don't have a fixed duration; their duration is determined by data activity and the gap.

Let's analyze why other options are less suitable:

A (Hopping windows with a 15-minute window, and a thirty-minute period):Hopping windows have a fixed size and a fixed period. They create overlapping windows. This doesn't align with the dynamic nature of sessions ending based on inactivity. A 30-minute period with a 15-minute window means windows like [0:00-0:15], [0:15-0:30], [0:30-0:45]. If activity is continuous, a 30-minute activity span would be covered, but the window closing is not based on a 15-minute gap of inactivity.

B (Tumbling windows with a 15-minute window and a fifteen-minute .withAllowedLateness operator):Tumbling windows are fixed-size, non-overlapping windows. .withAllowedLateness deals with late data arriving for a window that has already passed its end time, not with defining the window based on activity gaps.

C (Session windows with a 30-minute gap duration):This would mean a session ends only if there's a 30-minute gap of inactivity. The requirement is a 15-minute gap.

Therefore, session windows with a 15-minute gap duration (Option D) correctly model the requirement for windows to close after 15 minutes of inactivity from a sensor. The subsequent filtering for sessions lasting more than 30 minutes is a downstream operation.

[Reference:, Apache Beam Programming Guide > Windowing > Windowing functions > Session windows. "Session windowing assigns elements to windows that represent sessions of activity. A session window starts when the first element arrives for a key. If another element arrives for that key within the specified gap duration, that element is included in the existing session window. If an element arrives after the gap duration, a new session window starts for that element... Session windows are useful for data that is irregularly distributed with respect to time, such as user activity data.", This directly matches the sensor data behavior: data arrives when noise is high, and a period of no data for 15 minutes should close the analysis window for that sensor., , , ]

Question 43

A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions. You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL ‘dataset.model’, table user_features). How should you create the ML pipeline?

Options:

Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account.

Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.

Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.

Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.

Question 44

You are building a teal-lime prediction engine that streams files, which may contain Pll (personal identifiable information) data, into Cloud Storage and eventually into BigQuery You want to ensure that the sensitive data is masked but still maintains referential Integrity, because names and emails are often used as join keys How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the Pll data is not accessible by unauthorized individuals?

Options:

Create a pseudonym by replacing the Pll data with cryptogenic tokens, and store the non-tokenized data in a locked-down button.

Redact all Pll data, and store a version of the unredacted data in a locked-down bucket

Scan every table in BigQuery, and mask the data it finds that has Pll

Create a pseudonym by replacing Pll data with a cryptographic format-preserving token

Question 45

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

Options:

Cloud SQL

Cloud Bigtable

Cloud Spanner

Cloud Datastore

Question 46

You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:

SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country

You check the query plan for the query and see the following output in the Read section of Stage:1:

What is the most likely cause of the delay for this query?

Options:

Users are running too many concurrent queries in the system

The [myproject:mydataset.mytable] table has too many partitions

Either the state or the city columns in the [myproject:mydataset.mytable] table have too manyNULL values

Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew

Question 47

You want to optimize your queries for cost and performance. How should you structure your data?

Options:

Partition table data by create_date, location_id and device_version

Partition table data by create_date cluster table data by location_Id and device_version

Cluster table data by create_date location_id and device_version

Cluster table data by create_date partition by locationed and device_version

Question 48

You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company's mobile app You have reviewed old chat logs and lagged each conversation for intent based on each customer's stated intention for contacting customer service About 70% of customer requests are simple requests that are solved within 10 intents The remaining 30% of inquiries require much longer, more complicated requests Which intents should you automate first?

Options:

Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests

Automate the more complicated requests first because those require more of the agents' time

Automate a blend of the shortest and longest intents to be representative of all intents

Automate intents in places where common words such as "payment" appear only once so the software isn't confused

Question 49

You need to migrate a Redis database from an on-premises data center to a Memorystore for Redis instance. You want to follow Google-recommended practices and perform the migration for minimal cost. time, and effort. What should you do?

Options:

Make a secondary instance of the Redis database on a Compute Engine instance, and then perform a live cutover.

Write a shell script to migrate the Redis data, and create a new Memorystore for Redis instance.

Create a Dataflow job to road the Redis database from the on-premises data center. and write the data to a Memorystore for Redis instance

Make an RDB backup of the Redis database, use the gsutil utility to copy the RDB file into a Cloud Storage bucket, and then import the RDB tile into the Memorystore for Redis instance.

Question 50

You are designing the architecture to process your data from Cloud Storage to BigQuery by using Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network. What should you do?

Options:

Assign the compute. networkUser role to the Dataflow service agent.

Assign the compute.networkUser role to the service account that executes the Dataflow pipeline.

Assign the dataflow, admin role to the Dataflow service agent.

Assign the dataflow, admin role to the service account that executes the Dataflow pipeline.

Question 51

You are designing a pipeline that publishes application events to a Pub/Sub topic. You need to aggregate events across hourly intervals before loading the results to BigQuery for analysis. Your solution must be scalable so it can process and load large volumes of events to BigQuery. What should you do?

Options:

Create a streaming Dataflow job to continually read from the Pub/Sub topic and perform the necessary aggregations using tumbling windows

Schedule a batch Dataflow job to run hourly, pulling all available messages from the Pub-Sub topic and performing the necessary aggregations

Schedule a Cloud Function to run hourly, pulling all avertable messages from the Pub/Sub topic and performing the necessary aggregations

Create a Cloud Function to perform the necessary data processing that executes using the Pub/Sub trigger every time a new message is published to the topic.

Question 52

Different teams in your organization store customer and performance data in BigOuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?

Options:

Create a BigQuery scheduled query to replicate all customer data into team projects.

Enable each team to create materialized views of the data they need to access in their projects.

Ask each team to publish their data in Analytics Hub. Direct the other teams to subscribe to them.

Ask each team to create authorized views of their data. Grant the biquery. jobUser role to each team.

Question 53

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

Options:

1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.2. Enable turbo replication.3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the us-south1 region.4. In case of a regional failure, redeploy your Dataproc duster to the us-south1 region and continue reading from the same bucket.

1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.2. Enable turbo replication.3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.

1. Create a Cloud Storage bucket in the US multi-region.2. Run the Dataproc cluster in a zone in the ua-central1 region, reading data from the US multi-region bucket.3. In case of a regional failure, redeploy the Dataproc cluster to the us-central2 region and continue reading from the same bucket.

1. Create two regional Cloud Storage buckets, one in the us-central1 region and one in the us-south1 region.2. Have the upstream process write data to the us-central1 bucket. Use the Storage Transfer Service to copy data hourly from the us-central1 bucket to the us-south1 bucket.3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in that region.4. In case of regional failure, redeploy your Dataproc clust

Question 54

You have one BigQuery dataset which includes customers' street addresses. You want to retrieve all occurrences of street addresses from the dataset. What should you do?

Options:

Create a deep inspection job on each table in your dataset with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType.

Create a de-identification job in Cloud Data Loss Prevention and use the masking transformation.

Write a SQL query in BigQuery by using REGEXP_CONTAINS on all tables in your dataset to find rows where the word "street" appears.

Create a discovery scan configuration on your organization with Cloud Data Loss Prevention and create an inspection template thatincludes the STREET_ADDRESS infoType.

Question 55

You have a Standard Tier Memorystore for Redis instance deployed in a production environment. You need to simulate a Redis instance failover in the most accurate disaster recovery situation, and ensure that the failover has no impact on production data. What should you do?

Options:

Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode.

Initiate a manual failover by using the limited-data-loss data protection mode to the Memorystore for Redis instance in theproduction environment.

Increase one replica to Redis instance in production environment. Initiate a manual failover by using the force-data-loss dataprotection mode.

Create a Standard Tier Memorystore for Redis instance in the development environment. Initiate a manual failover by using the limited-data-loss data protection mode.

Question 56

You store and analyze your relational data in BigQuery on Google Cloud with all data that resides in US regions. You also have a variety of object stores across Microsoft Azure and Amazon Web Services (AWS), also in US regions. You want to query all your data in BigQuery daily with as little movement of data as possible. What should you do?

Options:

Load files from AWS and Azure to Cloud Storage with Cloud Shell gautil rsync arguments.

Create a Dataflow pipeline to ingest files from Azure and AWS to BigQuery.

Use the BigQuery Omni functionality and BigLake tables to query files in Azure and AWS.

Use BigQuery Data Transfer Service to load files from Azure and AWS into BigQuery.

Question 57

You are testing a Dataflow pipeline to ingest and transform text files. The files are compressed gzip, errors are written to a dead-letter queue, and you are using Sidelnputs to join data You noticed that the pipeline is taking longer to complete than expected, what should you do to expedite the Dataflow job?

Options:

Switch to compressed Avro files

Reduce the batch size

Retry records that throw an error

Use CoGroupByKey instead of the Sidelnput

Question 58

You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub. processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

Options:

Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore

Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.

Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.

Question 59

You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?

Options:

Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.

Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.

Question 60

You need to modernize your existing on-premises data strategy. Your organization currently uses.

• Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication.

• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps.

You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?

Options:

Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases Convert your ETL pipelines to Dataflow.

Use Bigtable for your large workloads, with connections to Cloud Storage to handle any HDFS use cases Orchestrate your pipelines with Cloud Composer.

Use Dataproc to migrate your Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Use Cloud Data Fusion to visually design and deploy your ETL pipelines.

Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases.Orchestrate your pipelines with Cloud Composer..

Answer:

Explanation:

Dataproc is a fully managed service that allows you to run Apache Hadoop and Spark workloads on Google Cloud. It is compatible with the open source ecosystem, so you can migrate your existing Hadoop clusters to Dataproc with minimal changes. Cloud Storage is a scalable, durable, and cost-effective object storage service that can replace HDFS for storing and accessing data. Cloud Storage offers interoperability with Hadoop through connectors, so you can use it as a data source or sink for your Dataproc jobs. Cloud Composer is a fully managed service that allowsyou to create, schedule, and monitor workflows using Apache Airflow. It is integrated with Google Cloud services, such as Dataproc, BigQuery, Dataflow, and Pub/Sub, so you can orchestrate your ETL pipelines across different platforms. Cloud Composer is compatible with your existing Airflow code, so you can migrate your existing orchestration processes to Cloud Composer with minimal changes.

The other options are not as suitable as Dataproc and Cloud Composer for this use case, because they either require more changes to your existing code, or do not meet your requirements. Dataflow is a fully managed service that allows you to create and run scalable data processing pipelines using Apache Beam. However, Dataflow is not compatible with your existing Hadoop code, so you would need to rewrite your ETL pipelines using Beam. Bigtable is a fully managed NoSQL database service that can handle large and complex data sets. However, Bigtable is not compatible with your existing Hadoop code, so you would need to rewrite your queries and applications using Bigtable APIs. Cloud Data Fusion is a fully managed service that allows you to visually design and deploy data integration pipelines using a graphical interface. However, Cloud Data Fusion is not compatible with your existing Airflow code, so you would need to recreate your orchestration processes using Cloud Data Fusion UI. References:

Dataproc overview

Cloud Storage connector for Hadoop

Cloud Composer overview

Load More Professional-Data-Engineer Questions

Spring Sale Discount Flat 70% Offer - Ends in 0d 00h 00m 00s - Coupon code: 70diswrap

Dumpswrap Top Menu

breadcrumb

Google Professional-Data-Engineer Dumps

Professional-Data-Engineer Free PDF Questions

Google Professional Data Engineer Exam Questions and Answers

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options: