NVIDIA AI Operations Questions and Answers
An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams of a specific container.
What command should be used?
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?
An administrator is troubleshooting a bottleneck in a deep learning run time and needs consistent data feed rates to GPUs.
Which storage metric should be used?
You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?
A system administrator needs to scale a Kubernetes Job to 4 replicas.
What command should be used?
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.
What is a key requirement for enabling cloudbursting across multiple cloud providers?
You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?
A data scientist is training a deep learning model and notices slower than expected training times. The data scientist alerts a system administrator to inspect the issue. The system administrator suspects the disk IO is the issue.
What command should be used?
A system administrator needs to collect the information below:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?
A Fleet Command system administrator wants to create an organization user that will have the following rights:
For locations - read only
For Applications - read/write/admin
For Deployments - read/write/admin
For Dashboards - read only
What role should the system administrator assign to this user?
A Slurm user is experiencing a frequent issue where a Slurm job is getting stuck in the “PENDING” state and unable to progress to the “RUNNING” state.
Which Slurm command can help the user identify the reason for the job’s pending status?
You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
An administrator wants to check if the BlueMan service can access the DPU.
How can this be done?
A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI).
Where would the cloud engineer find the VMI?
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
You are using BCM for configuring an active-passive high availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)