NVIDIA NCP-AII Dumps

Page: 1 / 12
Total 123 questions

NVIDIA AI Infrastructure Questions and Answers

Question 1

Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE host channel adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which of the following solutions best meets this requirement?

Options:

A.

SFP Connectors

B.

SFP to 1G BASE-T (RJ45) adapter

C.

QSA Adapter

Question 2

During a DGX cluster deployment, what is the most effective way to verify the health and integrity of the local RAID storage array?

Options:

A.

Run a read/write benchmark utility, such as FIO, across the RAID array, looking for expected speed and latency metrics as proof of storage integrity.

B.

Verify that all configured RAID volumes are mounted and available in the operating system, and that disk utilization levels are within recommended limits.

C.

Use the mdadm --examine and mdadm --detail commands to review the RAID array’s status, checking for drive failures, array consistency, and error events.
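
As a complement to the mdadm commands named in option C, the kernel's /proc/mdstat summary can also be scanned for degraded arrays. The following is a minimal sketch, not an NVIDIA-documented procedure: the function name find_degraded_arrays and the sample text are hypothetical, and the parser assumes the usual "[expected/active] [UU_]" status notation that /proc/mdstat prints for redundant arrays.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not captured from a real DGX).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid0]
md0 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0](F)
      488254464 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def find_degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)  # remember which array the next status line belongs to
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            # An underscore marks a missing or failed member slot.
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded
```

On a live system the text would come from reading /proc/mdstat; mdadm --detail remains the source for per-device state and event counts.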

Question 3

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

Options:

A.

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Question 4

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

Options:

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM diagnostics: dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Question 5

An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?

Options:

A.

sudo ubuntu-drivers uninstall

B.

sudo rm -rf /usr/lib/nvidia

C.

sudo apt-get remove nvidia-driver-550

D.

sudo apt-get purge nvidia-* && sudo apt-get autoremove

Question 6

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Question 7

What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?

Options:

A.

To benchmark production training speed and ensure all GPUs are running at identical clock speeds.

B.

To stress test the hardware and software stack with representative NeMo workloads, ensuring reliability.

C.

To tune NeMo model hyperparameters for maximum accuracy on user datasets during cluster deployment.

Question 8

A healthcare organization is deploying an AI system to analyze patient data for predictive diagnostics. The system must comply with strict data protection regulations such as HIPAA, ensuring that sensitive information remains confidential and secure. Considering the need for robust security measures, which combination of strategies should the organization prioritize to protect against data breaches and ensure regulatory compliance?

Options:

A.

Deploy data masking to obscure sensitive data during processing and use role-based access control (RBAC) to limit data access based on user roles.

B.

Use tokenization to replace sensitive data with non-sensitive tokens and employ multi-factor authentication (MFA) for system access.

C.

Implement symmetric encryption for all data at rest and rely solely on password-based access controls.

D.

Rely on asymmetric encryption for all communications and use data deduplication to minimize storage costs without additional security measures.

Question 9

A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

Options:

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Question 10

A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?

Options:

A.

Create a unique, strong, lower-case username and password that will be used for both BMC and GRUB access, avoiding default or weak credentials.

B.

Create separate usernames for BMC and GRUB to maximize flexibility.

C.

Skip the creation of a new user and retain the default admin account for BMC and GRUB access.

D.

Use “sysadmin” as the username and a simple password for ease of management.

Question 11

The system administrator plans to use Multi-Instance GPU profiles. What command should be used to verify that the GPU has this mode enabled?

Options:

A.

nvidia-mode

B.

nvidia-mig

C.

nvidia-enable

D.

nvidia-smi
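
For scripting this check, nvidia-smi can print the MIG mode per GPU with --query-gpu=index,mig.mode.current --format=csv,noheader. The sketch below only parses that CSV text; the function name parse_mig_modes and the sample rows are hypothetical, and the "[N/A]" value is assumed to be what GPUs without MIG support report.

```python
def parse_mig_modes(csv_text):
    """Map GPU index to True (MIG enabled), False (disabled), or None (not reported)."""
    modes = {}
    for row in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in row.split(",", 1)]
        if mode == "Enabled":
            modes[int(index)] = True
        elif mode == "Disabled":
            modes[int(index)] = False
        else:
            modes[int(index)] = None  # e.g. "[N/A]" on GPUs without MIG support
    return modes


# Hypothetical output rows in "index, mig.mode.current" order.
SAMPLE = "0, Enabled\n1, Disabled\n2, [N/A]\n"
```

On a real host the text could be captured with subprocess.run(["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"], capture_output=True, text=True).stdout.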

Question 12

What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

Options:

A.

To measure the storage network performance.

B.

To measure the latency between GPUs.

C.

To measure the power consumption of GPUs.

D.

To measure bandwidth between GPUs.

Question 13

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

Options:

A.

nvidia-smi --query-disk-space

B.

du -sh /home/*

C.

df -h | grep ' /var '

D.

lsof +L1
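
The alerting logic described in the question can be sketched without shelling out at all. The example below uses Python's standard shutil.disk_usage; the function names and the 10% default threshold are illustrative assumptions, not values from NVIDIA documentation.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


def below_threshold(path="/", min_free_fraction=0.10):
    """True when free space on the filesystem has dropped under the threshold."""
    return free_fraction(path) < min_free_fraction
```

A maintenance script could call below_threshold("/var") on a schedule and send a notification when it returns True.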

Question 14

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

Options:

A.

NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B.

Run without splits and analyze per-rack averages.

C.

NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D.

NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
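
To reason about what a split expression does, it can help to compute the resulting rank groups by hand. The sketch below assumes, based on the nccl-tests README, that NCCL_TESTS_SPLIT applies the operator to each rank and that ranks with the same result share a communicator. It also assumes ranks are numbered contiguously per node (ranks 0 to 7 on the first node, 8 to 15 on the second). The helper split_groups is hypothetical and is not part of nccl-tests.

```python
from collections import defaultdict

# Operators assumed to be accepted by NCCL_TESTS_SPLIT.
OPS = {
    "AND": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
}


def split_groups(spec, n_ranks):
    """Group ranks by the 'color' that a split spec such as 'MOD 2' assigns them."""
    op_name, value = spec.split()
    op = OPS[op_name.upper()]
    groups = defaultdict(list)
    for rank in range(n_ranks):
        # int(value, 0) accepts both decimal ("8") and hex ("0x7") literals.
        groups[op(rank, int(value, 0))].append(rank)
    return list(groups.values())
```

Under these assumptions, with 16 ranks on two 8-GPU nodes, "OR 0x7" and "DIV 8" each yield one group per node, while "AND 0x7" pairs each GPU with its counterpart on the other node.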

Question 15

A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?

Options:

A.

Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.

B.

To save time, simultaneously update all nodes in the cluster without draining or diagnostics.

C.

Update nodes that have reported faults, leaving others on older firmware.

D.

Drain nodes from the scheduler, update firmware in batches, skip diagnostics and verify health post-update before scaling to the next batch.
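
The batch-and-verify pattern that several of these options describe can be sketched as a small control loop. Everything here is hypothetical scaffolding: rolling_update and its callback parameters (drain, precheck, update, postcheck, resume) stand in for whatever scheduler and firmware tooling a site actually uses.

```python
def rolling_update(nodes, batch_size, drain, precheck, update, postcheck, resume):
    """Update nodes batch by batch; stop at the first batch that fails a health check.

    Returns (updated_nodes, failed_batch). A failed batch is left drained so it
    can be investigated before any further nodes are touched.
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            drain(node)  # take the node out of scheduling first
        if not all(precheck(node) for node in batch):
            return done, batch  # pre-update diagnostics failed
        for node in batch:
            update(node)
        if not all(postcheck(node) for node in batch):
            return done, batch  # post-update health check failed
        for node in batch:
            resume(node)  # return healthy nodes to the scheduler
        done.extend(batch)
    return done, []
```

Stopping at the first failed batch limits the blast radius of a bad firmware image to batch_size nodes.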

Question 16

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.

nvidia-smi -q | grep "GPU Stress Test"

B.

sudo nvsm stress-test --force

C.

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.

./gpu_burn 60

Question 17

An administrator needs to manually deploy the BlueField image on a target DPU. The administrator downloads the new image file and needs to flash it to the hardware. Which command should the administrator use?

Options:

A.

/opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl

B.

apt install doca-runtime

C.

dd if=/root/bf.image of=/dev/bf bs=4096k

D.

bfb-install --rshim

Question 18

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

B.

esxcli graphics module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

C.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=FRL=0x01"

D.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00"

Question 19

During HPL execution on a DGX cluster, the benchmark fails with “not enough memory” errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?

Options:

A.

Disable double-buffering via BCAST parameter.

B.

Increase block size to 6144 to maximize GPU utilization.

C.

Reduce the problem size while maintaining the same block size.

D.

Set PMAP to 1 to enable process mapping.
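
The relationship between problem size and memory can be made concrete. HPL factors a dense N x N matrix of 8-byte double-precision values, so the matrix alone needs about 8 * N^2 bytes. The sketch below applies a common rule of thumb: size the matrix to a fraction of the memory HPL will use, then round N down to a multiple of the block size NB. The function name and the 0.8 default fraction are assumptions, not values from the question.

```python
import math


def hpl_problem_size(total_mem_bytes, block_size, mem_fraction=0.8):
    """Largest N (a multiple of block_size) whose N x N float64 matrix fits the budget."""
    elements = mem_fraction * total_mem_bytes / 8  # 8 bytes per float64 matrix entry
    n = math.isqrt(int(elements))                  # side length of the square matrix
    return (n // block_size) * block_size          # round down to a multiple of NB
```

For example, with 8 GiB of memory and NB = 1024, a fraction of 1.0 gives N = 32768 and a fraction of 0.5 gives N = 22528, which shows how lowering N at a fixed block size shrinks the footprint.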

Question 20

An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?

Options:

A.

docker run --rm --runtime=docker nvidia/cuda nvidia-smi

B.

docker run --rm -it ubuntu:22.04 nvidia-smi

C.

docker run --rm --gpus all nvidia/cuda:12.4.6-base-ubuntu22.04 nvidia-smi

D.

docker run --rm nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Question 21

An engineer must ensure that a BlueField-3 NIC firmware download matches the cluster’s PSID. Which step is critical before installation?

Options:

A.

Check that the DPU’s BMC IP is reachable by ping.

B.

Confirm that the firmware file size matches the DPU’s flash capacity.

C.

Use mstflint -d <PCI_ID> query to validate the device PSID before selecting the firmware image.

D.

Verify that the SHA256 hash of the firmware matches NVIDIA’s public ledger.

Question 22

An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?

Options:

A.

Use apt-get to upgrade the operating system without rebooting the system.

B.

Create a USB drive with the ISO and manually boot from it on the DGX system.

C.

Build a software image on Base Command Manager and then reimage the system.

D.

Skip ISO verification and directly flash the operating system to the disk via SSH.

Question 23

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

Options:

A.

nvsm show power

B.

nvsm show powermode

C.

nvsm show health

D.

nvsm show alerts

Question 24

After installing NGC CLI on RHEL, a user runs ngc registry image list but sees no results. The API key and organization are correctly configured. What resolves this?

Options:

A.

Disable SELinux to eliminate unnecessary security restrictions.

B.

Run ngc config set --team team-name to specify a team.

C.

Reinstall the CLI using the yum command instead of manual installation.

D.

Ensure the user's NGC account has REGISTRY_READ permissions for the organization.

Question 25

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize to develop more accurate energy-efficiency metrics?

Options:

A.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

B.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Use watts-used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

Question 26

Which function is used to collect the cluster counters information?

Options:

A.

SM

B.

PM

C.

GM

D.

FM

Question 27

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A.

Lane power variance < 3dB across all transceivers.

B.

Transceiver model matching QSFP-DD specifications.

C.

Temperature fluctuations > 5°C during validation.

D.

Effective BER > 1.5E-254 during a < 6-hour monitoring window.

Question 28

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

Options:

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Question 29

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Question 30

For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?

Options:

A.

broadcast_perf -b 4G -e 16G -w 160

B.

all_reduce_perf -b 8G -e 32G -c 1000 -z 1 -G 1000

C.

all_reduce_perf -b 8G -e 32G -z 1 -G 1000

D.

reduce_scatter_perf -f 2 -g 8

Question 31

An engineer needs to validate NVLink Switch functionality on a DGX H100 system with 8 GPUs. Which NCCL command verifies intra-node NVLink bandwidth?

Options:

A.

broadcast_perf -b 8 -e 16G -f 2 -g 8 without split configuration

B.

all_reduce_perf -b 8 -e 16G -f 2 -g 4 with NCCL_TESTS_SPLIT="MOD 2"

C.

all_reduce_perf -b 8 -e 16G -f 2 -g 1 repeated 8 times

D.

all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT="OR 0x7"

Question 32

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A.

export HPL_OOC_SAFE_SIZE=4.0

B.

export HPL_OOC_MODE=0

C.

export HPL_OOC_NUM_STREAMS=8

D.

export HPL_OOC_MAX_GPU_MEM=90

Question 33

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.

Inconclusive; rerun with --stress=cpu to validate.
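
The figures in this question mix units: link speed is quoted in gigabits per second (400G), while the test result is in gigabytes per second. The sketch below does the conversion. The count of eight links per node is an assumption made for illustration and is not stated in the question, and the helper names are hypothetical.

```python
def link_gbytes_per_s(link_gbits_per_s):
    """Convert a line rate in Gb/s to GB/s (8 bits per byte, protocol overhead ignored)."""
    return link_gbits_per_s / 8


def utilization(measured_gbytes_per_s, link_gbits_per_s, links_per_node):
    """Fraction of the aggregate theoretical line rate that a measurement reaches."""
    peak = link_gbytes_per_s(link_gbits_per_s) * links_per_node
    return measured_gbytes_per_s / peak
```

Under that assumption, a single 400 Gb/s link carries at most 50 GB/s, eight links give a 400 GB/s ceiling, and a 350 GB/s measurement corresponds to 87.5% of line rate.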

Question 34

After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?

Options:

A.

Multiple transceiver firmware mismatches.

B.

400G port utilization at 70% on several nodes during tests.

C.

Jitter below 5 ps with consistent latency.

D.

Packet loss greater than 0.001% causing NCCL pipeline stalls.

Question 35

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

Options:

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.

Question 36

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.
