Spring Sale Discount Flat 70% Offer - Ends in 0d 00h 00m 00s - Coupon code: 70diswrap

NVIDIA NCP-AII Dumps

Page: 1 / 7
Total 71 questions

NVIDIA AI Infrastructure Questions and Answers

Question 1

A system administrator needs to install a GPU/DPU in a server. The server has a free PCI-e slot, there are enough free PCI-e lanes, and there is enough room for the card. Which procedure should be followed?

Options:

A.

Ensure the server has enough power. Verify compatibility of cables with server's platform. Make sure the server is down to remove cables safely. Do not wear an ESD bracelet.

B.

Ensure the server has enough power. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

C.

Ensure the server has enough power. Make sure the server is up and running with attached cables. Wear an ESD bracelet.

D.

Ensure the server has enough power. Verify compatibility of cables with server's platform. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

Question 2

You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

Options:

A.

Unlimited scalability by adding more DPUs without architectural changes.

B.

Elimination of latency issues in data processing tasks.

C.

Reduced CPU load by offloading data processing tasks to DPUs.

D.

Enhanced I/O performance with NVMe storage access speeds.

Question 3

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Question 4

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager (BCM) cluster. Which two of the following actions are essential for a successful OS installation on the cluster's head node? (Pick the 2 correct responses below)

Options:

A.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

B.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

C.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

D.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.

Question 5

What command sequence is used to identify the exact name of the server that runs as the master SM in a multi-node fabric?

Options:

A.

sminfo, then smpquery ND

B.

ibstat, then sminfo

C.

ibnetdiscover, then ibsim

D.

sminfo, then smpquery NI

Question 6

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.

Reduction of problem size (N) to accelerate computation.

B.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.

Automatic NVLink bandwidth doubling via driver updates.

Question 7

During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?

Options:

A.

Reduce the problem size while maintaining the same block size.

B.

Set PMAP to 1 to enable process mapping.

C.

Increase block size to 6144 to maximize GPU utilization.

D.

Disable double-buffering via BCAST parameter.

Question 8

A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

Options:

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Question 9

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

Options:

A.

nvsm show power

B.

nvsm show powermode

C.

nvsm show health

D.

nvsm show alerts

Question 10

Refer to the output:

~ $ sudo nvsm show healthinfo

—Timestamp: Sat Dec 16 16:26:32 2017 -0800

Version: 17.12-5

Checks—BIOS Revision [5.11].........................

DGX Serial Number [YSY72800016)..................

Verify installed DIMM memory sticks........................Healthy

...[output truncated)

Verify Ethernet controllers...........................Healthy

Verify installed GPU's..............................Unhealthy

Checking output of 'lspci' for expected GPU's

Missing GPU at PCI address '07:00.0'

Verify installed InfiniBand controllers....................Healthy

Verify PCIe switches..................................Healthy

...[output truncated)

What insights can a system administrator gain regarding the DGX system's health?

Options:

A.

A GPU tray upgrade failed.

B.

A GPU is missing on the DGX system.

C.

A GPU driver upgrade has failed.

D.

The system has passed the hardware health check successfully.

Question 11

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Question 12

An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?

Options:

A.

mlxconfig -d q

B.

curl -k -u root: -X GET https:// /redfish/v1/UpdateService/FirmwareList

C.

mstflint -d query full

D.

curl -k -u root: -X GET https:// /redfish/v1/UpdateService/FirmwareInventory

Question 13

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.

Inconclusive; rerun with point-to-point tests.

B.

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.

Critical failure; bus bandwidth exceeds hardware capabilities.

D.

Suboptimal performance; algorithm bandwidth should match bus bandwidth.

Question 14

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Question 15

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

Options:

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Question 16

A media company is developing an AI platform for video content analysis that requires storing and processing large volumes of unstructured video data. The platform must support high throughput for data ingestion and provide efficient access for real-time analytics. Given these requirements, which storage strategy should the company implement?

Options:

A.

Tape storage for its cost-effectiveness and archival capabilities

B.

Block storage for low latency and high performance

C.

File storage for hierarchical organization and easy navigation

D.

Object storage for scalability and metadata management

Question 17

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Question 18

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

Options:

A.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Question 19

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to ’Devices" > select a switch > "Cables' tab to see ASIC firmware and transceiver versions.

B.

Use "Topology’ view to visually inspect cable icons.

C.

Run mlxlink -d lid- -m on each port manually.

D.

Export all switch logs and grep for ’FW Version".

Question 20

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

Options:

A.

Switch from FP64 to FP32 precision.

B.

Disable GPU affinity.

C.

Reduce test duration to 12 hours.

D.

Verify the matrix size is divisible by block size.

Question 21

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

Options:

A.

flint

B.

iblinkinfo

C.

mlxconfig

D.

ethtool

Page: 1 / 7
Total 71 questions