3 posts tagged with "c"

Get Started With CUDA Memory Management

May 15, 2024 · 5 min read

Ordinary Magician | Half stack developer

Data transfer between host and device

In CUDA programming, memory management functions are essential for optimizing data transfer between the host (CPU) and the device (GPU).

Copy from/to Pageable Memory

Figure: Copy from pageable memory

In this case, you move data explicitly between host memory and device memory.

You first allocate memory on the host with malloc, allocate a matching buffer on the device with cudaMalloc, and copy the host data to the device via cudaMemcpy. When the computation on the device is finished, you copy the result back to the host via cudaMemcpy again.
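The round trip described above can be sketched roughly as follows. This is a minimal illustration, not code from the original post: the kernel `scale_by_two` and the helper `pageable_round_trip` are hypothetical names, and the block size of 256 is an arbitrary choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale_by_two(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copies n floats from pageable host memory to the device, runs the kernel,
// copies the result back, and returns true if every value was doubled.
bool pageable_round_trip(int n) {
    size_t bytes = n * sizeof(float);

    // Ordinary malloc returns pageable host memory.
    float *h_data = (float *)malloc(bytes);
    if (h_data == nullptr) return false;
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Matching buffer on the device.
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        free(h_data);
        return false;
    }

    // Host -> device copy.
    bool ok = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;  // arbitrary block size for this sketch
        int blocks = (n + threads - 1) / threads;
        scale_by_two<<<blocks, threads>>>(d_data, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Device -> host copy; this cudaMemcpy waits for the kernel to finish.
    ok = ok && cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Verify the result on the host.
    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == 2.0f * (float)i);

    cudaFree(d_data);
    free(h_data);
    return ok;
}

int main() {
    bool ok = pageable_round_trip(1 << 20);
    printf("round trip %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

Compiling with something like `nvcc roundtrip.cu -o roundtrip` (the file name is only an example) and running it on a machine with a CUDA-capable GPU should exercise both copy directions.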

Get Started With CUDA Execution Model

February 22, 2024 · 17 min read

VisualDust

Ordinary Magician | Half stack developer

Every optimization starts from the same goal: writing programs that "squeeze" more performance out of the hardware.

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SM). GPU hardware parallelism is achieved through the replication of this architectural building block.

Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.
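To make the numbers above concrete, the sketch below queries the CUDA Runtime for the SM count and the per-SM thread limit of a device. The helper name `max_resident_threads` is hypothetical, and device 0 is assumed to be the GPU of interest; the printed product is only an upper bound on concurrently resident threads, since actual residency depends on each kernel's resource usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns (number of SMs) * (max resident threads per SM)
// for the given device, or -1 if the device properties cannot be queried.
int max_resident_threads(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}

int main() {
    int device = 0;  // assumption: the first visible GPU
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                      %s\n", prop.name);
    printf("Streaming Multiprocessors:   %d\n", prop.multiProcessorCount);
    printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:                   %d\n", prop.warpSize);
    printf("Upper bound on resident threads: %d\n", max_resident_threads(device));
    return 0;
}
```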

Key components of a Fermi SM are:

CUDA Cores
Shared Memory/L1 Cache
Register File
Load/Store Units
Special Function Units
Warp Scheduler

Figure: Key components of an SM in Fermi (a GPU architecture)

Get Started With CUDA Programming Model

February 18, 2024 · 55 min read

VisualDust

Ordinary Magician | Half stack developer

Sonder

HPC Engineer

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It enables developers to utilize the immense computational power of NVIDIA GPUs (Graphics Processing Units) for general-purpose processing tasks beyond graphics rendering.

Key features of CUDA programming include:

Parallelism: CUDA enables developers to exploit parallelism at multiple levels, including thread-level, instruction-level, and data-level parallelism, allowing for efficient computation on GPUs.
CUDA C/C++ Language Extensions: CUDA extends the C/C++ programming languages with additional keywords and constructs to facilitate programming for GPU architectures, making it easier to write parallel code.
CUDA Runtime API: The CUDA Runtime API provides a set of functions for managing GPU devices, memory allocation, data transfer between CPU and GPU, and launching kernel functions (the functions executed on the GPU).
CUDA Libraries: NVIDIA provides a collection of libraries optimized for GPU computing tasks, such as cuBLAS for linear algebra, cuFFT for Fast Fourier Transforms, cuDNN for deep neural networks, and more.
CUDA Toolkit: The CUDA Toolkit includes compilers, debuggers, profilers, and other development tools necessary for CUDA programming.

CUDA programming allows developers to harness the massive parallel processing power of GPUs to accelerate a wide range of computational tasks, including scientific simulations, image and video processing, machine learning, and more.
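As a rough illustration of how the language extensions and the Runtime API fit together, the sketch below adds two vectors on the GPU. The names `vector_add` and `run_vector_add` are hypothetical, and the block size of 256 is an arbitrary choice. `__global__` marks a function that runs on the device, and the `<<<blocks, threads>>>` syntax launches it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __global__ is a CUDA language extension: the function runs on the GPU
// and is launched from host code.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Built-in variables give each thread its position in the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: computes c = a + b on the device for n elements
// and returns true if the result matches the expected values.
bool run_vector_add(int n) {
    size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n), h_b(n), h_c(n);
    for (int i = 0; i < n; ++i) {
        h_a[i] = (float)i;
        h_b[i] = 1.0f;
    }

    // Runtime API: device memory allocation.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess;

    // Runtime API: data transfer from CPU to GPU.
    ok = ok && cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;  // arbitrary block size for this sketch
        int blocks = (n + threads - 1) / threads;
        // Kernel launch: a grid of `blocks` thread blocks, `threads` threads each.
        vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }

    // Runtime API: data transfer from GPU back to CPU.
    ok = ok && cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h_c[i] == (float)i + 1.0f);

    // cudaFree accepts a null pointer, so this is safe on partial failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}

int main() {
    bool ok = run_vector_add(1 << 16);
    printf("vector_add %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```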
