That First CUDA Blog I Needed
In this blog, we’re going to dive into one of the most critical concepts in CUDA programming: shared memory. Shared memory is like the secret ingredient that can supercharge your GPU code. While CUDA’s global memory serves as the main storage, it’s often slow to access repeatedly. That’s where shared memory comes in. It acts as a customizable, fast-access scratchpad where you can store data that is frequently reused by threads within the same block, helping you avoid costly memory transfers. We’ll explore how this works, why it matters, and how you can use it to make your CUDA programs much faster.

1. Hello World
A good way to learn something new is to begin from something you already know and then connect the dots. Let us first look at a simple Hello World program in C++.
#include <iostream>

int main(){
    std::cout << "Hello World!" << std::endl;
}
To execute this program on your machine, follow these steps:
- Open a terminal and navigate to the directory containing this .cpp file:
cd that_first_cuda_blog/1_hello_world
- Compile the program using the following command:
g++ hello_world.cpp -o hello_world
- This will create an executable hello_world in this directory. Execute it using:
./hello_world
This should print Hello World! in the terminal, as expected. Here, the g++ compiler compiles the source code hello_world.cpp and translates it into machine code, in the form of an executable file. The CPU then executes this machine code to print Hello World! onto the terminal.
If one intends to execute the same on an NVIDIA GPU, CUDA can be used.
CUDA is a programming framework that allows programmers to talk to NVIDIA GPUs via the CPU.
The key idea CUDA adds on top of ordinary C++ is parallelism through threads: instead of a single CPU thread executing instructions one after another, a CUDA program launches many lightweight GPU threads that all run the same function at the same time. We will see this in action shortly.
Let us look at a simple Hello World example in CUDA.
#include <iostream>
#include <cuda_runtime.h>

__global__ void gpu_hello_world(){
    printf("Hello World from GPU! \n");
}

int main(){
    std::cout << "Hello World from CPU!" << std::endl;
    gpu_hello_world<<<1,1>>>();
    cudaDeviceSynchronize();
}
To execute this program on your machine, follow these steps:
- Open a terminal and navigate to the 1_hello_world directory:
cd that_first_cuda_blog/1_hello_world
- Compile the program using the following command:
nvcc hello_world_gpu.cu -o hello_world_gpu
- This will create an executable hello_world_gpu in this directory. Execute it using:
./hello_world_gpu
- The output should be the following:
Hello World from CPU!
Hello World from GPU!
Now that we have our first CUDA program running, let us dissect it and understand how it works from first principles.
__global__ void gpu_hello_world(){
    printf("Hello World from GPU! \n");
}
The code snippet above is a function that is intended to run on the GPU.
In GPU jargon, such a function is called a kernel.
A kernel, specifically, is a special function that can be invoked from the CPU but runs only on the GPU. The CPU is generally referred to as the host and the GPU as the device, since the CPU hosts the GPU in some sense. The __global__ keyword specifies that this function is a kernel, in that it can be called from the host but is executed on the device.
gpu_hello_world<<<1,1>>>(); is CUDA-specific syntax. We will discuss what <<<1,1>>> means later in this blog. For now it is sufficient to understand that <<<1,1>>> allocates 1 thread for executing this kernel.
Let us first understand how this kernel launch works from first principles.
- The host (CPU) executes instructions (compiled lines of code) one at a time, sequentially.
- When it reaches the kernel launch instruction (gpu_hello_world<<<1,1>>>();), the host launches the kernel. Under the hood, the CUDA Runtime Library on the host places the launch command onto a CUDA Stream, which is a queue maintained on the host. This queue holds kernel launches, memory transfer requests and other CUDA tasks, ensuring they execute sequentially within the same CUDA Stream. We will dissect CUDA Streams later in this blog.
- The CUDA Runtime now hands over the launch command to the NVIDIA Driver on the host, which is responsible for talking to the device (GPU).
- The NVIDIA Driver pushes this launch command to the command buffer, which is managed by the GPU hardware. This buffer resides on the device and holds the commands to be executed once sufficient GPU resources are available.
- The GPU, once resources are available, pulls commands from the command buffer and starts executing them.
- The host does not wait for the kernel execution to finish, and moves on to the next instruction. This execution approach is known as asynchronous: the host and device execute independently and simultaneously. When a command like a kernel launch is issued by the host, it does not wait for the command to complete on the device, but simply moves on to the next instruction, while the device handles the requested operation in parallel.
To understand this better, we can change the earlier hello_world_gpu.cu source code by commenting out cudaDeviceSynchronize();.
#include <iostream>
#include <cuda_runtime.h>

__global__ void gpu_hello_world(){
    printf("Hello World from GPU! \n");
}

int main(){
    std::cout << "Hello World from CPU!" << std::endl;
    gpu_hello_world<<<1,1>>>();
    // Comment out this line.
    // Now the host does not wait for the device and moves on.
    // cudaDeviceSynchronize();
}
The output of this program will be just the following:
Hello World from CPU!
Note that since we removed cudaDeviceSynchronize();, the host launches the gpu_hello_world kernel and moves on to the next instruction. The execution of the host code finishes even before the device completes, hence Hello World from GPU! never gets printed to the output.
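Because kernel launches are asynchronous, errors inside a kernel also surface only when the host synchronizes with the device. Below is a minimal sketch (an illustration, not part of the programs above) of the usual checking pattern with the CUDA runtime calls cudaGetLastError() and cudaDeviceSynchronize():
#include <iostream>
#include <cuda_runtime.h>

__global__ void gpu_hello_world(){
    printf("Hello World from GPU! \n");
}

int main(){
    gpu_hello_world<<<1,1>>>();
    // cudaGetLastError() reports problems with the launch itself,
    // e.g. an invalid launch configuration.
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess){
        std::cerr << "Launch error: " << cudaGetErrorString(launch_err) << std::endl;
    }
    // cudaDeviceSynchronize() blocks the host until the device is done,
    // and returns any error that occurred while the kernel was running.
    cudaError_t sync_err = cudaDeviceSynchronize();
    if (sync_err != cudaSuccess){
        std::cerr << "Kernel error: " << cudaGetErrorString(sync_err) << std::endl;
    }
}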
Let us now extend our single-thread CUDA Hello World to run with 8 threads. We would like the GPU to repeat this same “Hello World from GPU” operation 8 times. Just one small change in our original code will make this happen.
#include <iostream>
#include <cuda_runtime.h>

__global__ void gpu_hello_world(){
    printf("Hello World from GPU! \n");
}

int main(){
    std::cout << "Hello World from CPU!" << std::endl;
    // HERE, we replace <<<1,1>>> with <<<1,8>>>.
    gpu_hello_world<<<1,8>>>();
    cudaDeviceSynchronize();
}
The output of this code will be one Hello World from CPU! and 8 Hello World from GPU!s. The main change, as noted in the comment above the kernel launch, is replacing <<<1,1>>> with <<<1,8>>>, which essentially means launching the same kernel with 8 threads. The GPU runs 8 “print Hello World” operations in parallel.
We will understand what <<<1,8>>> exactly means in absolute detail later, but at this point it is sufficient to understand that <<<1,1>>> launches one thread and <<<1,8>>> launches 8 threads in parallel.
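Since all 8 threads print the same text, the output alone does not tell them apart. A small variation (a sketch, not part of the original example) uses the built-in variable threadIdx.x, which gives each thread its index within the block, so that every thread announces who it is:
#include <iostream>
#include <cuda_runtime.h>

__global__ void gpu_hello_world(){
    // threadIdx.x is a built-in variable: the index of this thread
    // within its block (0 to 7 for an 8-thread launch).
    printf("Hello World from GPU thread %u! \n", threadIdx.x);
}

int main(){
    std::cout << "Hello World from CPU!" << std::endl;
    gpu_hello_world<<<1,8>>>();
    cudaDeviceSynchronize();
}
The 8 GPU lines may appear in any order, since the threads execute in parallel. The next section builds on exactly this threadIdx.x idea.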
In summary, in this Hello World section, we first looked at how to print Hello World using the CPU, followed by the same using the GPU. The major takeaway from this section is understanding what kernels are in general, and how exactly a kernel is launched from the host to run the same operation in parallel on the device.
2. Print Square of Numbers
The basic foundation is now laid, and we will now build some more on top of it. Let us print the squares of a list of integers.
#include <iostream>
#include <cuda_runtime.h>

#define N 5

__global__ void print_square(){
    // Each of the N threads reads its own index and prints its square.
    unsigned id = threadIdx.x;
    printf("%u\n", id * id);
}

int main(){
    // CPU version: a sequential loop over the N integers.
    for(int i = 0; i < N; i++){
        std::cout << i * i << std::endl;
    }
    // GPU version: launch one block of N threads, one thread per integer.
    print_square<<<1,N>>>();
    cudaDeviceSynchronize();
}
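As with the earlier programs, this can be compiled and run with nvcc. Assuming the source is saved as print_square.cu (the file name is not fixed by the snippet itself):
nvcc print_square.cu -o print_square
./print_square
The CPU loop prints 0, 1, 4, 9 and 16 in order; the N GPU threads then each print their own id * id, though those lines are not guaranteed to appear in ascending order, since the threads run in parallel.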