Write a CUDA program to perform a convolution.
We will first write a specialized convolution kernel to perform the convolution. Then we will use the kernel to compute a weighted running average.
The Convolution
In the textbook, we assume that the convolution mask is symmetric: the mask has an odd number of elements, and its center is aligned with the element whose convolution we want to compute.
Here we want to program a right-aligned convolution mask. In a right-aligned convolution mask, the rightmost element is aligned with the element whose convolution we want to compute. Since the mask is not symmetric, it does not matter whether the number of elements in the mask is even or odd.
Further, the textbook assumes that halo cells have a default value of zero. Here, we would like to try a different strategy and assume that each halo cell has the same value as the element at the boundary.
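The right-aligned, boundary-clamped convolution described above can be sketched for a single output element as follows. This is only an illustration under stated assumptions: the helper name `right_aligned_conv_at` and its parameter list are hypothetical and do not appear in the provided skeleton, and the mask is passed as an ordinary pointer here rather than read from constant memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of the assignment skeleton): computes one
// output element of a right-aligned convolution. The rightmost mask element
// lines up with input[i]; reads that fall before index 0 reuse input[0].
__host__ __device__ inline float right_aligned_conv_at(const float *input,
                                                       const float *mask,
                                                       int mask_size, int i)
{
    float total = 0.0f;
    int start = i - (mask_size - 1);    // input index that lines up with mask[0]

    for (int j = 0; j < mask_size; j++)
    {
        int k = start + j;
        if (k < 0)
        {
            k = 0;                      // halo cell: clamp to the boundary element
        }
        total += mask[j] * input[k];
    }
    return total;
}
```

Because the mask is right-aligned, only the left edge of the input needs halo handling; no read ever goes past `input[i]`. For element 1 of the sample data in the next section, the three clamped taps all read 767.725, giving 0.6 * 767.725 + 0.4 * 952.943, which is approximately 841.812.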
The Running Average
Once we have implemented the convolution kernel, we can use it to compute the running average. For example, an unweighted 4-day running average can be computed using a convolution mask of {0.25, 0.25, 0.25, 0.25}. Here, we want to compute a weighted 4-day running average. The mask we will use is {0.1, 0.2, 0.3, 0.4}, which gives recent data more influence on the running average.
For example, if the input array has the following data:
767.725, 952.943, 488.705, 291.886, 309.881, 475.827, ...
Then the running average using the above-mentioned mask should be:
767.725, 841.812, 711.682, 530.727, 404.554, 390.543, ...
The Program
Your program must be a CUDA program. When executed, it should launch a grid of 256 blocks with 256 threads in each block. However, your kernel must not assume this configuration; that is, it must calculate the total number of threads from gridDim and blockDim.
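One way to satisfy the requirement above is a grid-stride loop, sketched below. The kernel name `average_kernel_sketch` is hypothetical (the skeleton's kernel is `average_kernel`), and the sketch assumes a one-dimensional grid with one-dimensional blocks and that the mask has already been copied into `MASK`.

```cuda
#include <cuda_runtime.h>

#define MAX_MASK_SIZE 10

// Same declaration as in the provided skeleton.
__constant__ float MASK[MAX_MASK_SIZE];

// Sketch only: a grid-stride version of the kernel. It never hard-codes
// 256 x 256; the thread count comes from gridDim and blockDim at run time.
__global__ void average_kernel_sketch(float *output, const float *input,
                                      int input_size, int mask_size)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int stride = gridDim.x * blockDim.x;                 // total number of threads

    while (index < input_size)
    {
        float total = 0.0f;
        int start = index - (mask_size - 1);   // lines up with MASK[0]

        for (int j = 0; j < mask_size; j++)
        {
            int k = start + j;
            if (k < 0)
            {
                k = 0;                         // halo cell: reuse the boundary value
            }
            total += MASK[j] * input[k];
        }

        output[index] = total;
        index += stride;                       // move on to this thread's next element
    }
}
```

Launched as `average_kernel_sketch<<<256, 256>>>(d_output, d_input, n, mask_size)`, where `d_output` and `d_input` are assumed names for device buffers, each of the 65,536 threads handles elements index, index + 65536, and so on. Any n is covered, and the kernel stays correct if the launch configuration changes.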
You MUST use constant memory to store the convolution mask. The declaration and allocation of the constant memory have been provided (and you are not allowed to change them). Use of shared memory to cut down global memory bandwidth consumption is encouraged but not required.
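A small sketch of how the mask might reach constant memory is shown below. It uploads a host array with `cudaMemcpyToSymbol` and reads it back through a tiny kernel so the copy can be checked. The names `read_mask_kernel` and `upload_and_read_mask` are hypothetical and exist only for this illustration; the same allocate, copy, launch, copy back, free pattern is what the numbered steps in the skeleton's `process_data` describe.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define MAX_MASK_SIZE 10

// Same declaration as in the provided skeleton.
__constant__ float MASK[MAX_MASK_SIZE];

// Hypothetical check kernel: each thread copies one mask element out of
// constant memory so the host can confirm that the upload worked.
__global__ void read_mask_kernel(float *out, int mask_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < mask_size)
    {
        out[i] = MASK[i];
    }
}

// Hypothetical helper: uploads `mask` to constant memory, reads it back
// through the kernel above into `readback`, and returns the first CUDA
// error encountered (cudaSuccess if everything worked).
cudaError_t upload_and_read_mask(const float *mask, int mask_size, float *readback)
{
    if (mask_size <= 0 || mask_size > MAX_MASK_SIZE)
    {
        return cudaErrorInvalidValue;
    }

    size_t bytes = mask_size * sizeof(float);
    float *d_out = NULL;

    cudaError_t err = cudaMalloc((void **)&d_out, bytes);
    if (err != cudaSuccess)
    {
        return err;
    }

    // Constant memory is written with cudaMemcpyToSymbol, not plain cudaMemcpy.
    err = cudaMemcpyToSymbol(MASK, mask, bytes);
    if (err == cudaSuccess)
    {
        read_mask_kernel<<<1, MAX_MASK_SIZE>>>(d_out, mask_size);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
    {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(readback, d_out, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_out);
    return err;
}
```

Checking the return value of each runtime call, as this sketch does, is optional for the assignment but makes a failed upload or launch much easier to spot.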
Both the input and output files have the same format. The first line contains a single integer n indicating how many data values are in the file. This is followed by n lines, each containing a single floating-point value (in “%.3f” format). You do not have to worry about file handling, as the main function handles all of these tasks. You can further assume that the input file format is correct and that no error checking is needed.

What I have tried:

<pre>#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>   /* malloc, free, exit */

#define MAX_MASK_SIZE 10

__constant__ float MASK [MAX_MASK_SIZE];

__global__ void average_kernel (float *output, float *input, int input_size, int mask_size)
{
    /******************/
    /* Your code here */
    /******************/

    /* 1. calculate thread id, and use it as index to output */
    /* 2. calculate number of threads                        */
    /* 3. while index < input_size                           */
    /* 4.     initialize a running total                     */
    /* 5.     calculate start index                          */
    /* 6.     perform convolution with a for loop            */
    /* 7.     write running total to output                  */
    /* 8.     increment index appropriately                  */

}

void process_data (float *output, float *input, float *mask, int input_size, int mask_size)
{
    /******************/
    /* Your code here */
    /******************/

    /* 1. declare device memory              */
    /* 2. allocate device memory             */
    /* 3. copy input data into device memory */
    /* 4. copy mask into constant memory     */
    /* 5. invoke kernel                      */
    /* 6. copy output from device memory     */
    /* 7. deallocate device memory           */
}

int main (int argc, char **argv)
{
    FILE *infile;
    FILE *outfile;
    
    float *input;
    float *output;
    float mask [] = {0.1, 0.2, 0.3, 0.4};
    
    int i;
    int n;
    
    if (argc < 3)
    {
        fprintf (stderr, "Usage: %s <infile> <outfile>\n", argv [0]);
        exit (1);
    }
    
    infile = fopen (argv [1], "r");
    if (infile == NULL)
    {
        fprintf (stderr, "Error: cannot open input file [%s].\n", argv [1]);
        exit (1);
    }
    
    fscanf (infile, "%d", &n);
    input = (float *) malloc (n * sizeof (float));
    
    for (i = 0; i < n; i++)
    {
        fscanf (infile, "%f", &(input [i]));
    }
    
    fclose (infile);
    
    output = (float *) malloc (n * sizeof (float));
    process_data (output, input, mask, n, 4);
    
    outfile = fopen (argv [2], "w");
    fprintf (outfile, "%d\n", n);
    
    for (i = 0; i < n; i++)
    {
        fprintf (outfile, "%.3f\n", output [i]);
    }
    
    fclose (outfile);
    
    free (input);
    free (output);
    
    return 0;
}
</pre>

Please help me write the bodies of average_kernel and process_data, the two places marked "Your code here" in the code above. Thanks in advance.