Currently all kernels launch on the default stream (stream 0). Each kernel launch should take a cudaStream_t argument so that we can manage streams ourselves later.
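A minimal sketch of what that change could look like, not the project's actual code. The names scale_kernel, launch_scale and scale_on_own_stream are hypothetical and used for illustration only. It assumes each kernel is launched through a host wrapper, and the wrapper gains a cudaStream_t parameter that defaults to 0, so existing call sites keep running on the default stream. Error checking is mostly omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel, for illustration only: scales each element in place.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host wrapper: the caller chooses the stream.
// The default argument 0 keeps the current default-stream behavior.
cudaError_t launch_scale(float* d_data, float factor, int n,
                         cudaStream_t stream = 0) {
    if (n <= 0) return cudaSuccess;  // a zero-block launch would be invalid
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    // Launch parameters: grid, block, dynamic shared memory bytes, stream.
    scale_kernel<<<blocks, threads, 0, stream>>>(d_data, factor, n);
    return cudaGetLastError();
}

// Example caller that owns its stream: copy in, scale, copy out.
std::vector<float> scale_on_own_stream(const std::vector<float>& input,
                                       float factor) {
    const int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    if (n == 0) return output;

    const size_t bytes = n * sizeof(float);
    float* d_data = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_data, bytes);

    // All three operations are queued on the same stream, so they run in order.
    cudaMemcpyAsync(d_data, input.data(), bytes, cudaMemcpyHostToDevice, stream);
    launch_scale(d_data, factor, n, stream);
    cudaMemcpyAsync(output.data(), d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // Wait for this stream only, not the whole device.
    cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return output;
}
```

With the default argument, a call such as launch_scale(d_data, 2.0f, n) behaves as before, while callers that want control pass their own stream as the last argument. Note that the host buffers here are pageable, so the async copies will not overlap with other work; pinned memory (cudaMallocHost) would be needed for that.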