I am trying to understand how to use streams in pytorch. I believe that currently my model is getting much less out of the GPU than it could. (It is hard to tell where the bottlenecks are, but one tipoff is that nvidia-smi reports sm usage of only around 33%.)

A prior confusion I have about pytorch, before even getting to the topic of streams, is about when pytorch waits for kernels to finish running. I have tried line profiling my code with the python line profiler, and the numbers I get for how much time is spent on each line roughly correspond to how long I would expect the corresponding computation to take on the GPU (though, as noted below, simple operations are not as much faster than complex ones as I might expect, and a further caveat is that the line profiler doesn't provide any kind of variance estimate). While such numbers are good for helping me see where the bottlenecks are, they seem to imply that pytorch waits for each computation to finish after each line, and not just when, say, I try to print out the result of a computation. I would think that if pytorch just did an asynchronous kernel launch and immediately returned, it would be faster. If I am right that pytorch waits, that would explain why my naive attempt to use streams below fails to improve performance.

What I tried was making a simple class to run code in parallel on different streams, like so: `class StreamSpreader():`

I tried three implementations of an LSTM. In the first, I do separate matrix multiplies of the hidden state and the input and add the results; this is like what pytorch does when not using cudnn, I believe. In the second implementation, I use streams via my StreamSpreader class. (By the way, if you are wondering why I don't just use the built-in LSTM, it is because I actually want to use a somewhat different architecture that is not supported as a built-in.)

One question I have: if this implementation doesn't work because pytorch waits for each operation to complete before launching the next, is it at least possible to make a working StreamSpreader class with the same API?
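Whether the streams idea can help at all comes down to whether kernel launches are asynchronous, and that is easiest to see at the CUDA level, independent of pytorch. Below is a minimal sketch, not the StreamSpreader itself and not pytorch code, of the pattern the class is trying to reproduce: two independent kernels queued on separate streams, with the host synchronizing only once after everything is enqueued. The kernel, sizes, and names are invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel standing in for one independent branch of the computation.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches return to the host immediately; because the kernels are on
    // different streams and touch different buffers, the device is free to run
    // them concurrently (resources permitting).
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, 2.0f, n);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(b, 3.0f, n);

    // Wait only once, after all the work has been queued.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}
```

If the host instead blocked until each launch finished before issuing the next, which is what the line-profiler numbers above seem to suggest is effectively happening, the two kernels could never overlap no matter how the streams are arranged, and a StreamSpreader built on top of such launches would not help.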
In this program, I want to calculate multiple elements of a matrix by using one thread to compute multiple elements of the matrix product. My CPU reference is `void cpu_matrixMul(int *a, int *b, int *c, int N)`. I have applied the suggested code and see that, in this problem, it works correctly only when the size of the data area (tile_width) covered by each thread is the same for all threads. For example, suppose matrices A and B are 16x16. If I use 4 threads (BWIDTH=2), then each thread computes an 8x8 sub-matrix (tile_width = 8). In the other case, still using 4 threads but assigning tile_width = 10, the result goes wrong. So I have to define a (grid, block) structure that fits the data. Besides, I also have some related questions, please help to clarify:

1) How should I handle cases where MWIDTH/(MTILE*BWIDTH) is not an integer? Of course, in all cases we could use one thread to compute all elements, as the CPU does.

2) How does the structure of the grid and block affect the performance of the program? How do I choose the grid and block structure that gives the best performance?

3) I have tried some data samples (increasing the matrix size significantly) to compare the time taken by the GPU and the CPU. I see that the time measured on the GPU is larger than on the CPU, and the gap grows as the matrix size increases. Can you suggest what I'm missing here? Here is my code to measure time on the GPU and CPU.

It would help if you show a complete code. If each thread in your 2D thread array is responsible for a tile_width*tile_width portion of the matrix, then I don't think these calculations are correct:

```cuda
int start_row = blockDim.y*blockIdx.y + threadIdx.y*tile_width;
int start_col = blockDim.x*blockIdx.x + threadIdx.x*tile_width;
```

I think they should be like this:

```cuda
int start_row = (blockDim.y*blockIdx.y + threadIdx.y)*tile_width;
int start_col = (blockDim.x*blockIdx.x + threadIdx.x)*tile_width;
```

Also, this line in your kernel doesn't make sense: `d_p = P_val;`, since d_p and P_val aren't defined anywhere in your kernel. Probably you meant something like `c = sum;`. So I don't see how you could actually be running this code. Nevertheless, I'm confused by the statement "I have compiled but it gets wrong result," since I don't see how you could have compiled the code you have shown. Anyway, with the above changes, I was able to compute a sensible result in a piece of test code. If you're still having trouble, make sure you are using proper cuda error checking (hint: google "proper cuda error checking"), and if you still want help, please post a complete code that someone else could copy, paste, and compile without having to add or change anything.
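To make the corrected indexing and the non-divisible case concrete, here is a small, self-contained sketch along the lines the question describes. It is not the original poster's program; the names gpu_matrixMul, BWIDTH, and tile_width are borrowed from the question, and everything else is filled in for illustration. The bounds checks inside the kernel are what handle a matrix width that is not a multiple of tile_width times the number of threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BWIDTH 2   // block is BWIDTH x BWIDTH threads

__global__ void gpu_matrixMul(const int *a, const int *b, int *c,
                              int N, int tile_width) {
    // Corrected index math: the whole (block, thread) index is scaled by tile_width.
    int start_row = (blockDim.y * blockIdx.y + threadIdx.y) * tile_width;
    int start_col = (blockDim.x * blockIdx.x + threadIdx.x) * tile_width;

    // The row < N and col < N checks discard the overhang when N is not an
    // exact multiple of tile_width * BWIDTH * gridDim.
    for (int row = start_row; row < start_row + tile_width && row < N; ++row)
        for (int col = start_col; col < start_col + tile_width && col < N; ++col) {
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += a[row * N + k] * b[k * N + col];
            c[row * N + col] = sum;
        }
}

int main() {
    const int N = 16, tile_width = 10;   // tile_width deliberately does not divide N
    const size_t bytes = N * N * sizeof(int);
    int *h_a = (int*)malloc(bytes), *h_b = (int*)malloc(bytes), *h_c = (int*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { h_a[i] = i % 7; h_b[i] = i % 5; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Round the thread count up so every tile_width-wide strip is covered.
    int threads_per_dim = (N + tile_width - 1) / tile_width;
    dim3 block(BWIDTH, BWIDTH);
    dim3 grid((threads_per_dim + BWIDTH - 1) / BWIDTH,
              (threads_per_dim + BWIDTH - 1) / BWIDTH);
    gpu_matrixMul<<<grid, block>>>(d_a, d_b, d_c, N, tile_width);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Check one element against the same dot product computed on the CPU.
    int ref = 0;
    for (int k = 0; k < N; ++k) ref += h_a[0 * N + k] * h_b[k * N + 0];
    printf("c[0][0] = %d, cpu ref = %d\n", h_c[0], ref);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

With N = 16 and tile_width = 10 this launches 2 threads per dimension; the second thread's tile is simply clipped by the row < N and col < N checks. That is the usual way of handling sizes that do not divide evenly: round the thread count up and discard the overhang inside the kernel.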
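On the timing question, two things commonly produce "GPU slower than CPU" numbers in a setup like this. First, with only a handful of threads (4 in the example above) the kernel uses almost none of the GPU's parallelism, so it can genuinely be slower than a CPU loop. Second, kernel launches are asynchronous, so a host-side timer is only meaningful if you synchronize before stopping it, and the host-device copies often dominate for small matrices. A common way to time just the kernel is with CUDA events; this is a generic sketch, not the poster's timing code, and it assumes d_a, d_b, d_c, grid, block, N, and tile_width are set up as in the previous example.

```cuda
// Time only the kernel with CUDA events. The copies to and from the device
// would be timed separately, or included deliberately, depending on what you
// want to compare against the CPU version.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gpu_matrixMul<<<grid, block>>>(d_a, d_b, d_c, N, tile_width);
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // the launch is asynchronous, so wait before reading the timer

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```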