Essential Abstractions in GCC '10, Workshop on GCC Internals

Solution to Assignment on Parallelization and Vectorization in GCC

Solutions (Level 1)

The statement is a[i] = a[i-1] + 2.
In Iteration 1, the value read is a[0], and the value written is a[1].
In Iteration 2, the value read is a[1], and the value written is a[2].
Each value is therefore first written in one iteration and then read in the next (Read after Write). This is a loop-carried true dependence.
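
One way to see why this dependence fixes the iteration order is to run the same statement with the iterations reversed. The following is a minimal sketch; the function names and the loop bounds (starting at i = 1) are assumptions made for illustration and are not taken from the assignment source.

```c
#include <assert.h>

/* Original order: iteration i reads the a[i-1] that iteration i-1 just wrote. */
void raw_forward(int *a, int n)
{
    assert(n >= 1);
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 2;
}

/* Same statement, iterations executed in reverse order. Each read of
   a[i-1] now happens before that element has been updated, so the
   value no longer flows from one iteration to the next. */
void raw_reversed(int *a, int n)
{
    assert(n >= 1);
    for (int i = n - 1; i >= 1; i--)
        a[i] = a[i - 1] + 2;
}
```

Starting from a five-element array of zeros, the forward loop produces 0, 2, 4, 6, 8, while the reversed loop produces 0, 2, 2, 2, 2. Any reordering of the iterations, including running them in parallel, can therefore change the result.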
The statement is a[i] = b[i].
There is no dependence in this statement. Therefore the loop can be both parallelized and vectorized.
The statement is a[i] = a[i+1].
In Iteration 1, the value read is a[1], and the value written is a[0].
In Iteration 2, the value read is a[2], and the value written is a[1].
Each value is therefore first read in one iteration and then overwritten in the next (Write after Read). This is a loop-carried anti-dependence.
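
The anti-dependence can be illustrated the same way. In the sketch below (hypothetical function names, loop starting at i = 0 as the iterations listed above suggest), reversing the loop changes the result. Reading from a separate, unmodified copy removes the dependence, because no value actually flows between iterations.

```c
#include <assert.h>

/* Original order: a[i+1] is read before a later iteration overwrites it. */
void war_forward(int *a, int n)
{
    assert(n >= 1);
    for (int i = 0; i < n - 1; i++)
        a[i] = a[i + 1];
}

/* Reversed order: a[i+1] has already been overwritten when it is read. */
void war_reversed(int *a, int n)
{
    assert(n >= 1);
    for (int i = n - 2; i >= 0; i--)
        a[i] = a[i + 1];
}

/* Reads come from a separate array that is never written, so the
   iterations are independent. The loop is deliberately reversed to show
   that the order no longer matters. */
void war_from_copy(const int *old, int *a, int n)
{
    assert(n >= 1);
    for (int i = n - 2; i >= 0; i--)
        a[i] = old[i + 1];
}
```

For the input 1, 2, 3, 4 the forward loop gives 2, 3, 4, 4 and the reversed loop gives 4, 4, 4, 4. The copy-based version gives 2, 3, 4, 4 even though it runs in reverse.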
S1 is a[i+2] = b[i] + 5.
S2 is b[i+3] = a[i] + 10.
There is a loop-carried true dependence from S1 to S2 due to a[i+2] and a[i]. There is also a loop-carried true dependence from S2 to S1 due to b[i+3] and b[i]. The statements S1 and S2 are therefore cyclically interlocked by true dependences, so the loop cannot be interchanged, distributed, parallelized or vectorized.
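
The effect of the cycle can be checked by distributing the loop by hand and comparing results. This is a sketch under assumed loop bounds (i from 0 to n-1, arrays holding at least n+3 elements); the function names are hypothetical.

```c
#include <assert.h>

/* Original loop: S1 and S2 alternate, each feeding later iterations of the other. */
void interlocked_fused(int *a, int *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        a[i + 2] = b[i] + 5;   /* S1 */
        b[i + 3] = a[i] + 10;  /* S2 */
    }
}

/* Hand-distributed version: all of S1 runs first, then all of S2.
   S1 now reads b[i] values that S2 has not produced yet. */
void interlocked_distributed(int *a, int *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        a[i + 2] = b[i] + 5;   /* S1 */
    for (int i = 0; i < n; i++)
        b[i + 3] = a[i] + 10;  /* S2 */
}
```

With zero-initialized arrays and n = 6, the original loop ends with a[5] equal to 15 and b[8] equal to 25, while the distributed version ends with a[5] equal to 5 and b[8] equal to 15. Distribution changes the result, which is why it is not a legal transformation here.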
The statement S1 is a[i+1] = a[i] + 2.
There is a loop-carried true dependence. However, predictive commoning loads the value of a[1] into a scalar variable and then reduces the computation to a[i+1] = (some scalar).
The code can therefore be parallelized. It will still not be vectorized, because the predcom pass runs after vectorization.
We can stop the parallelization by disabling predictive commoning (-fno-predictive-commoning).
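
The idea behind predictive commoning can be sketched by hand. The code below is an illustration only, not the output of GCC's predcom pass; the function names are hypothetical, and the loop is assumed to start at i = 1 so that the value carried in the scalar is initially a[1], as described above.

```c
#include <assert.h>

/* Original form: every iteration reloads a[i] from memory. */
void s1_original(int *a, int n)
{
    assert(n >= 0);
    for (int i = 1; i + 1 < n; i++)
        a[i + 1] = a[i] + 2;
}

/* Hand-written equivalent: the value that would be reloaded is kept in
   the scalar t, so the loop body only stores to the array. */
void s1_commoned(int *a, int n)
{
    assert(n >= 0);
    if (n < 3)
        return;                /* the original loop runs zero times */
    int t = a[1];              /* loaded once, before the loop */
    for (int i = 1; i + 1 < n; i++) {
        t = t + 2;
        a[i + 1] = t;
    }
}
```

Both functions leave the array in the same state. In the second form the memory read of a[i] has disappeared from the loop body, which is the property the explanation above relies on.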

Solutions (Level 2)

The statement is a[i] = a[i-4] + 5.
This is a read-after-write dependence, i.e., a true dependence, with a dependence distance of 4. However, the vectorization factor is also 4, so the loop can be vectorized: the four iterations executed together never read a value written by that same vector operation.
If you change a[i-4] to a[i-3], the statement can no longer be vectorized. In fact, any distance less than 4 causes a read-write conflict and prevents vectorization.
Therefore, if the dependence distance is at least the vectorization factor, the statement can be vectorized.
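
The role of the dependence distance can be checked with a small simulation of a 4-wide vector operation in plain C. This is a sketch, not what the vectorizer emits: the names VF, scalar_loop and chunked_loop are assumptions, and the distance d is a parameter so that both cases (4 and 3) can be tried.

```c
#include <assert.h>

#define VF 4  /* assumed vectorization factor */

/* Reference semantics: a[i] = a[i-d] + 5, one element at a time. */
void scalar_loop(int *a, int n, int d)
{
    assert(d >= 1);
    for (int i = d; i < n; i++)
        a[i] = a[i - d] + 5;
}

/* Simulated vector execution: load VF elements, then add and store VF elements. */
void chunked_loop(int *a, int n, int d)
{
    assert(d >= 1);
    int i = d;
    for (; i + VF <= n; i += VF) {
        int tmp[VF];
        for (int k = 0; k < VF; k++)   /* "vector load" of a[i-d .. i-d+VF-1] */
            tmp[k] = a[i - d + k];
        for (int k = 0; k < VF; k++)   /* "vector add and store" to a[i .. i+VF-1] */
            a[i + k] = tmp[k] + 5;
    }
    for (; i < n; i++)                 /* scalar epilogue for leftover elements */
        a[i] = a[i - d] + 5;
}
```

On a 12-element array of zeros, both functions agree for d = 4. For d = 3 the scalar loop sets a[6] to 10, but the chunked version sets it to 5, because the load of a[3] happens before a[3] is written in the same chunk.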
Again, a[i] = a[i-1] has a true dependence.
However, the second statement, b[i] = c[i], has no dependence. Loop distribution can move it into a separate loop, and then statement S2 can be vectorized independently.
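
Written out by hand, the distribution looks roughly as follows. The function names and the loop bounds (starting at i = 1) are assumptions made for illustration.

```c
#include <assert.h>

/* Original loop: S1 carries a true dependence, S2 is independent. */
void two_statements_fused(int *a, int *b, const int *c, int n)
{
    assert(n >= 0);
    for (int i = 1; i < n; i++) {
        a[i] = a[i - 1];  /* S1 */
        b[i] = c[i];      /* S2 */
    }
}

/* After distribution: S1 stays in a sequential loop, and S2 sits in its
   own loop with no dependence, which is a candidate for vectorization. */
void two_statements_distributed(int *a, int *b, const int *c, int n)
{
    assert(n >= 0);
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1];  /* S1 */
    for (int i = 1; i < n; i++)
        b[i] = c[i];      /* S2 */
}
```

Since S2 neither reads nor writes anything that S1 touches, both versions compute the same arrays. This is what makes the distribution legal here.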
In C, arrays are stored in row-major order. Therefore, having j as the induction variable of the innermost loop while j indexes the row is not beneficial, because consecutive iterations touch different rows.
We can interchange the loops and make i the induction variable of the innermost loop, so that the elements of one row are accessed consecutively.
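
A minimal sketch of the two loop orders, assuming the array is indexed as a[j][i]; the array size N, the function names and the value being stored are hypothetical.

```c
#include <assert.h>

#define N 64  /* assumed array dimension */

/* Original order: j is innermost and selects the row, so consecutive
   iterations jump N elements apart in memory. */
void fill_j_inner(int a[N][N])
{
    assert(a != 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] = i + 2 * j;
}

/* Interchanged order: i is innermost and selects the column, so
   consecutive iterations touch adjacent elements of row j. */
void fill_i_inner(int a[N][N])
{
    assert(a != 0);
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = i + 2 * j;
}
```

Both functions store the same values. Only the order of the memory accesses differs, which is why the interchange is legal and why it can improve locality.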
This is simple C code with a conditional branch. It can be parallelized. However, the data dependence tests of the Lambda framework cannot deal with conditional code, and therefore the code is not parallelized.
The data dependence test of Graphite, on the other hand, determines that the statement can be parallelized, and so the conditional loop is parallelized.
The statement a[i][j] = **(a + 4*j) induces a dependence. However, the data dependence tests of the Lambda framework do not handle pointer arithmetic, and therefore it ends up parallelizing the loop.
The Graphite framework, in contrast, determines that a[i][j] and **a belong to the same alias set, and therefore that the statement carries a dependence. Hence it does not parallelize the loop.
The loop has a stride of 2 (i += 2). Because the stride is greater than 1, the Lambda framework cannot parallelize the code. The Graphite framework, however, parallelizes it successfully.
If you try to interchange the loops, the interchange will not succeed. Inspect the complete dump at the point where the chrecs are created. You will see that scalar evolution fails because the expression i*j is not affine.


Essential Abstractions in GCC '10 A Workshop on GCC Internals by GCC Resource Center July 5-8, 2010, IIT Bombay Department of Computer Science & Engineering Indian Institute of Technology, Bombay
All content copyright © 2010, GCC Resource Center