|
Solution to Assignment on Parallelization and Vectorization in GCC
Solutions (Level 1)
The statement is a[i] = a[i-1] + 2.
In Iteration 1, the value read is a[0], and the value written is a[1].
In Iteration 2, the value read is a[1], and the value written is a[2].
The values therefore are first written into, and then read (Read after Write). This is an loop carried true dependency.
The statement is a[i] = b[i].
There is no dependence in this statement. Therefore the loop can be both parallelized and vectorized.
The statement is a[i] = a[i+1].
In Iteration 1, the value read is a[1], and the value written is a[0].
In Iteration 2, the value read is a[2], and the value written is a[1].
The values are therefore first read from, and then written into (Write after Read). This is therefore loop carried antidependence.
S1 == a[i+2] = b[i] + 5.
S2 == b[i+3] = a[i] + 10.
There exists a loop carried true dependence from S1 to S2 due to a[i+2] and a[i]. Also, there exists a loop carried true
dependence from S2 to S1 due to b[i+3] and b[i]. Therefore, the statements S1 and S2 are cyclically interlocked due to true dependence,
rendering it incapable of being interchanged, distributed, parallelized or vectorized.
The statement S1 is a[i+1] = a[i] + 2.
There is a loop carried true dependence. But predictive commoning
extracts the value of a[1] in a scalar variable and then reduces
the computation to a[i+1] = *some scalar*
Therefore the code can be parallelized. The code will still not be
vectorized because predcom pass happens after vectorization.
We can stop the parallelization by disabling predictive commoning.
Solutions (Level 2)
The statement is a[i] = a[i-4] + 5.
This is a read after write dependency, i.e true dependence.
However, the vectorization factor is 4, and therefore the loop
statement can be vectorized, as the conflict won't arise.
If you change a[i-4] to a[i-3], then the statement can't be
vectorized. Infact any value less than 4 will cause read write
conflict, and prevent vectorization.
Therefore, if the values are accessed beyond the vectorization
factor, the statement can be vectorized.
Again, a[i] = a[i-1] has true dependency.
But the second statement, b[i] = c[i] does not have any
dependence. It can be separated out of the loop into a separate
loop by loop distrubution, and then the statement S2 can be
independently vectorized.
In C, the arrays are stored in Row Major. Therefore, having
j as the innermost loop's induction variable, and the arrays
being accessed with j as the row is not beneficial.
We can interchange the loop, and make i as the innermost loop
induction variable, thereby accessing the array elements of
the same row first.
This is a simple C code with conditional branch. It can be
parallelized. But the data dependence tests for Lambda frame-
work fail to deal with conditional code, and therefore the
code is not parallelized.
However, the data dependence test for graphite finds out that
the statement can be parallelized, thereby parallelizing the
conditional loop.
The statement a[i][j] = **(a + 4*j) induces a dependency.
However, the data dependence tests of Lambda framework don't
deal with pointer arithmetic, and therefore end up parallelizing
the loop.
But Graphite framework successfully determines that a[i][j]
and **a belong to the same alias set, and therefore there
exists a dependency in the statement. Hence it does not
parallelize the loop.
The stride is i+=2. The code can't be parallelized. Since the
stride is more than 1, Lambda framework can't parallelize it.
But Graphite framework successfully parallelizes it.
If you try to interchange the loops, the interchange won't be
successful. Inspect the complete dump, when chrecs are being
created. You will see that the scalar evolution fails because
the expression i*j is not affine.
|
|