1.3 Examples: Simple Cases
Case0: degenerate store
public static void main(java.lang.String[])
{
r0 := @parameter0;
load.r r0;
staticinvoke runBenchmark;
store.l $l0;
return;
}
reduces to:
public static void main(java.lang.String[])
{
r0 := @parameter0;
load.r r0;
staticinvoke runBenchmark;
pop;
return;
}
|
Case1: simple sl pair
public static long getCachingtime()
{
staticinvoke
store.l $l0;
load.l $l0;
return.l;
}
reduces to:
public static long getCachingtime()
{
staticget ;
return.l;
}
|
Case2: simple sll triple
...
store.i $i7;
load.i $i7;
load.i $i7;
fieldget
...
reduces to:
...
dup1
fieldget
...
|
Remarks:
- The resulting code in case0 can further be simplified by appliying peephole patterns.
- Case 1 is the pattern most often encountered in the generated Baf code.
- Case 2 in the above form hardly ever (if ever) occurs in the generated Baf Code.
- Often the store and load instructions of cases 1 and 2 are seperated by sequences of bytecodes and thus cannot be so
easily eliminated. The effect of these interleaving bytecodes sequences on the stack must first be determined.
1.4 Interleaved sl pairs and Interleaved sll triples
As hinted in the previous slide, most sl pairs and sll triples have their store and load instructions
interleaved with sequences of instructions which makes their reduction more difficult to ascertain.
Example:
new spec.io.FileInputStream;
store.r $r3;
load.r $r3;
load.r r1;
specialinvoke ;
load.r $r3;
store.r r2;
load.r r2;
virtualinvoke getContentLength;
store.i i3;
load.i i3;
newarray;
store.r $r4;
load.r r0;
load.r $r4;
fieldput spec.benchmarks._201_compress.Harness.orig_text_buffer;
load.i i3;
newarray;
store.r $r5;
load.r r0;
load.r $r5;
fieldput spec.benchmarks._201_compress.Harness.comp_text_buffer;
|
To identify the possible reductions in the above code, the effect on the stack of the interleaving sequences must be determined.
Also many of these sequences will affect the stack in such a way that
no reductions will be possible without some reordering of instructions.
2. Canonical Examples and Solutions
2.1 Interleaving Sequences
The pathological reduction cases as illustrated previously, occur when the store and load(s) are
interleaved by a sequence of instructions.
Two pieces of information characterise interleaving sequences:
- The net effect on the stack height after executing such a sequence. This shall
be refered to as the net stack height variation or nshv.
- The minimum stack height variation attained while executing such a sequence. This
shall be refered to as the minimum stack height variation or mshv.
A sequence of instructions having both nshv == 0 and mshv == 0 will
be refered to as a level sequence.
2.2 Interleaving Sequences Having Side Effects on the Stack
Consider:
load.i i0;
load.i i2;
add.i;
store.i $i3;
push 0;
store.i i5;
load.r r1;
load.i $i3;
arrayread.b;
store.i i4;
|
Here the the interleaving sequence is
load.r r1;
which leaves an
item on the stack. Thus the store/load pair cannot be eliminated so easily for obvious reasons.
Incorrect:
load.i i0;
load.i i2;
add.i;
push 0;
store.i i5;
load.r r1;
arrayread.b; <== reference on top of integer.
store.i i4;
|
Remarks:
- Of course a swap instruction could be used here, but that would add an instruction, would apply only to type 1 computational
types, and would only be valid only for nshv = 1;
- A more general heuristic used to deal with such situations is simply to try to relocate an instruction
in the sequence having a positive stack height side effect, to a previous location in the block, before the
store instruction.
Relocating instructions
Consider:
load.i i0;
load.i i2;
add.i;
store.i $i3;
push 0;
store.i i5;
load.r r1;
load.i $i3;
arrayread.b;
store.i i4;
|
Issues involved in relocating a candidate:
Going back to our previous example,
load.i i0;
load.i i2;
add.i;
store.i $i3;
push 0;
store.i i5;
load.r r1;
load.i $i3;
arrayread.b;
store.i i4;
|
- The second candidate to relocate in the sequence is load.r r1;.
- We need to find a insertion slot in the block such that
the sequence of instruction between this slot and the candidate to move forms a level sequence.
- The first such slot in this example is just after store.i $i3;.
The block can thus be rearranged as follows:
load.i i0;
load.i i2;
add.i;
store.i $i3;
load.r r1;
push 0;
store.i i5;
load.i $i3;
arrayread.b;
store.i i4;
|
Of course this new ordering does not solve our original problem as the relocation slot was chosen in the store/load sequence. We need
to find a slot occuring before the store.
The second and only valid slot occurs before the first instruction, we get:
load.r r1;
load.i i0;
load.i i2;
add.i;
store.i $i3;
push 0;
store.i i5;
load.i $i3;
arrayread.b;
store.i i4;
|
Now the interleaving sequence is level.so the store/load reduction can occur according to
the criteria previously presented.
Final result:
load.r r1;
load.i i0;
load.i i2;
add.i;
push 0;
store.i i5;
arrayread.b;
store.i i4;
|
Moving the store next to the load
When the interleaving sequence has a mshv < 0,
the previous techniques will not work as we cannot leave the value on the
stack (as it would then be eaten up by the sequence).
Consider:
add.i;
staticput spec.io.FileInputStream.num_cached_files;
staticinvoke currentTimeMillis;
staticget spec.io.FileInputStream.Cachingtime;
store.l $l8;
load.l l6;
sub.l;
store.l $l9;
load.l $l8;
load.l $l9;
|
In this situation we can try to move the smallest level-block, if any,
ending with the store, inorder to put it just before the load.
In the above example the level block:
staticget spec.io.FileInputStream.Cachingtime;
store.l $l8;
|
can be moved just before load.l $l8;. Thus yeilding,
add.i;
staticput spec.io.FileInputStream.num_cached_files;
staticinvoke currentTimeMillis;
load.l l6;
sub.l;
store.l $l9;
staticget spec.io.FileInputStream.Cachingtime;
store.l $l8;
load.l $l8;
load.l $l9;
|
And now of course the the sl pair can be reduced.
add.i;
staticput spec.io.FileInputStream.num_cached_files;
staticinvoke currentTimeMillis;
load.l l6;
sub.l;
store.l $l9;
staticget spec.io.FileInputStream.Cachingtime;
load.l $l9;
|
More complicated sll triple optimizations
Consider:
load.r r0;
dup1.r;
fieldget spec.benchmarks._201_compress.Comp_Base.free_ent;
store.i $i7;
load.i $i7;
push 1;
add.i;
fieldput spec.benchmarks._201_compress.Comp_Base.free_ent;
load.i $i7;
virtualinvoke set;
|
- In the interleaving sequence has negative minimum stack height.
- Observe that nshv == -2 and mshv == -2.
- We can use a dup1_x1 instruction here
Result:
load.r r0;
dup1.r;
fieldget spec.benchmarks._201_compress.Comp_Base.free_ent;
dup1_x1.i_r;
push 1;
add.i;
fieldput spec.benchmarks._201_compress.Comp_Base.free_ent;
virtualinvoke set;
|
2. The Reduction Algorithm
2.1 Design Goals
- Genericity: strive to the catch all the possible reductions, while making the least number assumptions on the
structure of the baf code (as opposed to peepholes).
- Simplicity: break down the optimization in small step that are easy to understand, prove correct and are extendable.
- Speed: The optimization must run in a reasonnable amount of time.
2.2 Algorithm Overview
- The unit over which the algorithm applies is a Baf method.
- Given a Baf method, the algorithm performs a fixed point iteration until either
no reductions are performed or no reordering of the method's instructions has occured.
- For a given method, a list of it's store instruction is computed and the target cases are identified.
- The target cases are analysed and accordingly reductions and/or code reordering are possibly preformed for
each these.
2.2.1 Identifing the Target Cases
The following relevent cases must be identified:
- Degenerate stores
- sl pairs
- sll triples
This is done by building a list of all the store instructions in the target method and then
verifying for a given store, if it belongs to one of the above 3 categories.
If the store belongs to the first category, it is simply replaced by a pop.
If we find a sl pair or a sll pair then more consideration must be given.
2.2.2 The reduction of sl pairs and sll triples
The pathological reductions cases as illustrated previously, occur when the store and loads are
interleaved by a sequence of instructions.
Once a sl pair is identified, the algorithm used to reduce it is:
While reordering occured Do
If the interleaving sequence is a level sequence then
eliminate the pair.
Else if mshv < 0 then
try to identify and move a level sequence ending with the store
and insert it just before the load instruction.
Else if mshv == -1 && nshv == -1 then
if the stack value consumed comes from a dup then eliminate.
Else nshv > 0 and mshv >= 0 then
try to move an instruction belonging to the interleaving
sequence to a somewhere before the store instruction in the block.
Done
|
Once a sll triple is identified, the algorithm used to reduce it is:
|
Consider the store and the first load as an sl pair and apply a variation of the sl pair algorithm,
where the load is moved next to the store in the cases where the sl pair would be eliminated.
If the algorithm succeeds (ie. the first load now occurs immediately after the store), then
Consider the store and the second load. (Note that the first load will
now be in the interleaving sequence).
While reordering occured do
If mshv == 0 && nshv == 0 then
replace the store by a dup1 instruction and remove the 2 loads.
Else If mshv == -1 && nshv == -1 then
replace the store by a dup1 instruction and remove the 2 loads.
Else If mshv == -2 && nshv == -2 then
replace the store by a dup1_x1 instruction and remove the 2 loads.
Else If mshv < -2 then
give up
Else
try to move an instruction belonging to the interleaving sequence to
somewhere before the store instruction in the block.
Done
|
note:
In the above algorithms, reordering occured means that an instruction was succesfully moved in the
final else clause of of each algorithm respectively.
2.3 Code Specific Details
The algorithm is implemented by the methods
optimizeStoreLoads
and
stackIndependent,
which make use of a few helper methods notably :
Here is a prosaic description of the steps in it's implementation:
OptimizeStoreLoads does the following as a fixed point iteration:
- A list of all the stores in the method body is computed and the method then iterates over this list.
- For each store the uses of this store instruction are found.
- If the store has no uses then the degenerate store action is taken.
- If the store has 1 or 2 uses then if these are loads and are in the
same basic block then the sl action or the sll action is taken
- If an action succeded or something changed the fixed point iteration repeats.
For the sl and sll actions:
- The optimizeStoreLoads method implements the logic of these algorithms, while
stackIndependent implements the simulation of the interleaving sequences
and computes the values for mshv and nshv.
- More precisely optimizeLoadStore calls stackIndependent for a given store/load pair
and specifies it's context.
- If the load/store pair is sl pair then the context specified is STORE_LOAD_ELIMINATION. Otherwise the
store/load pair comes from a sll triple and the context specified is STORE_LOAD_LOAD_ELIMINATION.
- The method stackIndependent behaves according to the specified context, in effect implementing either the
sl algorithm or the sll algorithm.
- For the implementation of the Else clause in the sl and sll algorithm, stackIndependent calls
tryToMoveUnit.
- In implementating the Else If in the sl algorithm, stackIndependent calls pushStoreToLoad
- The outcome returned by stackIndependent to optimizeStoreLoads is one of the following:
- FAILURE when store load elimination is not possible in any form.
- SUCCESS when the store and load in of a store/load pair have are now next to each other (no interveaving sequence).
- MAKE_DUP when a sll triple can be replaced by a dup1 instruction.
- MAKE_DUP_X1 when sll triple can be replaced by a dup1_x1 instruction
- HAS_CHANGED when store load elimination is not possible in any form, but some unit reordering has occured.
- optimizeLoadStore then performs the action described by the above return codes.
3. Inter Block Optimizations
- All the optimizations introduced up to now required as a pre-condition that the store and load(s)
involved be contained in the same block.
- The transformation into Baf produces many stack variables that span across blocks.
- The vast majority of these can be caught and reduced using 2 simple inter-block optimizations.
Consider:
block1:
load.r r0
store.r $r3;
load.r r0;
fieldget ;
ifeq label2;
block2:
push 257;
store.i $i8;
goto label3;
block3:
push 256;
store.i $i8;
block4:
load.r $r3;
load.i $i8;
fieldput ;
|
- The stack variable $r3 is found in blocks 1 and 4.
- The stack variable $i8 is found in blocks 2, 3 and 4.
- This code derives from an simple if/else type construct, and as such occurs fairly often.
3.1 First Optimization
Description: Given 2 distinct blocks x and y, where x contains the store and y the load,
then if the set X of all successor blocks of x is the same as the set Y of all predecessor
blocks of y, allowing for y to be in Y if neccessary, then an attempt is made to relocate the
store into the y block.
At present this is done only if the the sequences from the store to the block tail in x
forms a level block as well as all the blocks contained in X.
If this is the case then the store is moved the head of block y.
This is the case in the previous example, where block1 corresponds to x and
block4 corresponds to y. The reordering yeilds:
block1:
load.r r0
load.r r0;
fieldget ;
ifeq label2;
block2:
push 257;
store.i $i8;
goto label3;
block3:
push 256;
store.i $i8;
block4:
store.r $r3;
load.r $r3;
load.i $i8;
fieldput ;
|
3.2 Second Optimization
Description: Given a block y containing a load, if this load is defined by 2 stores in blocks v
and w (v != w necessarily), and if both blocks v and w have as a unique successor block y, and
y has as unique predecessors blocks v and w, then an attempt is made to unite and relocate the
stores into the y block.
At present this is done only if the the sequences from the unit following both stores in blocks v and w to their
respective tails are level.
If this is the case then both stores are collapsed into one store inserted at the head of block y.
This is the case in the previous example, where y corresponds to block4, v to block2 and
w to block 3. The reordering yeilds:
block1:
load.r r0
load.r r0;
fieldget ;
ifeq label2;
block2:
push 257;
goto label3;
block3:
push 256;
block4:
store.i $i8;
store.r $r3;
load.r $r3;
load.i $i8;
fieldput ;
|
Now that the store/load pairs are all in the same basic block they will be caught and annihilated
by the standard optimizations and the final outcome will give:
block1:
load.r r0
load.r r0;
fieldget ;
ifeq label2;
block2:
push 257;
goto label3;
block3:
push 256;
block4:
fieldput ;
|
4. Wrap-up
- The optimizations presented here are all implemented by the class
LoadStoreOptimizer.
- The class is to be used as a singleton; the static method
v
is provided and returns a instance of the class.
- The only public method is
optimize
it takes a Body as an argument and performs all the optimizations
- The method
optimize
preforms a fixed point iteration calling in turn the methods optimizeStoreLoads and
doInterBlockOptimizations.
Many new optimizations could be added to the ones presented here. A few have even been implementated, but still
need to be tested to mesure their effectiveness on the runtime. Here are a few pointers:
- Often a store/load pair is followed by many more loads. These following loads come from dups under javac.
Thus an optimization could convert them to dups and dup2s. In some cases this can dramatically reduce the local
variable storage needs.
- Many single pops could be aggregated as pop2s.
- Optimizations using the swap instruction could be introduced. A few cases have been identified that
have stood up to the present optimizations, but could be easily reduced using a swap based optimization.
- The current interblock optimizations should be generalized and extended.
-
propagateLoadsBackwards
An easy way to convert multiple loads into dups is to propagate backwards as far as possible all the loads
in a block. Doing this aggregates many loads in a block that were previously seperated by interleaving sequences
of instructions. Once this is achieved, converting them into dup's is trivial.
- propagateLoadsForward
After calling propagateLoadsBackwards, (and possibly converting them to dups), the block's stack height
can soar enormously. A call to propagateLoadsForward will undo the effect of propagateLoadsBackward on the stack
height by propagating loads and dups as far forward as possible in the block.
- superDuper1
This converts a series of loads into dups
In particular, making use of the following sequence of method calls,
propagateLoadsBackwards
optimizeStoreLoads
propagateLoadsForward
can be usefull to reduce certain sl and sll patterns that have resisted the current optimazations.
Propagating the loads backwards can expose certain opportunities otherwise difficult to capture in a
genetric way.
Patrice POMINVILLE
Last modified: Wed Aug 25 13:14:20 EDT 1999