Interpreted His

Thursday, November 30, 2006

Branch predictor

In computer architecture, a '''branch predictor''' is the part of
a processor that determines whether a conditional branch in the
instruction flow of a program is likely to be taken or not. This
is called '''branch prediction'''. Branch predictors are crucial in
modern superscalar processors for achieving high
performance. They allow processors to fetch and execute instructions
without waiting for a branch to be resolved.

Almost all pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Many earlier microprogrammed CPUs did not do branch prediction because there was little or no performance penalty for altering the flow of the instruction stream.

Branch prediction is not the same as branch target prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch target prediction attempts to guess the target of a branch or unconditional jump before that target has been computed by decoding the instruction itself.

Trivial prediction
The early implementations of SPARC and MIPS
(two of the first commercial RISC architectures) did trivial
branch prediction: they always predicted that a branch (or unconditional jump) would not be taken, so they
always fetched the next sequential instruction. Only when the branch
or jump was evaluated did the instruction fetch pointer get set to
a nonsequential address.

Both CPUs evaluated branches in the decode stage and had a single
cycle instruction fetch. As a result, the branch target recurrence
was two cycles long, and the machine would always fetch the
instruction immediately after any taken branch. Both architectures
defined branch delay slots in order to utilize these fetched
instructions.

Static prediction
Processors that implement static prediction predict that backward-pointing branches will be taken (assuming that the backward branch is the bottom of a program loop), and that forward-pointing branches will not be taken (assuming they are early exits from the loop or jumps to other processing code).

For a loop that executes many times, this only mispredicts the very last branch of the loop.
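
The rule above can be sketched in a few lines. This is only an illustration: the function names and the addresses are hypothetical, and a branch is treated as "backward" simply when its target address is not above its own address.

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Predict taken (True) for backward branches, not taken for forward ones."""
    return target_pc <= branch_pc


def count_loop_mispredictions(iterations: int) -> int:
    """Count mispredictions for one loop-closing backward branch."""
    branch_pc, target_pc = 0x1040, 0x1000  # hypothetical addresses
    mispredictions = 0
    for i in range(iterations):
        # The loop branch is taken on every iteration except the last one.
        taken = i < iterations - 1
        if static_predict(branch_pc, target_pc) != taken:
            mispredictions += 1
    return mispredictions
```

For a loop of 100 iterations, count_loop_mispredictions(100) counts a single misprediction, the loop exit.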

Static prediction is used as a fall-back technique (when there isn't any information for dynamic predictors to use) in most processors with dynamic branch prediction. Both the Motorola MPC7450 (G4e) and the Intel Pentium 4 use this technique http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/4.

Next line prediction
Some superscalar processors (MIPS R8000, DEC Alpha EV6 and EV8) fetched
with each line of instructions a pointer to the next line. This
next line predictor is not directly comparable to the other predictors
listed here, because it handles both branch target prediction and
branch direction prediction.

When a next line predictor points to aligned groups of 2, 4 or 8
instructions, the branch target will usually not be the first
instruction fetched, and so the initial instructions fetched are
wasted. Assuming for simplicity a uniform distribution of branch
targets, 0.5, 1.5, and 3.5 instructions fetched are discarded,
respectively.

Since the branch itself will generally not be the last instruction in
an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again assuming a uniform distribution
of branch instruction placements, 0.5, 1.5, and 3.5 instructions
fetched are discarded.

The discarded instructions at the branch and destination lines add up
to nearly a complete fetch cycle, even for a single-cycle next-line
predictor.
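
The figures quoted above follow from a simple average. The sketch below reproduces them under the same uniform-distribution assumption; the function names are hypothetical.

```python
from fractions import Fraction


def expected_discarded(group_size: int) -> Fraction:
    """Mean number of wasted slots on one side of a taken branch."""
    # A position drawn uniformly from slots 0..group_size-1 wastes k slots
    # when it lands in slot k, so the mean is (group_size - 1) / 2.
    return Fraction(sum(range(group_size)), group_size)


def expected_total_discarded(group_size: int) -> Fraction:
    """Slots wasted after the branch plus slots wasted before the target."""
    return 2 * expected_discarded(group_size)
```

For groups of 4 instructions this gives 1.5 discarded slots on each side, or 3 of the 4 slots in one fetch group in total.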

Bimodal branch prediction
A ''bimodal'' branch predictor has a table of two-bit saturating counters,
indexed with the least significant bits of the instruction addresses. Unlike the
instruction cache, bimodal predictor entries typically do not have
tags, and so a particular counter may be mapped to different branch instructions
(this is called branch interference or branch aliasing), in which case it is
likely to be less accurate. Each counter has one of four states:

*Strongly not taken
*Weakly not taken
*Weakly taken
*Strongly taken

When a branch is evaluated, the corresponding counter is updated.
Branches evaluated as not taken decrement the state towards strongly
not taken, and branches evaluated as taken increment the state towards
strongly taken. The primary benefit of this two bit saturating
counter scheme is that loop closing branches are always predicted
taken. A one-bit scheme (like that of the R8000) mispredicts both the first
and last branch of a loop. A two-bit scheme mispredicts just the last
branch. Similarly, on heavily biased branches which almost always go one way, a one-bit scheme mispredicts twice for each branch that goes the unusual way, and a
two-bit scheme mispredicts once.
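
A minimal sketch of such a counter table follows, so that the one-bit and two-bit behaviour on a loop can be compared. The class and function names are hypothetical, and two details are assumptions rather than anything stated here: instructions are taken to be 4 bytes long (so the two lowest address bits are dropped), and every counter starts just below the taken threshold.

```python
class CounterPredictor:
    """Untagged table of saturating counters indexed by low address bits."""

    def __init__(self, counter_bits: int = 2, index_bits: int = 10):
        self.max_value = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)  # predict taken at or above
        self.mask = (1 << index_bits) - 1
        # Assumption: start every counter just below the taken threshold.
        self.counters = [self.threshold - 1] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so drop the two lowest bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.max_value, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def loop_mispredictions(predictor, pc: int, trip_count: int, passes: int):
    """Run a loop-closing branch `passes` times; return misses per pass."""
    results = []
    for _ in range(passes):
        missed = 0
        for i in range(trip_count):
            taken = i < trip_count - 1  # falls through on the last iteration
            if predictor.predict(pc) != taken:
                missed += 1
            predictor.update(pc, taken)
        results.append(missed)
    return results
```

With a 10-iteration loop, the two-bit table misses twice on the first pass (warm-up plus exit) and once per pass afterwards, while the one-bit table misses twice on every pass.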

On the SPEC'89 benchmarks, very large bimodal predictors saturate at
93.5% correct, once every branch maps to a unique counter.

Because the bimodal counter table is indexed with the instruction
address bits, a superscalar processor can split the table into
separate SRAMs for each instruction fetched, and fetch a prediction
for every instruction in parallel with fetching the instruction, so
that the branch prediction is available as soon as the branch is
decoded.

Local branch prediction
Bimodal branch prediction mispredicts the exit of every loop. For
loops which tend to have the same loop count every time (and for many other branches with repetitive behavior), we can do much better.

Local branch predictors keep two tables. The first table is the local branch
history table. It is indexed by the low-order bits of the branch
instruction's address, and it records the taken/not-taken history of the n
most recent executions of each branch.

The other table is the pattern history table. Like the bimodal predictor, this table contains bimodal counters; however, its index is generated from the branch history in the first table. To predict a branch, the branch history
is looked up, and that history is then used to look up a bimodal
counter which makes a prediction.
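
The two-table lookup can be sketched as follows. The names, table sizes and initial counter values are assumptions made for illustration, as is the 4-byte instruction size used to form the index.

```python
class LocalPredictor:
    """Two-level predictor: per-branch history selects a two-bit counter."""

    def __init__(self, history_bits: int = 4, index_bits: int = 6):
        self.history_mask = (1 << history_bits) - 1
        self.index_mask = (1 << index_bits) - 1
        self.histories = [0] * (1 << index_bits)  # local branch history table
        self.pht = [1] * (1 << history_bits)      # pattern history table

    def _slot(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so drop the two lowest bits.
        return (pc >> 2) & self.index_mask

    def predict(self, pc: int) -> bool:
        history = self.histories[self._slot(pc)]  # first lookup
        return self.pht[history] >= 2             # second lookup

    def update(self, pc: int, taken: bool) -> None:
        slot = self._slot(pc)
        history = self.histories[slot]
        counter = self.pht[history]
        self.pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the newest outcome into this branch's history.
        self.histories[slot] = ((history << 1) | int(taken)) & self.history_mask


def loop_mispredictions(predictor, pc: int, trip_count: int, passes: int):
    """Run a loop-closing branch `passes` times; return misses per pass."""
    results = []
    for _ in range(passes):
        missed = 0
        for i in range(trip_count):
            taken = i < trip_count - 1
            if predictor.predict(pc) != taken:
                missed += 1
            predictor.update(pc, taken)
        results.append(missed)
    return results
```

For a loop with a fixed count of 4 and a 4-bit history, every position in the loop sees a distinct history, so once the counters have warmed up even the loop exit is predicted correctly.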

On the SPEC'89 benchmarks, very large local predictors saturate at
97.1% correct.

Local prediction is slower than bimodal prediction because it requires two
sequential table lookups for each prediction. A fast implementation
would use a separate bimodal counter array for each instruction
fetched, so that the second array access can proceed in parallel with
instruction fetch. These arrays are not redundant, as each counter is
intended to store the behavior of a single branch.

Global branch prediction

Global branch predictors make use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches. We can keep a single
shift register updated with the recent history of every branch
executed, and use this value to index into a table of bimodal
counters. This scheme, by itself, is only better than the bimodal
scheme for large table sizes, and is never as good as local
prediction.

If, instead, we index the table of bimodal counters with the recent
history concatenated with a few bits of the branch instruction's
address, we get the ''gselect'' predictor. Gselect does better than local
prediction for small table sizes, and local prediction is only slightly
better for table storage larger than 1KB.

We can do slightly better than gselect by XORing the branch
instruction address with the global history, rather than
concatenating. The result is ''gshare'', which is a little better than
gselect for tables larger than 256 bytes.
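
The two index functions differ only in how the address and the history are mixed. A minimal sketch, with hypothetical names and an assumed 4-byte instruction size:

```python
def gselect_index(pc: int, history: int, addr_bits: int, hist_bits: int) -> int:
    """Concatenate global history bits with low-order address bits."""
    addr = (pc >> 2) & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (hist << addr_bits) | addr


def gshare_index(pc: int, history: int, index_bits: int) -> int:
    """XOR the global history with the low-order address bits."""
    return ((pc >> 2) ^ history) & ((1 << index_bits) - 1)


class GsharePredictor:
    """Table of two-bit counters indexed by address XOR global history."""

    def __init__(self, index_bits: int = 10):
        self.index_bits = index_bits
        self.counters = [1] * (1 << index_bits)
        self.history = 0  # shift register of recent branch outcomes

    def predict(self, pc: int) -> bool:
        return self.counters[gshare_index(pc, self.history, self.index_bits)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = gshare_index(pc, self.history, self.index_bits)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.index_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run(predictor, pc: int, outcomes) -> list:
    """Return a list of booleans, True where the prediction was wrong."""
    missed = []
    for taken in outcomes:
        missed.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return missed
```

With the same number of index bits, gshare lets every history bit and every address bit influence the index, whereas gselect has to give up some of one to make room for the other. On a branch that strictly alternates between taken and not taken, the history alone is enough for the gshare table to settle on correct predictions after a short warm-up.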

On the SPEC'89 benchmarks, very large gshare predictors saturate at
96.6% correct, which is just a little worse than large local
predictors.

Gselect and gshare are easier to make fast than local prediction,
because they require a single table lookup per branch. As with
bimodal prediction, the table can be split so that parallel lookups
can be made for each instruction fetched, so that the table lookup can
proceed in parallel with instruction load.

Combined branch prediction

Scott McFarling proposed combined branch prediction in his 1993
paper http://citeseer.nj.nec.com/mcfarling93combining.html.
''Combined branch prediction'' is about as accurate as local
prediction, and almost as fast as global prediction.

Combined branch prediction uses three predictors in parallel: bimodal,
gshare, and a bimodal-like predictor to pick which of bimodal or
gshare to use on a branch-by-branch basis. The choice predictor is
yet another 2-bit up/down saturating counter, whose MSB
chooses the prediction to use. The choice counter is updated
whenever the bimodal and gshare predictions disagree, to favor
whichever predictor was actually right.
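
A minimal sketch of this arrangement is given below. The names, table sizes and initial values are assumptions, as are the choices to index the choice table by branch address and to let a set MSB select gshare.

```python
def bump(counter: int, up: bool) -> int:
    """Move a two-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class CombinedPredictor:
    """Bimodal and gshare components with a two-bit choice counter."""

    def __init__(self, index_bits: int = 10):
        size = 1 << index_bits
        self.mask = size - 1
        self.bimodal = [1] * size
        self.gshare = [1] * size
        # Assumption: choice counters are indexed by branch address, and a
        # set MSB (value 2 or 3) selects gshare.
        self.choice = [1] * size
        self.history = 0

    def _indices(self, pc: int):
        addr = pc >> 2  # assumption: 4-byte instructions
        return addr & self.mask, (addr ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        b, g = self._indices(pc)
        if self.choice[b] >= 2:
            return self.gshare[g] >= 2
        return self.bimodal[b] >= 2

    def update(self, pc: int, taken: bool) -> None:
        b, g = self._indices(pc)
        bimodal_said = self.bimodal[b] >= 2
        gshare_said = self.gshare[g] >= 2
        # Train the chooser only when the two components disagree.
        if bimodal_said != gshare_said:
            self.choice[b] = bump(self.choice[b], gshare_said == taken)
        self.bimodal[b] = bump(self.bimodal[b], taken)
        self.gshare[g] = bump(self.gshare[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(predictor, pc: int, outcomes) -> list:
    """Return a list of booleans, True where the prediction was wrong."""
    missed = []
    for taken in outcomes:
        missed.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return missed
```

On a strictly alternating branch the bimodal component is wrong every time, so the choice counter drifts toward gshare and the combined prediction becomes correct once the gshare counters have warmed up.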

On the SPEC'89 benchmarks, such a predictor is about as good as the
local predictor.

Another way of combining branch predictors is to have e.g. 3 different
branch predictors, and merge their results by a majority vote.
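
The vote itself is simple to express. In this sketch the function name is hypothetical, and an odd number of component predictions is assumed so that ties cannot occur.

```python
def majority_vote(predictions) -> bool:
    """Return True (predict taken) when more than half the inputs say taken."""
    predictions = list(predictions)
    return 2 * sum(bool(p) for p in predictions) > len(predictions)
```

For example, majority_vote([True, False, True]) predicts taken.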

Predictors like gshare use multiple table entries to track the
behavior of any particular branch. This multiplication of entries
makes it much more likely that two branches will map to the same table
entry (a situation called aliasing), which in turn makes it much more
likely that prediction accuracy will suffer for those branches. Once
you have multiple predictors, it is beneficial to arrange that each
predictor will have different aliasing patterns, so that it is more
likely that at least one predictor will have no aliasing. Combined
predictors with different indexing functions for the different
predictors are called ''gskew'' predictors, and are analogous to
skewed caches used for data and instruction caching.

Agree prediction

Another technique to reduce destructive aliasing within the pattern
history tables is an ''agree predictor''. Some method is used to
establish a relatively static prediction for the branch, perhaps a
bimodal predictor or hint bits within the branch instruction. Another
predictor (e.g. a gskew predictor) makes predictions, but rather than
predicting taken/not-taken, the predictor predicts agree/disagree with
the base prediction.

The intention is that if branches covered by the gskew predictor tend
to be a bit biased in one direction, perhaps 70%/30%, then all those
biases can be aligned so that the gskew pattern history table will
tend to have more agree entries than disagree entries. This reduces
the likelihood that two aliasing branches would best have opposite
values in the pattern history table (PHT).
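
The effect can be sketched with a deliberately tiny table. Everything named here is hypothetical. As an assumption, the base prediction is a per-branch bias bit set from the first observed outcome (standing in for a bimodal predictor or hint bits), and the agree/disagree counters are indexed gshare-style.

```python
class AgreePredictor:
    """Counters predict agree/disagree with a per-branch base prediction."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        # Assumption: the base prediction is the first observed outcome.
        self.bias = {}
        self.pht = [2] * (1 << index_bits)  # start at weakly agree
        self.history = 0

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        base = self.bias.get(pc, False)  # unseen branches default to not taken
        agree = self.pht[self._index(pc)] >= 2
        return base if agree else not base

    def update(self, pc: int, taken: bool) -> None:
        base = self.bias.setdefault(pc, taken)
        i = self._index(pc)
        agreed = taken == base
        self.pht[i] = min(3, self.pht[i] + 1) if agreed else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_mispredictions(predictor, trace) -> int:
    """`trace` is a sequence of (pc, taken) pairs."""
    missed = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            missed += 1
        predictor.update(pc, taken)
    return missed
```

With index_bits=0 every branch shares one counter. An always-taken branch interleaved with a never-taken branch would fight over a taken/not-taken counter, but here both simply "agree", so only the very first execution of the taken branch is mispredicted.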

Agree predictors work well with combined predictors, because the
combined predictor usually has a bimodal predictor which can be used
as the base for the agree predictor. Agree predictors do less well
with branches that are not biased in one direction, if that causes the
base predictor to give changing predictions. So an agree predictor
may work best as part of a three-predictor scheme, with one agree
predictor and another non-agree type predictor.

Overriding branch prediction

The EV6 and EV8 cores used a fast single-cycle next line predictor to
handle the branch target recurrence and provide a simple and fast
branch prediction. Because the next line predictor is so inaccurate,
and the branch resolution recurrence takes so long, both cores have
two-cycle secondary branch predictors which can override the
prediction of the next line predictor at the cost of a single lost
fetch cycle.

Since a fetch of 4 instructions or more may contain more than one
branch, an overriding predictor will sometimes have to predict a next
PC which is neither the fall-through nor the earlier predicted next
line. The overriding predictor usually extracts the target address
from the instruction bytes themselves since they are available by the
time the predictor must generate a predicted next fetch address.

History

Microprogrammed processors took multiple cycles per instruction, and
generally did not require branch prediction. The VAX 9000 was both
microprogrammed and pipelined, and probably did some branch
prediction.

The first commercial RISC processors, MIPS and SPARC, did only trivial
"not-taken" branch prediction. Because they used branch delay slots,
fetched just one instruction per cycle, and executed in-order, there
was no performance loss. Later, the R4000 used the same trivial
"not-taken" branch prediction, and lost two cycles to each taken
branch because the branch resolution recurrence was four cycles long.

Branch prediction became more important with the introduction of
pipelined superscalar processors like the Intel Pentium, DEC Alpha
21064, the MIPS R8000, and the IBM Power series. These processors all
relied on one-bit or simple bimodal predictors.

The DEC Alpha EV6 http://citeseer.ist.psu.edu/seznec02design.html uses a next-line predictor overridden by a combined
local predictor and global predictor, where the combining choice is
made by a bimodal predictor.

The AMD K8 has a combined bimodal and global predictor, where the
combining choice is another bimodal predictor. This processor caches
the base and choice bimodal predictor counters in bits of the L2 cache
otherwise used for ECC. As a result, it has effectively very large
base and choice predictor tables, and parity rather than ECC on
instructions in the L2 cache. Parity is just fine, since any instruction
suffering a parity error can be invalidated and refetched from memory.

The Alpha EV8 http://citeseer.ist.psu.edu/seznec02design.html (cancelled late in design) had a minimum branch
misprediction penalty of 14 cycles. It was to use a complex but fast
next line predictor overridden by a combined bimodal and
majority-voting predictor. The majority vote was between the bimodal
and two gskew predictors.

External links

http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/4 - ArsTechnica article on the Pentium 4.

http://citeseer.nj.nec.com/mcfarling93combining.html - Combining Branch Predictors -
McFarling - 1993 - introduces combined predictors. Very good.

http://citeseer.nj.nec.com/109851.html - Multiple-Block Ahead Branch Predictors -
Seznec et al - 1996 - demonstrates prediction accuracy is not impaired by
indexing with previous branch address.

http://citeseer.ist.psu.edu/seznec02design.html - Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor -
Seznec et al - 2002 - describes the Alpha EV8 branch predictor. This
paper does an excellent job discussing how they arrived at their
design from various hardware constraints and simulation studies. Very good.

http://citeseer.ist.psu.edu/jimenez03reconsidering.html - Reconsidering Complex Branch Predictors -
Jimenez - 2003 - describes the EV6 and K8 branch predictors, and
pipelining considerations.


Tag: Computer terminology
Tag: Computer architecture