AMULET2e

Jim Garside
Department of Computer Science, The University of Manchester,
Oxford Road, Manchester, M13 9PL, U.K.
http://www.cs.man.ac.uk/amulet/
jgarside@cs.man.ac.uk

What is it?
- AMULET2e is an implementation of the ARM architecture
- It is a ‘system’ chip complete with cache, bus interface etc.
- The microprocessor contains performance-enhancing features
- It is completely asynchronous (i.e. self-timed)
Contents

❑ The ARM processor

❑ Asynchronous pipeline design

❑ The AMULET1 and AMULET2 microprocessor cores
  ❍ Some example problems and solutions

❑ The AMULET2e chip

❑ Results and analysis

❑ Conclusions
The ARM processor

General registers and Program Counter

- 16 visible registers at all times
- 2 private registers for each exception
- Private work registers for fast interrupt

<table>
<thead>
<tr>
<th>Register</th>
<th>User Mode</th>
<th>Fast Interrupt</th>
<th>Supervisor Mode</th>
<th>Abort Mode</th>
<th>Interrupt Mode</th>
<th>Undefined Mode</th>
</tr>
</thead>
<tbody>
<tr>
<td>r0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r6</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r8</td>
<td></td>
<td>r8_fiq</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r9</td>
<td></td>
<td>r9_fiq</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r10</td>
<td></td>
<td>r10_fiq</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r11</td>
<td></td>
<td>r11_fiq</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r12</td>
<td></td>
<td>r12_fiq</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>r13</td>
<td></td>
<td>r13_fiq</td>
<td>r13_svc</td>
<td>r13_abt</td>
<td>r13_irq</td>
<td>r13_und</td>
</tr>
<tr>
<td>r14</td>
<td></td>
<td>r14_fiq</td>
<td>r14_svc</td>
<td>r14_abt</td>
<td>r14_irq</td>
<td>r14_und</td>
</tr>
<tr>
<td>r15(PC)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
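
The banking rule in the bullets above can be sketched as a small lookup. This is only an illustration: the mode suffixes (`svc`, `abt`, `irq`, `und`) follow the usual ARM naming, and status registers are deliberately left out.

```python
# Private (banked) registers per mode, as stated on the slide:
# fast interrupt has private r8-r14, every other exception mode
# has private r13 and r14. User mode has none.
BANKED = {
    "user": set(),
    "fiq": set(range(8, 15)),
    "svc": {13, 14},
    "abt": {13, 14},
    "irq": {13, 14},
    "und": {13, 14},
}

def physical_register(mode: str, n: int) -> str:
    """Return the name of the physical register seen as r<n> in `mode`."""
    if not 0 <= n <= 15:
        raise ValueError("only r0..r15 are visible")
    if n in BANKED[mode]:
        return f"r{n}_{mode}"
    return f"r{n}"

def count_physical_registers() -> int:
    """Count distinct general registers (status registers not modelled)."""
    return len({physical_register(m, n) for m in BANKED for n in range(16)})
```

Only 16 names are visible in any one mode, but the sketch shows why the register file must hold more physical registers than that.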
The ARM processor

The ARM6 instruction set

<table>
<thead>
<tr>
<th>Instruction class</th>
<th>Fields, most significant first (bits 31 to 0)</th>
</tr>
</thead>
<tbody>
<tr>
<td>data ops</td>
<td>Cond | 00 | I | Opcode | S | Rn | Rd | Operand 2</td>
</tr>
<tr>
<td>multiply</td>
<td>Cond | 000000 | A | S | Rd | Rn | Rs | 1001 | Rm</td>
</tr>
<tr>
<td>swap mem/reg</td>
<td>Cond | 00010 | B | 00 | Rn | Rd | 0000 | 1001 | Rm</td>
</tr>
<tr>
<td>load/store reg</td>
<td>Cond | 01 | I | P | U | B | W | L | Rn | Rd | Offset</td>
</tr>
<tr>
<td>undefined</td>
<td>Cond | 011 | x ... x | 1 | x x x x</td>
</tr>
<tr>
<td>block reg transfer</td>
<td>Cond | 100 | P | U | S | W | L | Rn | register list</td>
</tr>
<tr>
<td>branch</td>
<td>Cond | 101 | L | offset</td>
</tr>
<tr>
<td>coproc L/S</td>
<td>Cond | 110 | P | U | N | W | L | Rn | CRd | CP# | offset</td>
</tr>
<tr>
<td>coproc op</td>
<td>Cond | 1110 | CPop | CRn | CRd | CP# | CP | 0 | CRm</td>
</tr>
<tr>
<td>coproc reg xfer</td>
<td>Cond | 1110 | CPop | L | CRn | Rd | CP# | CP | 1 | CRm</td>
</tr>
<tr>
<td>s/w interrupt</td>
<td>Cond | 1111 | ignored by processor</td>
</tr>
</tbody>
</table>

Bits are numbered 31 (left) to 0 (right); Cond occupies bits 31 to 28.
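
As an illustration of how the classes listed above are told apart, the following sketch classifies a 32-bit instruction word from its fixed bit fields. The function name and the returned strings are illustrative only; encodings that the slide does not list are reported as such rather than guessed.

```python
def classify(instr: int) -> str:
    """Name the encoding class of a 32-bit ARM instruction word."""
    top = (instr >> 25) & 0b111            # bits 27..25
    bit4 = (instr >> 4) & 1
    if top == 0b000 and (instr >> 4) & 0xF == 0b1001:
        # Bits 7..4 = 1001 separate multiply and swap from data operations.
        if (instr >> 22) & 0x3F == 0:
            return "multiply"
        if (instr >> 23) & 0x1F == 0b00010 and (instr >> 20) & 0b11 == 0:
            return "swap mem/reg"
        return "not listed on the slide"
    if top in (0b000, 0b001):
        return "data ops"
    if top == 0b010 or (top == 0b011 and not bit4):
        return "load/store reg"
    if top == 0b011:
        return "undefined"
    if top == 0b100:
        return "block reg transfer"
    if top == 0b101:
        return "branch"
    if top == 0b110:
        return "coproc L/S"
    # top == 0b111: bit 24 selects software interrupt, bit 4 the coprocessor form
    if (instr >> 24) & 1:
        return "s/w interrupt"
    return "coproc reg xfer" if bit4 else "coproc op"
```

The condition field (bits 31 to 28) plays no part in the classification, which is why the sketch never inspects it.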
ARM as a demonstrator

Why choose an ARM as a technology demonstrator?

- ARM is a real, commercial microprocessor architecture
  - realistic demonstrator
  - software available

- Low power operation is the primary motivation
  - ARM is a leading low power architecture

- There are difficult problems to solve
  - some complex instructions (e.g. load multiple)
  - interrupts
  - data aborts
Why Asynchronous?

- Power saving
  - Every logic transition requires some energy transfer
  - Fewer transitions means lower power
  - Don’t ‘clock’ any unit when it is not required

- Potentially fast
  - Allows cycle time to be data dependent rather than always ‘worst case’
  - Can optimise frequent instructions and allow rare difficult operations more time

- Composability
  - Circuits can be assembled as “plug and play”

- Reduced EMI
  - Radiated noise should be broad-band
How does it work?

- Each unit runs autonomously, processing at its own speed
- Interactions occur through defined interfaces
- Communication is localised and performed by handshaking
- Timing and pipeline depth are details of functional unit design
- Each block can be decomposed similarly into smaller functional units
Inspiration ...

Sutherland’s *Micropipelines* (Communications of the ACM, June 1989)

- Bundled data interface with transition signalling
4-phase bundled data protocol

- the choice of active edges is arbitrary
- recovery transitions have no ‘meaning’ and therefore no well-defined timing
- if recovery transitions follow the same path as the active transitions the control cycle time will be long
- various ‘data hold time’ protocols may be employed
- easier to make fast latches and control circuits
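
A minimal sketch of one possible 4-phase ordering is given below. It assumes the data bundle is driven before the request rises and is latched when the acknowledge rises; other data hold protocols, as the slide notes, are equally valid. All names are illustrative.

```python
def four_phase_transfer(value):
    """Yield the (signal, level) events of one 4-phase bundled-data transfer."""
    yield ("data", value)   # sender drives the bundle first
    yield ("req", 1)        # active edge: data is valid
    yield ("ack", 1)        # active edge: receiver has taken the data
    yield ("req", 0)        # recovery edge, carries no information
    yield ("ack", 0)        # recovery edge, both wires back at rest

def simulate(values):
    """Run a series of transfers; return (received values, control edges)."""
    req = ack = 0
    bus = None
    received = []
    control_transitions = 0
    for v in values:
        for signal, level in four_phase_transfer(v):
            if signal == "data":
                bus = level
                continue
            control_transitions += 1
            if signal == "req":
                req = level
            else:
                ack = level
                if ack == 1:
                    received.append(bus)   # latch on the active ack edge
        assert (req, ack) == (0, 0)        # return-to-zero after every transfer
    return received, control_transitions
```

Each transfer costs four control transitions, of which only two carry meaning; a 2-phase (transition signalling) scheme would use two per transfer.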
2-phase versus 4-phase

4-phase control circuits are easier to build

2-phase data selector

4-phase data selector

The 4-phase design also scales much more easily to ‘wider’ selectors
Micropipelines

A FIFO - the canonical micropipeline:

The FIFO has two important performance parameters:

- Latency - how fast will new data pass through an empty FIFO?
- Throughput - what is the maximum sustainable data rate?

These are not as tightly related as they are in a synchronous pipeline.
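
To illustrate why the two parameters are decoupled, here is a small timing sketch. It assumes identical stages, a source that always has data waiting and a sink that is always ready; the forward delay and stage cycle time are made-up parameters, not measurements.

```python
def fifo_exit_times(stages, forward_ns, cycle_ns, items):
    """Exit time of each item from an initially empty FIFO.

    forward_ns: time for data to ripple through one empty stage.
    cycle_ns:   minimum time between successive items at one stage.
    """
    done = [0.0] * stages          # time each stage last passed an item on
    exits = []
    for k in range(items):
        t = 0.0                    # every item is waiting at the input at t = 0
        for s in range(stages):
            earliest = t + forward_ns                         # forward ripple
            if k > 0:
                earliest = max(earliest, done[s] + cycle_ns)  # stage recovery
            done[s] = earliest
            t = earliest
        exits.append(t)
    return exits
```

In this model the first exit time (latency) is set by `stages * forward_ns`, while the spacing of later exits (throughput) is set by `cycle_ns` alone. In a clocked pipeline both would be fixed by the same clock period.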
Micropipelines

A FIFO with processing:

- Processing logic delays the forward data signals
- The “Request” signal should be delayed by the same time
  - Longer delays impact performance
  - Shorter delays reduce functionality!
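
The constraint above can be written as a simple check. The delay figures passed in are hypothetical; the point is only that the request delay must cover the slowest data bit.

```python
def check_matched_delay(data_path_delays_ns, request_delay_ns):
    """Return (safe, slack) for a bundled-data stage.

    safe is True when the delayed Request arrives no earlier than the
    slowest data bit; slack is the margin, which is pure lost performance.
    """
    worst_case = max(data_path_delays_ns)
    slack = request_delay_ns - worst_case
    return slack >= 0, slack
```

A negative slack means the next latch could capture data that has not yet settled, which is the loss of functionality the slide warns about.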
Delay modelling

Modelling a propagation delay with another propagation delay is potentially hazardous

- Several techniques are used:
  - Registers have a 33rd bit which always makes a transition
  - ALU (adder) has delay routed along the carry chain according to the input data
  - Dynamic circuits (e.g. PLAs) already self-timed

- The delay model is local:
  - Same size transistors with same loads
  - Same manufacturing conditions
  - Same operating conditions (temperature, voltage)
Data dependent delay examples

- Adder delay means that:
  - *typical* additions can be fast without elaborate fast carry logic
  - slower cycles for rare slow cases
  - increment is typically very fast
- Cache cycles faster on sequential cache line accesses
  - CAM look-up averted
  - RAM not precharged between cycles
- Multiplication
  - multiplier uses carry-save adders, local iteration and early termination
  - cycles much faster than other pipeline stages
  - no delay or power consumption when not in use
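
The adder claim can be explored with a short experiment that measures the longest carry chain for random operands. This is a sketch of a plain ripple-carry model, not of the AMULET adder itself; the sample count and seed are arbitrary.

```python
import random

def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Longest run of bit positions that a single carry ripples through."""
    longest = run = 0
    carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        generate = x & y
        propagate = x ^ y
        if carry and propagate:
            run += 1              # incoming carry ripples through this bit
        elif generate:
            run = 1               # a new carry starts here
        else:
            run = 0               # carry killed, or nothing to propagate
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest

def mean_longest_chain(samples: int = 10_000, width: int = 32, seed: int = 0) -> float:
    """Average longest carry chain over random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        total += longest_carry_chain(rng.getrandbits(width),
                                     rng.getrandbits(width), width)
    return total / samples
```

For uniformly random operands the average chain is far shorter than the 32-bit worst case (for example `0xFFFFFFFF + 1`), which is what lets a data-dependent adder complete typical additions quickly. Real program data is not uniformly random, so this is indicative only.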
AMULET1

AMULET1 was the first self-timed implementation of a commercial microprocessor architecture.

- It demonstrated that systems of this complexity were feasible
- It delivered about 67% of the performance of the comparable, synchronous part (ARM6) with a similar power consumption

But

- AMULET1 (1993) was first generation AMULET
- ARM6 (1992) was 4th generation ARM

Room for improvements in:

- Processor’s internal architecture
- Circuit implementation
AMULET1 Results

Results of tests on the prototype parts:

<table>
<thead>
<tr>
<th></th>
<th>AMULET1/CMP</th>
<th>AMULET1/GPS</th>
<th>ARM6</th>
</tr>
</thead>
<tbody>
<tr>
<td>Process</td>
<td>1µm ES2</td>
<td>0.7µm GPS</td>
<td>1µm</td>
</tr>
<tr>
<td>Area</td>
<td>5.5 x 4.1</td>
<td>3.9 x 2.9</td>
<td>4.1 x 2.7</td>
</tr>
<tr>
<td>Transistors</td>
<td>58,374</td>
<td>58,374</td>
<td>33,494</td>
</tr>
<tr>
<td>Performance</td>
<td>20.5Kdhry.</td>
<td>~40Kdhry.<sup>a</sup></td>
<td>31Kdhry.</td>
</tr>
<tr>
<td>Multiplier</td>
<td>5.3ns/bit</td>
<td>3ns/bit</td>
<td>25ns/bit</td>
</tr>
<tr>
<td>Conditions</td>
<td>5V, 20°C</td>
<td>5V, 20°C</td>
<td>5V, 20MHz</td>
</tr>
<tr>
<td>Power</td>
<td>125mW</td>
<td>N/A<sup>b</sup></td>
<td>148mW</td>
</tr>
<tr>
<td>MIPS/W</td>
<td>77</td>
<td>N/A</td>
<td>120</td>
</tr>
</tbody>
</table>

a. estimated maximum performance  
b. the GPS part does not have separate core supplies
AMULET2

AMULET2 is the second generation AMULET microprocessor core.

Features:

- Data forwarding
  - load forwarding
  - last result reuse

- Branch target prediction

- “Sleep” mode
  - automatic power-down halt on “loop stop”

- Four phase micropipeline control circuits
  - AMULET1’s 2-phase control circuits complicated (and slowed) the latch implementation
AMULET2 Bus Structure

- Conventional bus layout
- Many buses deeply pipelined (not shown)
- Data flow down buses depends only on corresponding units
Example: Register Forwarding

Problem: inter-instruction dependencies cause pipeline stalls

Solution: register forwarding

Register forwarding is non-trivial when different pipeline stages are unsynchronised

A localised “last result” register provides a partial solution
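
The sketch below shows why a single last-result register is only a partial solution. It is a toy model under stated assumptions: a destination register stays locked for the next `lock_window` instructions (a hypothetical figure), and the last-result register can supply only the immediately preceding result.

```python
def count_stalls(program, use_last_result, lock_window=2):
    """Count operand-read stalls in a list of (dest, sources) instructions."""
    stalls = 0
    for i, (_dest, sources) in enumerate(program):
        for distance in range(1, lock_window + 1):
            if i - distance < 0:
                break
            producer_dest = program[i - distance][0]
            if producer_dest in sources:
                # Only the most recent result sits in the last-result register.
                forwarded = use_last_result and distance == 1
                if not forwarded:
                    stalls += 1
                break             # only the nearest producer matters
    return stalls

# Hypothetical instruction stream: (destination, (source, source))
PROGRAM = [
    (1, (2, 3)),
    (4, (1, 5)),   # uses r1 from the previous instruction
    (6, (1, 7)),   # uses r1 from two instructions back
    (8, (9, 10)),
]
```

In this model the back-to-back dependency is removed by the last-result register, while the dependency at distance two still stalls.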
Example: Branch Prediction

Problem: erroneous, speculative fetching wastes time & energy

Solution: branch prediction

Branch prediction must be performed with purely local knowledge

A branch target cache can overrule the PC incrementer for invariant branches
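
A branch target cache that overrules the incrementer can be sketched as follows. The table size and the oldest-first replacement policy are assumptions made for the example, not details taken from the slides.

```python
class BranchTargetCache:
    """Maps addresses of previously taken branches to their targets."""

    def __init__(self, entries=4):       # size is hypothetical
        self.entries = entries
        self.table = {}                   # branch address -> target address

    def predict(self, pc):
        """Next fetch address: cached target if known, else pc + 4."""
        return self.table.get(pc, pc + 4)

    def record_taken(self, pc, target):
        """Remember a taken branch, evicting the oldest entry when full."""
        if pc not in self.table and len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))   # assumed policy
        self.table[pc] = target
```

The prediction needs only the current address and the cache contents, which fits the requirement that it be made with purely local knowledge.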
Address Interface Operation

- An instruction address is sent from the Memory Address Register (MAR)
- The next instruction address is normally ‘predicted’ by the Incrementer (INC)
- Occasionally an address is recognised as the address of a previously taken branch; the Branch Target Cache (BTC) then predicts a different, non-sequential address

If a non-predicted branch occurs it *interrupts* the loop (asynchronously) and substitutes a new PC value.

- Each sequence of prefetch addresses is “coloured”
- A new PC value changes the prefetch colour
- After a branch, subsequent (speculative) instructions in the same colour are discarded until the new colour arrives
- Prefetch depth is non-deterministic!
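
The colouring scheme can be illustrated with a one-bit tag. In this sketch the instruction stream arriving at the execution stage is a list of `(colour, name, is_taken_branch)` tuples; the tuple layout and instruction names are invented for the example.

```python
def execute(stream):
    """Return the names of instructions actually executed.

    The execution stage keeps its own expected colour. A non-predicted
    taken branch flips it, so any instructions already prefetched in the
    old colour are discarded however many of them there happen to be.
    """
    colour = 0
    executed = []
    for tag, name, is_taken_branch in stream:
        if tag != colour:
            continue              # stale speculative prefetch: discard
        executed.append(name)
        if is_taken_branch:
            colour ^= 1           # new PC sent; expect the new colour
    return executed
```

The same result is obtained whether two, five or no stale instructions follow the branch, which is how correct execution is kept despite the non-deterministic prefetch depth.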
Multiplication

Synchronous system

- “Tasks” must occupy an integer number of clock cycles

Asynchronous system

- Periodicity of a given functional unit is arbitrary

Example:

AMULET2 performs multiplication using a free-running carry-save adder unit.

- Cycles in about 30% of the time of a typical instruction cycle
- Performs early termination

A synchronous unit would require either 3x the hardware (to fill the clock cycle) or 3x the clock frequency
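
A sketch of iterative carry-save multiplication with early termination is shown below. The number of multiplier bits consumed per iteration is an assumption for the example; the slides do not state the figure used in AMULET2.

```python
MASK = 0xFFFFFFFF

def csa(a, b, c):
    """3:2 carry-save adder: sum + carry equals a + b + c (mod 2**32)."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK, cy & MASK

def multiply(x, y, bits_per_step=2):
    """Return (32-bit product, iterations used)."""
    x &= MASK
    y &= MASK
    s = c = 0
    shift = 0
    steps = 0
    while y:                                   # early termination
        for j in range(bits_per_step):
            if (y >> j) & 1:
                s, c = csa(s, c, (x << (shift + j)) & MASK)
        y >>= bits_per_step
        shift += bits_per_step
        steps += 1
    return (s + c) & MASK, steps               # one final carry-propagate add
```

Small multipliers finish in few iterations, and because each iteration is short and locally timed, no iteration has to be stretched to fill a clock period.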
AMULET2e

AMULET2e is an integrated processor based around the AMULET2 macrocell.

It comprises:

- AMULET2: a second generation asynchronous ARM
- 4Kbytes of self-timed cache (or memory mapped SRAM)
- A flexibly configurable bus interface
  - supports bus widths of 8/16/32 bits
  - wide range of access times
- The memory interface is simple and designed to interface directly to standard memory and peripheral chips
  - a system can be constructed with few extra components

The processor, cache, and memory controller are all 100% clock-free.
AMULET2e architecture

[Figure: AMULET2e block diagram. Labels: address decode, area enables, AMULET2 core, control registers, tag CAM, data RAM, line fill, funnel and memory control, delay, pipeline latches, Data, Address, chip selects, DRAM control, data in, data out, address.]
Cache

The 4Kbytes of on-board memory are new to AMULET2e

- Comprises 4 independent 1Kbyte cache blocks
  - Each 1Kbyte block is a fully associative, micropipelined CAM-RAM structure

- Cache cycle times vary according to the address pattern
  - CAM look-up is bypassed for sequential addresses in the same CAM line
  - RAM is not precharged between sequential cycles

- The cache line refill process is asynchronous and independent from ‘normal’ cache accesses
  - Forwarding and ‘hit under miss’ are automatic
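
The sequential-access shortcut can be sketched by counting how many accesses in an address trace would need a CAM look-up. The 16-byte line size and 4-byte word step are assumptions for the example.

```python
LINE_BYTES = 16     # assumed line size
WORD_BYTES = 4

def count_cam_lookups(addresses):
    """Count accesses that need a CAM look-up.

    An access skips the look-up when it is sequential to the previous
    access and falls in the same cache line.
    """
    lookups = 0
    previous = None
    for addr in addresses:
        sequential = previous is not None and addr == previous + WORD_BYTES
        same_line = (previous is not None
                     and addr // LINE_BYTES == previous // LINE_BYTES)
        if not (sequential and same_line):
            lookups += 1
        previous = addr
    return lookups
```

Straight-line instruction fetch therefore needs a look-up only once per line under these assumptions, which is where the shorter cache cycles come from.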
Example cycle timing

Shows:
- address output (upper 6 traces)
- data input (lower 3 traces)

Note:
- Cycle time varies
- Prefetch depth varies
- Stall on locked register (R12) then prefetch buffer full
AMULET2e layout

[Figure: AMULET2e layout. Labelled regions: 4 Kbyte Cache, AMULET2 core, ADEC, Funnel, Regs.]
Results

- AMULET2 should be a fully functional ARM microprocessor
  - Simulation of layout runs standard ARM code

- 454,000 transistors
  - 93,000 transistors in microprocessor core

- Fabricated on VLSI Technology “cmn5” process
  - 0.5 μm, 3 layer metal process

- Core size 5mm x 5mm; die size 6.5mm x 6.5mm
- Average cycle time around 24ns (TimeMill simulation)
  - ‘Typical’ silicon
  - Room temperature
  - (Cycle time without cache CAM look-up ~ 20ns)
- Faster than AMULET1
  - roughly 30% faster cycling (allowing for process shrink)
  - load and result forwarding reduces stalls
  - branch prediction removes redundant cycles
- Not as fast as ARM8 or StrongARM
- Observations of traces show some unpredicted behaviour
  - there is significant scope for performance improvement
AMULET2e Results

<table>
<thead>
<tr>
<th></th>
<th>ARM710</th>
<th>AMULET2e</th>
<th>ARM810</th>
</tr>
</thead>
<tbody>
<tr>
<td>Process</td>
<td>0.6μm</td>
<td>0.5μm</td>
<td>0.5μm</td>
</tr>
<tr>
<td>Area (mm²)</td>
<td>34</td>
<td>42<sup>a</sup></td>
<td>40</td>
</tr>
<tr>
<td>Transistors</td>
<td>570 295</td>
<td>454 000</td>
<td>832 776</td>
</tr>
<tr>
<td>Cache size</td>
<td>8Kbytes</td>
<td>4Kbytes</td>
<td>8Kbytes</td>
</tr>
<tr>
<td>Performance<sup>b</sup></td>
<td>23 MIPS</td>
<td>38 MIPS<sup>c</sup></td>
<td>80 MIPS</td>
</tr>
<tr>
<td>Multiplier</td>
<td>20ns/bit</td>
<td>1.7ns/bit</td>
<td>1.7ns/bit</td>
</tr>
<tr>
<td>Conditions</td>
<td>3.3V, 25MHz</td>
<td>3.3V, 20°C</td>
<td>3.3V, 75MHz</td>
</tr>
<tr>
<td>Power</td>
<td>150mW</td>
<td>as yet unknown</td>
<td>500mW</td>
</tr>
<tr>
<td>MIPS/W</td>
<td>192</td>
<td>as yet unknown</td>
<td>160</td>
</tr>
</tbody>
</table>

- a. Includes pad ring
- b. Dhrystone MIPS
- c. Simulated performance – ‘typical’ silicon

- AMULET2e submitted for fabrication 11/7/96
Preliminary Result Analysis

- 3.2 x faster than AMULET1
- Performance is roughly half that of the contemporary ARM (ARM810)
  - Slight loss of ground :-(
- Performance not yet competitive with synchronous chips
- Power consumption measurements await silicon

Excuses

- Moving target
- Only ~12 man years effort
- Many complex pipeline interactions
  - Interactions between stages probably not fully exploited
  - Better tools for load balancing needed
- Cache speed is the major limitation
Critical paths

In an asynchronous system it is difficult to define a single critical path.

The worst areas are:

- Cache
  - especially CAM look up and subsequent decision tree
  - this was our first attempt at cache design

- Address generator
  - too many prefetch addresses buffered
  - (surprisingly) the PC increment loop is a slow cycle

- Data processing is typically 20%-25% faster than the rate at which instructions are supplied
Some unsolved problems

- A design may be correct yet non-deterministic
  - design verification is a problem
  - ‘balancing’ performance is difficult

- Production test
  - scan paths through latches without a common clock?
  - non-determinism

- Marketing
  - how to sell computers which run at subtly different (and variable) speeds?
Conclusions

- Many high-performance techniques can be incorporated in an asynchronous design
  - AMULET2e is roughly 30% faster cycling than AMULET1 (allowing for process shrink)
  - Architectural features also increase performance
- Performance not yet competitive with synchronous chips
- Power consumption measurements await silicon
- Many complex pipeline interactions
  - Interactions between stages probably not fully exploited
  - Better tools for load balancing needed
- Implementation-dependent instruction sets are a bad idea
  - many problems stem from features derived from the original synchronous implementation
References

A number of publications on AMULET1 and some on AMULET2 are available electronically:

- http://www.cs.man.ac.uk/amulet/
- http://maveric0.uwaterloo.ca/amulet/

For example:

*The AMULET2e Cache System* J.D. Garside, S. Temple, R. Mehra

*Dynamic Logic in Four-Phase Micropipelines* S.B. Furber and J. Liu

A wide range of information on other asynchronous logic research is also available on this WWW site.
Acknowledgements

- the CEC (ESPRIT project 6909, OMI/DE-ARM)
- Advanced RISC Machines Limited
- VLSI Technology Limited
- Compass Design Automation Limited
- EPIC Design Technology, Inc.
- The AMULET development team