Applying ELAN Strategies in Simulating Processors over Simple Architectures

Mauricio Ayala-Rincón

Departamento de Matemática, Universidade de Brasília
Brasília D.F., Brasil

Rinaldi Maya Neto, Ricardo P. Jacobi

Departamento de Ciência da Computação, Universidade de Brasília
Brasília D.F., Brasil

Carlos H. Llanos

IESB, Brasília D.F., Brasil

Reiner W. Hartenstein

Fachbereich Informatik, Universität Kaiserslautern
Kaiserslautern, Germany

Abstract
The simulation of processors over simple architectures is important for enabling test and verification prior to the expensive implementation involved in the development of new hardware technologies. Arvind’s group has illustrated how to describe processors by term rewriting systems and has introduced a technique for proving the correctness of specifications for elaborated processors with respect to basic ones. They propose that the described processors should be simulated over standard hardware description languages such as Verilog, after translating these rewrite descriptions adequately, and not directly over the rewriting specifications. In this work we show how rewriting-logic may be applied for purely rewriting based specification as well as simulation of processors. Furthermore, we show how rewriting based simulation may be used for evaluating the performance of important hardware aspects of processors. Rewriting-logic environments such as ELAN, the one we use here, are sufficiently versatile to allow for adequate specifications and simulations which through easy modifications of the strategies enable a dynamic verification of aspects intrinsically related to hardware properties such as the size and control of reorder buffers and the method of predictions used by speculative processors.

©2002 Published by Elsevier Science B. V.
Keywords: Rewriting strategies and logic, hardware design and verification.

1 Introduction

In recent years some work on applying rewriting techniques to the design of hardware has been developed. In particular, Arvind’s group at MIT has treated the implementation of processors over simple architectures [12,13,1], rewrite based description and synthesis of simple logical digital circuits [8] and description of cache protocols over memory systems [14,16]. Their work has made evident the great capacity and possibilities of rewriting as an effective framework for dealing with simulation and estimation of hardware before expensive physical implementations are done. In this work we discuss the advantages and difficulties of a real rewrite based simulation of descriptions of processors over simple architectures. For this purpose we use the well-known rewriting-logic environment ELAN [5,4].

Our work, as that of Arvind group’s in [12,13,1], is focused on the implementation of processors over the AX RISC architecture. Rules for the processors are specified in the ELAN system and different architectural components such as memory, registers, etc. are discriminated in a natural way, taking advantage of this typed language. Proving the soundness of the processors is thus reduced to proving that they simulate and are simulated by a basic processor. In our ELAN approach the separation between logic and rewriting allows us to define rules for the instructions of the processors and to specify strategies describing architectural characteristics as the size of reorder buffers - ROBs. Unlike the approach of Arvind’s group, in our implementations we can simulate the execution of assembly description programs over our rewriting-specified processors; for instance, generation of the Fibonacci sequence, quick-sort, computation of the Knuth-Morris-Pratt jump function, etc., while dynamically changing strategies for estimating the most adequate form of implementing these architectural components. This is all done without translating the rewriting specifications into hardware description languages as suggested by the approach of Arvind’s group. After a simulation is performed, these estimations are given by an analysis of the ELAN statistics for the number of times each rewriting rule (i.e. processor instruction) is applied. Other important architectural aspects such as predictions in processors with speculative execution are implemented in their own rewriting rules. Rewrite based simulation of programs in assembly description does not correspond exactly to the execution of these programs over real architectures, since many additional

---

1 Work supported by the Brazilian/German cooperation of the CAPES/DFG foundations.
2 Partially supported by the FEMAT Brazilian foundation. ayala@mat.unb.br
3 {rinaldi, rjacobi}@cic.unb.br
4 llanos@unb.br
5 hartenst@rhrk.uni-kl.de
steps are executed in a rewriting based system; in particular, many unsuccessful attempts to apply rewriting rules slow down the simulation. However, ELAN statistics allow for a concrete estimation of how these processors should work in real hardware implementations.

Additionally, we point out some interesting problems inherent in the way rewriting rules (and strategies) are applied in true purely rewrite based systems. In particular, that rules are not naturally applied in a non-deterministic manner; they are selected as the first applicable rule found in the order the rules are defined and applied in the first positions (left-most, inner-most or similarly) that they match over the target term. In our implementation these problems arise when important architectural aspects as out-of-order execution of instruction templates over ROBs are to be simulated. Although our implementations are deterministic we comment on how one can overcome these problems in a non-purely rewriting system like ELAN, where some non-deterministic strategies are available.

2 Architecture and processors descriptions

We assume familiarity with the basic concepts of computer architecture and rewriting theory as presented respectively in [7] and [3]. Additionally, we suppose the reader familiar with rewriting-logic environments like ELAN. We briefly describe the AX RISC architecture and the specifications in ELAN of a basic processor over this architecture and a more elaborated one that allows for speculative execution over a reorder buffer.

2.1 The AX RISC architecture

$AX$ is a set of RISC instructions where all memory access is done by load and store instructions and the arithmetic operations are done over the registers at the register file ($rf$). A sequence of instructions that describes a program is placed at the instruction memory ($im$). The instructions are executed in-order and after each instruction execution the contents of the program counter ($pc$) is incremented by one except for branch instructions ($Jz$).

The set of different instructions of $AX$, $INST$, is described as:

$$INST \equiv r := Loadc(v) \quad Jz(r_1, r_2) \quad r := Loadp \quad r := Op(r_1, r_2) \quad r := Load(r_1) \quad Store(r_1, r_2)$$

The load-constant instruction, $r := Loadc(v)$, puts the constant $v$ into the register $r$. The load-program-counter instruction, $r := Loadp$, puts the content of the program counter into the register $r$. The arithmetic-operation instruction, $r := Op(r_1, r_2)$, performs the (abstract) arithmetic operation specified by $Op$ on the operands specified by the registers $r_1$ and $r_2$ and puts the result into the register $r$. The branch instruction $Jz(r_1, r_2)$, sets the program counter to the target instruction address specified by the register $r_2$ when the
contents of the register \( r_1 \) is zero and increments the program counter by one otherwise. The load instruction, \( r := \text{Load}(r_1) \), loads the memory cell specified by the register \( r_1 \) into the register \( r \). The store instruction, \( \text{Store}(r_1, r_2) \), stores the contents of the register \( r_2 \) into the memory cell specified by the register \( r_1 \).

2.2 Basic processor

The operational semantics of the \( \mathcal{AX} \) RISC instruction set is defined by a single-cycle, non-pipelined, in-order execution processor that we will call the basic processor. See figure 1 for a description in register transfer level.

The description of the system \( \text{Sys} \) is given by its memory \( m \) and processor \( \text{Proc}: \text{Sys}(m, \text{Proc}) \), the latter consists of the instruction address \( \text{ia} \) of the \( pc \), the register file \( \text{rf} \) and the program \( \text{prog}: \text{Proc}(\text{ia}, \text{rf}, \text{prog}) \).

The rewriting rules implementing the \( \mathcal{AX} \) instructions in ELAN are given in the Table 1. This follows straightforwardly from the operational semantics given earlier of these instructions. We explain the most complex of these rules: the branch instruction \( Jz \). All other rules are similarly explained.

Whenever the current instruction of the program \( \text{prog} \) at the position (of the instruction memory \( \text{im} \)) given by the instruction address \( \text{ia} \) is a branch instruction of the form \( Jz(r_1, r_2) \), the program counter should be changed either by the contents of the register \( r_2 \) or by \( \text{ia}+1 \). The former, in the case that the contents of the register \( r_1 \) equals zero (checked by \( \text{valueofReg}(r_1, \text{rf}) == 0 \)); the last, otherwise (checked by \( \text{valueofReg}(r_1, \text{rf}) != 0 \)). The auxiliary premise \( \text{isinstJz}(\text{selectinst}(\text{prog}, \text{ia})) \) checks whether the current instruction is a branch. The role of the "\( \text{where } _ :=() _ \)" commands is to set auxiliary variables.

2.3 Implementation of a processor with speculative execution over a ROB

As in [12,13,1] more sophisticated processors may be described by rewriting rules and then proved correct by showing that they are simulated by the basic processor and simulate the basic processor. Here we describe the implementation of a processor that does speculative execution over a ROB. See figure 2. A ROB holds instructions that have been decoded but have not completed their execution. Conceptually, the ROB divides the processor into two asyn-
Rewriting rules for the basic processor

Table 1

Fig 2. Description of the speculative processor

chronous parts. The first one fetches the instruction and after decoding and renaming registers, dumps it into the next available slot in the ROB. The ROB slot index serves the purpose of the renaming tag, and the instruction templates in the ROB (ITB) always contain tags or values instead of register names. An instruction template in the ROB can be solved (“executed”) if all its operands are available. The second part takes any enabled instruction
out of the ROB and dispatches it to an appropriate functional unit, including
the memory system (then “execution” is completed). This mechanism is very
similar to the execution mechanism in data flow architectures. Such an ar-
nitecture may execute instructions out-of-order, especially if functional units
have different latencies or there are data dependencies between instructions.
Additionally, speculative execution of instruction is allowed. The specula-
tive mechanisms predicts the address of the next instruction to be issued based on
the past behavior of the programs. The address of the speculative instruction
is determined by consulting a table known as the branch target buffer - BTB,
which can be indexed by the current content of the program counter. If the
prediction turns out to be wrong, the speculative instruction and all the in-
structions issued thereafter are abandoned and their effect on the processor
state nullified. The BTB is updated according to some prediction scheme after
each execution branch resolutions.

As for the basic processor the system $\text{Sys}$ is described by its memory $m$
and processor $\text{Proc}$: $\text{Sys}(m, \text{Proc})$. But in contrast, the processor consists of
the $\text{ia}$ of the program counter, the register file $\text{rf}$, the program $\text{prog}$ as well
as of the ITB ($\text{itb}$) and the BTB ($\text{ttb}$): $\text{Proc}((\text{ia}, \text{rf}, \text{itb}, \text{ttb}, \text{prog}))$.

In the sequel we present the ELAN implementation of the rules of the
speculative processor and after that we explain the operational semantics of
some of these rules. The rules are divided into four classes: arithmetic and
value propagation rules; instruction issue rules; branch completion rules and
memory access rules. These sets of rewriting rules are presented in the Tables
2, 3, 4 and 5, respectively. Instead of the symbol “:=”, that is reserved in
ELAN, setting in the issued rules at the ITB is denoted by “|−”.

Arithmetic operation and value propagation rules (Table 2) deal with the
computation of arithmetic operations ($\text{PsOp}$), the propagation of its results
through the ITB ($\text{PsValueForward}$) and the exclusion of the instruction template
from the ITB when the result had already been solved and committed to the

Table 2
Arithmetic Operation and Value Propagation Rules

\[
\text{Table 2}
\]

\begin{center}
\begin{tabular}{|c|c|}
\hline
$\text{PsOp}$ & \\
\hline
$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{ITB}(\text{ia}, k, t(k)|−\text{Op}(v, v1, wf, sf).\text{itb}, \text{ttb}, \text{prog})) \Rightarrow$

$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{ITB}(\text{ia}, k, t(k)|−\text{execOpval}(v, v1, wf, sf).\text{itb}, \text{ttb}, \text{prog})) \text{ end}$

$\text{PsValueForward}$ & \\
$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{ITB}(\text{ia}, k, t(k)|−v, wf, sf).\text{itb}, \text{ttb}, \text{prog})) \Rightarrow$

$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{ITB}(\text{ia}, k, t(k)|−v, wf, sf).\text{ValueForward}(t(k), v, \text{itb}, \text{ttb}, \text{prog}) \text{ if TagExists}(t(k), \text{itb}) \text{ end}$

$\text{PsValueCommit}$ & \\
$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{ITB}(\text{ia}, k, t(k)|−v, \text{Wreg}(r), sf).\text{itb}, \text{ttb}, \text{prog})) \Rightarrow$

$\text{Sys}(m, \text{Proc}(\text{ia}, \text{rf}, \text{itb}, \text{ttb}, \text{prog}) \text{ if not TagExists}(t(k), \text{itb}) \text{ end}$

\hline
\end{tabular}
\end{center}
register file (this means the renaming tag it addresses does not occur in the buffer anymore, what is decided by $TagExists(t(k),itbs)$). A value is only committed to the register file when the instruction referencing it is on the head of the ITB, this approach is conservative since it avoids the need to reconstruct the state of the register file in the event of wrong speculations.

Issue rules (Table 3) are those used for the issuing of the instructions which generate templates stored in the ITB. Branch completion rules (Table 4) are those which deal with the resolution of speculations. When a branch instruction is issued the processor has to know which will be the next instruction to be fetched. The next predicted instruction is indicated by the BTB that is an indexed table. Suppose that the program instruction at position $ia$ is being issued, the next value of the program counter, called $pia$, is looked up in the BTB using as index the current program counter ($pia :=() getb$($ia,pia$)) and then the execution resumes at the $pia$ value. When the ITB element containing the branch instruction reaches the head of the ITB it is the time to check if the speculation was done correctly or if the processor needs to fix the mistake and restart the execution at the correct program counter value. In the last case, the remaining instructions already in the ITB should be ignored. The rules in the Table 4 deal with this issue. Exemplifying, suppose that the head of the ITB is of the form $ITB(ia,k,Jx(v,nia),wf,Spec(pia))$, the branch completion rule has to check whether the value $v$ is zero or not and then, respectively, check whether either the speculated address $pia$ coincides with $nia$ or with $ia+1$. In this event the prediction has been proved correct and the execution resumes. Otherwise the program counter must be set, respectively, either to the value of $ia+1$ or to the correct value of the branch represented by $nia$, depending on whether the wrong speculation was a no jump or a jump, and the ITB must be completely emptied because the remaining instructions should not be executed. These rules also control the updating of the BTB for dynamic speculation (through the rules which define $changeb$).

The memory access rules (Table 5) $P$s$Load$ and $P$s$Store$ deal with the ROB and the data memory communication. These rules are applied after the processor has resolved all values of the tags of the instruction templates stored in the ITB by the $P$s$Load$Issue and $P$s$Store$Issue issue rules, respectively.

2.4 Proving correctness of processors

One useful feature of this rewrite based specification of processors is the possibility of proving the correctness of the implementation of some instruction set describing a processor. This is done by showing that one implementation simulates another in regard of some observation function $[12,13,1]$. The main idea is to design a function that can extract all the programmer visible states; i.e., the program counter, the register file and the memory from the system. The proof of the correctness of our specification of the speculative processor, here given for completeness of the presentation, follows the lines in $[12]$. 

7
|PsLoadcIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia+1, rf, \text{insEndITBs}(\text{ITB}(ia, k, t(k)\rightleftharpoons v, \text{Wreg}(r), \text{NoSpec}), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstLoadc(instTa)

\hspace{1em}where \ r := () nameofLoadc(instTa)

\hspace{1em}where \ v := () valueofLoadc(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1 end

|PsLoadpcIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia+1, rf, \text{insEndITBs}(\text{ITB}(ia, k, t(k)\rightleftharpoons \text{ia}, \text{Wreg}(r), \text{NoSpec}), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstLoadpc(instTa)

\hspace{1em}where \ r := () nameofLoadpc(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1 end

|PsOpIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia+1, rf, \text{insEndITBs}(\text{ITB}(ia, k, t(k)\rightleftharpoons \text{Op}(k1, k2), \text{Wreg}(r), \text{NoSpec}), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstOp(instTa)

\hspace{1em}where \ r1 := () reg1ofOp(instTa) where \ r2 := () reg2ofOp(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1

\hspace{1em}where \ k1 := () searchforLastTag(r1, rf, itbs)

\hspace{1em}where \ k2 := () searchforLastTag(r2, rf, itbs) end

|PsJzIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia, rf, \text{insEndITBs}(\text{ITB}(ia, k, Jz(k0, k1), \text{NoWreg}, \text{Spec}(pia)), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstJz(instTa)

\hspace{1em}where \ r := () reg1ofJz(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1

\hspace{1em}where \ k0 := () searchforLastTag(r1, rf, itbs)

\hspace{1em}where \ pia := () getttb(ia, btb) end

|PsLoadIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia+1, rf, \text{insEndITBs}(\text{ITB}(ia, k, t(k)\rightleftharpoons \text{Load}(k1), \text{Wreg}(r), \text{NoSpec}), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstLoad(instTa)

\hspace{1em}where \ r := () nameofLoad(instTa)

\hspace{1em}where \ r0 := () reg1ofLoad(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1

\hspace{1em}where \ k1 := () searchforLastTag(r0, rf, itbs) end

|PsStoreIssue| $\text{Sys}(m, \text{Proc}(ia, rf, itbs, btb, prog)) \Rightarrow \text{Sys}(m, \text{Proc}(ia+1, rf, \text{insEndITBs}(\text{ITB}(ia, k, \text{Store(k0, k1), NoWreg, NoSpec}), itbs), btb, prog))$

\hspace{1em}where \ instTa := () selectinst(prog, ia) if isinstStore(instTa)

\hspace{1em}where \ r0 := () nameofStoreR1(instTa)

\hspace{1em}where \ r1 := () nameofStoreR2(instTa)

\hspace{1em}where \ k := () lengthof(itbs)+1

\hspace{1em}where \ k0 := () searchforLastTag(r0, rf, itbs)

\hspace{1em}where \ k1 := () searchforLastTag(r1, rf, itbs) end

\textbf{Table 3}

\textbf{Instruction Issue Rules}

8
\begin{table}[h]
\centering
\begin{tabular}{|l|}
\hline
\textbf{PsJumpCorrectSpec} \hspace{1cm} \textbf{PsJumpWrongSpec} \\
\begin{align*}
\text{Sys}(m, \text{Proc}(ia, rf, \text{ITB}(ia1, k, \text{Jz}(0, nia), \text{wf}, \text{Spec}(pia)), \text{itbs}, \text{btb}, \text{prog})) & \Rightarrow \\
\text{Sys}(m, \text{Proc}(ia, rf, \text{itbs}, \text{btb}, \text{prog})) & \text{ if pia} = \text{nia} \text{ end}
\end{align*} \\
\begin{align*}
\text{Sys}(m, \text{Proc}(ia, rf, \text{ITB}(ia1, k, \text{Jz}(0, nia), \text{wf}, \text{Spec}(pia)), \text{itbs}, \text{btb}, \text{prog})) & \Rightarrow \\
\text{Sys}(m, \text{Proc}(nia, rf, \text{nilitb}, \text{btb1}, \text{prog})) & \\
\text{ if pia} \neq \text{nia} \text{ where btb1} := () \text{ changebtb}(ia1, \text{nia}, \text{btb}) \text{ end}
\end{align*} \\
\text{Sys}(m, \text{Proc}(ia, rf, \text{ITB}(ia1, k, \text{Jz}(v, nia), \text{wf}, \text{Spec}(pia)), \text{itbs}, \text{btb}, \text{prog})) & \Rightarrow \\
\text{Sys}(m, \text{Proc}(ia, rf, \text{itbs}, \text{btb}, \text{prog})) & \text{ if } v \neq 0 \text{ and pia} = ia1+1 \text{ end}
\end{tabular}
\\hline
\end{table}

\begin{table}[h]
\centering
\begin{tabular}{|l|}
\hline
\textbf{PsNoJumpCorrectSpec} \hspace{1cm} \textbf{PsNoJumpWrongSpec} \\
\begin{align*}
\text{Sys}(m, \text{Proc}(ia, rf, \text{ITB}(ia1, k, \text{Jz}(v, nia), \text{wf}, \text{Spec}(pia)), \text{itbs}, \text{btb}, \text{prog})) & \Rightarrow \\
\text{Sys}(m, \text{Proc}(ia1+1, rf, \text{nilitb}, \text{btb1}, \text{prog})) & \\
\text{ if } v \neq 0 \text{ and pia} \neq ia1+1 & \\
\text{ where btb1} := () \text{ changebtb}(ia1, ia1+1, \text{btb}) \text{ end}
\end{align*}
\\hline
\end{table}

It is easy to show that the speculative processor simulates the basic one. One basic processor term can be “upgraded” to one of the speculative processor simply by adding an empty ITB and an arbitrary BTB to the processor.

\textbf{Definition 1 (ITBL)} \textit{The Instruction Template Buffer Lift of a basic processor term is defined by}

\[ \text{ITBL}(\text{Sys}(m, \text{Proc}(ia, rf, \text{prog}))) \equiv \text{Sys}(m, \text{Proc}(ia, rf, \text{nilitb}, \text{btb}, \text{prog})) \]

where \text{btb} is an arbitrary \text{BTB} and \text{nilitb} and empty ITB.

\textbf{Theorem 1} \textit{Let s and t be system terms of the basic processor. If } s \rightarrow^* t \text{ in the basic processor, then } \text{ITBL}(s) \rightarrow^* \text{ITBL}(t) \text{ in the speculative processor.}

\textbf{Proof.} Sequences of rules of the speculative processor can simulate each basic processor rule. For example, the Op rule in the basic processor can be simulated by consecutively applying the PSoPIssue, PsOp and PsValueCommit rules in the speculative processor; the Load rule in the basic processor can be simulated by consecutively applying the PsLoadIssue, PsValueForward, PsLoad and PsValueCommit rules in the speculative processor; etc. \hfill \Box

Now we need to define a projection function from the speculative processor to the basic processor. This is not simple because of the partially executed
instructions. The approach in [12] is based on flushing instructions in the ITB. The key observation is that during some time of execution over an speculative processor, if no instruction is issued then the ITB will soon become empty. Only instruction issue rules can further expand the ITB. So, we can define another rewriting system which uses the same grammar as the speculative processor and include all its rules except the instruction issue ones.

**Definition 2** The rewriting system $R_{ITBF}$ over terms of the speculative processor is given by the set of rewriting rules \{PsOp, PsValueForward, PsValueCommit, PsJumpCorrectSpec, PsJumpWrongSpec, PsNoJumpCorrectSpec, PsNoJumpWrongSpec, PsLoad, PsStore\}.

One can prove that the rewriting system $R_{ITBF}$ is strongly terminating and confluent and that its normal forms have always empty ITBs.

**Definition 3 (ITBF)** Let $\text{Sys}(m, \text{Proc}(ia, rf, nilitb, btb, prog))$ be the $R_{ITBF}$ normal form of a given term of the speculative processor $s$. The instruction template buffer flush of $s$, denoted by $ITBF(s)$, is the result of deleting from this $R_{ITBF}$ normal form its empty ITB and its $BTB$: $\text{Sys}(m, \text{Proc}(ia, rf, prog))$.

**Theorem 2** Let $s$ and $t$ be system terms of the speculative processor. If $s \rightarrow^* t$, then $ITBF(s) \rightarrow^* ITBF(t)$ in the basic processor.

<table>
<thead>
<tr>
<th>Speculative processor issue rules</th>
<th>basic processor rules</th>
</tr>
</thead>
<tbody>
<tr>
<td>PsLoadcIssue</td>
<td>Loadc</td>
</tr>
<tr>
<td>PsLoadpcIssue</td>
<td>Loadpc</td>
</tr>
<tr>
<td>PsOpIssue</td>
<td>Op</td>
</tr>
<tr>
<td>PsJzIssue</td>
<td>Jz</td>
</tr>
<tr>
<td>PsLoadIssue</td>
<td>Load</td>
</tr>
<tr>
<td>PsStoreIssue</td>
<td>Store</td>
</tr>
</tbody>
</table>

Table 6
Correspondence between speculative issue rules and basic processor rules

**Proof.** The proof is by induction on the number of rewrite steps $n$ on the derivation $s \rightarrow^n t$. For $n = 0$ this is obvious. For the inductive step, assume $s \rightarrow i$ by applying the rule $\alpha$. If $\alpha \in R_{ITBF}$, then $ITBF(s)$ and $ITBF(t)$ coincide. If $\alpha \notin R_{ITBF}$, that is $\alpha$ is an instruction issue rule, then we will prove that either $ITBF(s)$ and $ITBF(t)$ coincide or $ITBF(s)$ can be rewritten into $ITBF(t)$ by applying an appropriate basic processor rule.

Suppose $s \rightarrow s_1$ by applying a rule $\beta \in R_{ITBF}$. We have two cases:

Case 1. $\beta$ is a mis-prediction-recover rule: PsJumpWrongSpec or PsNoJumpWrongSpec. By applying $\beta$ to $t$ we have $t \rightarrow s_1$, since the instruction issuing will be canceled by the mis-prediction-recover rule.

Case 2. $\beta$ is not a mis-prediction-recover rule. In this case we can notice that $\alpha$ can also be applied to $s_1$. Suppose that $s_1 \rightarrow t_1$ by applying $\alpha$. If $\alpha$ is PsValueCommit and the register to which the value is committed is
referenced as an operand register in the instruction issued by $\alpha$, then $t \rightarrow^* t_1$
by first applying once or twice $\text{PsValueForward}$, and then applying the rule $\beta$.
Otherwise $t \rightarrow t_1$ by applying $\beta$.

Let $s_n$ be the $R_{ITBF}$ normal form of $s$. We have two cases to consider:
Case 1. If this normalization from $s$ into $s_n$ invokes one mis-prediction-recover rule, then there are terms $s_i, t_i$ and $s_{i+1}$ such that $s_i \rightarrow t_i$ by applying $\alpha$, $s_i \rightarrow s_{i+1}$ and $t_i \rightarrow s_{i+1}$ by applying the mis-prediction-recover rule. This implies that $s$ and $t$ have identical $R_{ITBF}$ normal forms.
Case 2. In the other case, by induction, we have that $\alpha$ can be applied to $s_n$ to yield $t_n$ such that $t \rightarrow^* t_n$ by applying just $R_{ITBF}$ rules.

Let $t_{n+1}$ be the $R_{ITBF}$ normal form of $t_n$. Since $s_n$ and $t_{n+1}$ both have an empty ITB, we can easily show that $ITBF(s_n) \rightarrow ITBF(t_{n+1})$ by applying the corresponding basic processor rule according to the Table 6.

\[ \square \]

3 Benefits of the separation between logic and rewriting in simulating processors

The natural separation in ELAN between rewriting and logic enables the controlled application of rules (i.e., processor instructions) and the adequate simulation of many interesting elements of hardware. For instance, the size of ROBs is one of the basic hardware ingredients of the speculative processor that is controlled by ELAN strategies. In fact, ROBs are controlled by specifying strategies which restrict the number of applications of issue rules. Suppose you want to simulate a ROB of size $n$, that should completely be filled and emptied alternately. Then the following simple ELAN strategy is used:

\[
\text{repeat} * \left\{ \begin{array}{l}
\text{first one(issue rules);}
\text{first one(issue rules$\cup$ id);} \\
\vdots
\text{first one(issue rules$\cup$ id);} \\
\text{normalise(first one(non_issue rules))}
\end{array} \right\} \text{n-1}
\]

Other strategies for handling the ROBs can similarly be specified. For example, for maintaining a ROB of size $n$ filled during the whole execution, one can start as before, but in the subsequent normalization with all non issue rules ($R_{ITBF}$ normalization) these rules should be treated individually. This treatment depends on whether the given non issue rule maintains or decreases the number of instruction templates in the ROB. For instance, since after a wrong branch speculation (rule $\text{PsJumpWrongSpec}$) the ROB is emptied the strategy should immediately fill the ROB by applying $n$ issue rules. Below we sketch this strategy showing the case of the rule $\text{PsJumpWrongSpec}$. 

11
In contrast to the control of ROBs other interesting aspects of processors as the method of branching prediction are directly controlled by the rewriting rules. The advantages of having ROBs is that instruction templates may be charged and these templates partially executed by the pipeline control. When at a point of the computation, determined by the program counter ia, a branch instruction template Jz (r1, r2) is charged into the ROB, it is undecided which is the following instruction template to be charged into the ROB, since at this point of the computation the values of the tags associated with the registers r1 and r2 are not necessarily resolved. Thus in speculative processors one has to decide which instruction template is the next to be charged according to the contents for the ia in the BTB. Well-known dynamic branch prediction schemes are specified by simple rewriting rules. We mention here the 1-bit and 2-bit dynamic prediction methods [15]. Initially, any prediction is given in the BTB. For instance, one can give pairs (1, 2), ..., (j, j + 1), ..., (n, n + 1) meaning that after execution of the jth instruction the prediction is to jump to the next instruction (j + 1th) of the program. These predictions (i.e., pairs) are only necessary for the addresses of branch instructions in the program. Subsequently, the predictions are modified according to the execution history.

In 1-bit dynamic prediction, the prediction for the nth instruction is actualized according to the next instruction to be executed. Once a prediction fails the corresponding value in the BTB is changed to the correct address of the instruction to be executed.

In 2-bit dynamic prediction, there are four different states of the prediction: strongly taken, weakly taken, weakly not taken, strongly not taken. If the state is either strongly (not) taken or weakly (not) taken and the prediction is correct: “jump” (“next instruction”), then the state is changed to strongly (not) taken. If the state is strongly (not) taken and the prediction fails: “next instruction” (“jump”), then the state is changed to weakly (not) taken. If the state is weakly (not) taken and the prediction fails: “next instruction” (“jump”), then the BTB is modified according to the correct address given by the contents of the second register of the branch instruction and the state is changed to weakly not taken (weakly taken).
Both prediction strategies are specified and simulated by purely rewriting. This is implemented by simple boolean conditions over the branch completion rules: comparisons between \texttt{pia} (predicted instruction address), \texttt{nia} (next correct instruction address) and \texttt{ia+1} (next instruction address in the program) for the four branch completion rules in the Table 4. Once a prediction fails, the BTB is modified by the function \texttt{changebtb}, that is specified by purely rewriting and adapted for the two prediction methods.

Furthermore, the performance of different ways to implement proposed processors can be determined by analyzing the ELAN statistics. For instance, one can estimate whether 1-bit performs better than 2-bit prediction for the execution of an assembly description of quick-sort over the speculative processor implemented with the strategy of alternatively filling and emptying the ROB. The total number of wrong and correct predictions (i.e., number of applications of branch completion rules in the Table 4) for ordered (the worst-case of quick-sort) and (the average for) random lists are given in the Table 7.

The observation of the differences between the number of wrong predictions for both methods gives an important insight about the advantages of 2-bit over 1-bit prediction, since in the worst-case a wrong prediction flushes the ROB which has been filled with instruction templates over which previous operations have been executed. One can check on the table that the differences between the number of wrong predictions for the two methods is much more significant for ordered lists than for random lists. Consequently, the physical hardware implementation of a processor dedicated to this kind of sorting for random inputs can be performed with the simplest (and cheaper) 1-bit method.

Important aspects like out-of-order execution are not easy to implement in practical purely rewrite based programming environments. In fact, out-of-order execution of instruction templates over a ROB can only be simulated by allowing a truly non-deterministic application of the rewriting rules (i.e., processor instructions) over the ROB during any time of the computation. For allowing out-of-order execution, instead of the usual \texttt{cons} operator “,” of instruction templates and ITBs (which appears as \texttt{inst_temp.itbs} in our implementation) a new operator “#” is defined for concatenating ITBs and/or instruction templates. Thus ITBs are represented as \texttt{itbs1#inst_temp#itbs2} being \texttt{itbs1} and \texttt{itbs2} lists of instruction templates and \texttt{inst_temp} a sole instruction template. The rewriting rules are modified by replacing all their

\begin{table}[!h]
\centering
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
  & \multicolumn{2}{c|}{10} & \multicolumn{2}{c|}{10} & \multicolumn{2}{c|}{20} & \multicolumn{2}{c|}{20} & \multicolumn{2}{c|}{30} & \multicolumn{2}{c|}{30} & \multicolumn{2}{c|}{40} & \multicolumn{2}{c|}{40} & \multicolumn{2}{c|}{50} & \multicolumn{2}{c|}{50} \\
\hline
\text{Size} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} & \text{ran} & \text{ord} \\
\hline
1-bit & \text{correct} & 30 & 60 & 109 & 225 & 185 & 490 & 278 & 855 & 401 & 1320 & \text{wrong} & 34 & 34 & 72 & 74 & 114 & 114 & 159 & 154 & 196 & 194 \\
\hline
2-bit & \text{correct} & 28 & 73 & 120 & 258 & 194 & 543 & 286 & 928 & 407 & 1413 & \text{wrong} & 36 & 21 & 61 & 41 & 105 & 61 & 151 & 81 & 200 & 101 \\
\hline
\end{tabular}
\caption{Table 7
Elan statistics for quick-sort executed with 1-bit and 2-bit dynamic predictions}
\end{table}
ITBs with this new representation as we illustrate for the [PsOp] rule below:

\[
\text{Sys(m,Proc(ia,rf,Itbs1#ITB(ia1,k,t(k)|-Op(v,v1),wf,\text{sf})#itbs2,btb,prog))} \Rightarrow \\
\text{Sys(m,Proc(ia,rf,Itbs1#ITB(ia1,k,t(k)|-exec\text{Op\text{on\text{val}}}(v,v1),wf,\text{sf})#itbs2,btb,prog)) end}
\]

By matching the instruction template, ITB(ia1,k,t(k)|-0p(v,v1),wf,\text{sf}), the new [PsOp] rule can be applied not only at the first but at any position of the current ITB: itbs1#ITB(ia1,k,t(k)|-0p(v,v1),wf,\text{sf})#itbs2. In the theory, the rewriting system obtained by modifying all the rules as suggested above enables out-of-order execution, since rewriting rules are applied nondeterministically. But in the practice, in purely rewrite-based programming environments, this solution does not work since the application of a rule is decided by searching for either left-most or right-most (inner-most) reduces over the ITBs (according to the way the constructor “#” is defined) [9].

For rewriting-based implementations of a real out-of-order execution mechanism, the availability of true non-deterministic strategies is necessary. With some additional effort, in a rewriting-logic based system as ELAN strategy constructors like don't know choose (that gives all possible reducts) can be adapted for simulating the needed non-determinism over the ROBs [17,10,11].

4 Conclusions and future work

We have shown how processors may be specified and their execution simulated over rewriting(-logic) systems. Unlike Arvind’s group, who proposes the simulation of the execution of these specifications over standard hardware description languages, we address the simulation of the execution of processors directly over the rewriting specification avoiding the cost of program translation. Furthermore, we have illustrated why the rewriting part as well as the logical part of ELAN are adequate for the simulation of simple hardware components like the method of prediction in speculative processors (done in our case by pure rewriting) and control of the size of ROBs (done in our case by logic strategies). After having specified the rewriting rules for the instruction set of a processor, the intrinsic separation between logic and rewriting in ELAN results in enough versatility for dealing with different conceptions of manipulation of ROBs without additional effort in these rewrite specifications. Additionally, we illustrate how statistics of the application of rewriting rules may be used for estimating and comparing the performance of different processors. Although not done in our implementation, non-deterministic strategies implemented in ELAN also may be shown to be adequate for simulating essential hardware conceptions of these processors like the out-of-order execution of instruction templates in ROBs.

Through rewriting-logic one can describe an architecture as precisely as one wants. For example, rules of the speculative processor may be atomized in order to reflect the behavior of lower-level hardware components such as pipelines and functional units of processors like fetch, decode and execution units. Also, the (higher-order) rewriting-logic based simulation of reconfig-
urable processors [6], which are non-standard models of computing where two layers of instructions are needed (the one for the instruction set and the other for the processor reconfiguration) is of great interest, since no simulation is possible over standard hardware description languages such as Verilog and VHDL. One of our current goals is to analyze the possibilities of using rewriting for synthesizing (logic components for building) logical circuits for arithmetic operators at their layout level [2]. One of the interesting aspects that emerges from this study is the necessity of new hardware oriented notions of normal forms, since the more adequate algebraic expressions to be transformed into circuits are the ones with more regularities. These are consequently the ones that can be implemented with the smallest number of different classes of atomic hardware components, and are not the simplest ones from the algebraic point of view, which is the norm in rewriting.

References


