
16b3x CPU

date: 2023-04 to 2023-06 ; 2023-08 to 2023-09 ; 2023-12 to 2024-01 ; 2024-04 to 2024-05
desc: third CPU design, vastly more complicated than previous designs ; implemented cache, pipelining, out of order execution
proj: https://github.com/sarvl/16bit_cpu




OVERVIEW

to read about current implementation and skip historical info, read following sections:
OVERVIEW
ISA
ASSEMBLER
SIMULATOR
TESTS
CHIP DESIGN
CACHE V1
OUT OF ORDER EXECUTION V2
BRANCH PREDICTION
PERFORMANCE EVALUATION
THE GOOD
THE BAD
THE UGLY

there are a lot more features in this one than in previous designs as indicated by length of this page
there are around 11000 words

this is RTL simulation only project in VHDL
implementing CPU with memory and cache that supports custom ISA and some interesting computer architecture features

current CPU is very aggressive in extracting performance from every program by implementing caching, superscalar out of order execution and branch prediction
speedup of around 1.4 is easily achievable for most programs without particular optimizations
with optimizations 1.85 speedup has been achieved, making me think that with extreme care, 2x might be possible or is just slightly out of reach

the entire project also implements:
assembler with support for macros that greatly simplify any bigger project
simulator in c++ that checks logic of programs and generates some useful performance info
test framework to make sure everything really works

unfortunately the project does not concern itself with memory too much and the main CPU does not interface out of the box with any of the caches

the entire project went through many iterations and followed me on my path of learning how modern high performance processors work
the project went much further than I thought, it was supposed to be 1 month of experimentation with caches and pipelining


while there still are interesting things that could be done it is much more interesting to work on something that can be executed on real HW
except maybe for learning different HDL
so this is the last project that focuses exclusively on simulation and there will be nothing new added
the website still may change if bug in the description or explanation is found, if so there will be note on the main site

some sections have only high level overview without much interesting details, that is because I simply do not remember enough

since this text is so long it is inevitable that some mistakes crept in
if there is anything that does not make sense or is weirdly phrased, let me know

ISA

this time it does not make sense to write entire ISA here, this is just overview, see full document

Instructions

  1. nop # ---
  2. hlt ccc # IF(FL & ccc) { halt }
  3. #
  4. #
  5. #
  6. mov Rd, Rs/imm8 # Rd <-- Rs/imm16
  7. rdm Rd, Rs/imm8 # Rd <-- M[Rs/imm16]
  8. wrm Rd, Rs/imm8 # M[Rs/imm16] <-- Rd
  9. #
  10. #
  11. #
  12. #
  13. rdx Rd, Es # Rd <-- Es
  14. wrx Ed, Rs/imm8 # Ed <-- Rs/imm16
  15. psh Rd # SP <-- SP - 2 ; M[SP] <-- Rd
  16. pop Rd # Rd <-- M[SP] ; SP <-- SP + 2;
  17. mul Rd, Rs/imm8 # Rd <-- Rd * Rs/imm16
  18. cmp Rd, Rs/imm8 # Rd - Rs/imm16
  19. #
  20. tst Rd, Rs/imm8 # Rd & Rs/imm16
  21. jmp ccc Rs/imm8 # IF(FL & ccc) { IP <-- Rs/imm16 }
  22. cal ccc Rs/imm8 # IF(FL & ccc) { LR <-- IP; IP <-- Rs/imm16 }
  23. ret ccc # IF(FL & ccc) { IP <-- LR }
  24. #
  25. add Rd, Rs/imm8 # Rd <-- Rd + Rs/imm16
  26. sub Rd, Rs/imm8 # Rd <-- Rd - Rs/imm16
  27. not Rd, Rs/imm8 # Rd <-- ~ Rs/imm16
  28. and Rd, Rs/imm8 # Rd <-- Rd & Rs/imm16
  29. orr Rd, Rs/imm8 # Rd <-- Rd | Rs/imm16
  30. xor Rd, Rs/imm8 # Rd <-- Rd ^ Rs/imm16
  31. sll Rd, Rs/imm4 # Rd <-- Rd << Rs/imm4
  32. slr Rd, Rs/imm4 # Rd <-- Rd >> Rs/imm4
M[x] means memory at address x
Rx means value of register x
Ex means value of External Register x
// external is not the best name for special purpose registers
IP LR SP FL are particular ERs
instructions here take 16bit immediate even though only 8bits of immediate can be specified in instruction, this is thanks to UI ER

Instruction Format

in previous designs, opcode is always in the same place, this is convenient but wastes space for reg-reg instructions
since r-r instructions require fewer bits, for them the first 5 bits are 00000 and opcode is located at the end
if the first 5 bits are not 00000 then instruction is r-immediate
this way 63 different instructions are possible, opcode place determines only whether second operand is immediate or register
registers and immediate are still always in the same spot which makes decoding fairly straightforward
	R-format:
		00000BBB CCCDDDDD
		
		from MSb to LSb:
		     5 bits are 0 
		then 3 bits denote Rd/ccc/Ed
		then 3 bits denote Rs/Es
		then 5 bits denote opcode
	
	
	I-format:
		AAAAABBB IIIIIIII
	
		from MSb to LSb
		     5 bits denote opcode
		then 3 bits denote Rd/ccc/Ed
		then 8 bits denote imm8
		
for example, mov R1, R2 is encoded as 00000 001 010 00101
mov R1, 0xAB is encoded as 00101 001 10101011
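
to make the encoding concrete, a minimal C++ sketch (a model of the format, not code from the actual assembler):

	#include <cstdint>

	// R-format: 00000 BBB CCC DDDDD -- opcode in the low 5 bits
	// I-format: AAAAA BBB IIIIIIII  -- opcode in the high 5 bits
	uint16_t encode_rr(uint8_t opcode, uint8_t rd, uint8_t rs)
	{
		return ((rd & 0x7) << 8) | ((rs & 0x7) << 5) | (opcode & 0x1F);
	}
	uint16_t encode_ri(uint8_t opcode, uint8_t rd, uint8_t imm8)
	{
		return ((opcode & 0x1F) << 11) | ((rd & 0x7) << 8) | imm8;
	}
	// the first 5 bits being 00000 is the only thing distinguishing the formats
	bool is_rr(uint16_t instr) { return (instr >> 11) == 0; }

	// encode_rr(5, 1, 2)    == 0b0000000101000101, mov R1, R2
	// encode_ri(5, 1, 0xAB) == 0b0010100110101011, mov R1, 0xAB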

Registers

General Purpose

there are 8 general purpose registers, R0 to R7, each 16 bits wide

External

external registers are accessed with rdx/wrx, among them IP (instruction pointer), LR (link register), SP (stack pointer), FL (flags), UI (upper immediate) and the readonly CPU feature flags register

Example Program


#program computing fibonacci number

	mov 	R0, 0
	mov 	R1, 1

	mov 	R7, 7

	cmp 	R7, 0
	jmp 	LE  end

loop: 
	mov 	R2, R0
	mov 	R0, R1
	add 	R1, R2

	sub 	R7, 1
	jmp 	GE  loop

end:
	hlt 	   
		

#UI usage
#R0 = xFF00
	wrx 	UI, 0xFF
	mov 	R0, 0x00
		

SIMPLE IMPLEMENTATION

compared to previous design, this time VHDL code was organized and entire thing was split into subcomponents and files
//unfortunately more complex implementations still have too much in main file, im working on improving design skills to fix it
testbenches were in separate file so it was no longer necessary to reanalyze them every single time any change was made

as for the design, for reasons unknown to me in the present, component was created for every possible logic gate usage
for example, AND of 2 3bit vectors was needed so component gate_and2_3bit was created
clearly this approach does not scale, it is also completely unnecessary as writing a and b, where a and b are bit vectors, is completely valid vhdl
similarly decode /*badly named as control*/ component outputs long list of signals and instead of defining custom type, every signal was created separately within processor again

each cycle instruction is decoded and everything works almost exactly as it is specified in the ISA, small glue code is needed to make it work
overall this is quite straightforward implementation with not many interesting features
there is 1 significant improvement compared to previous VHDL CPU, unified memory was implemented by use of additional cycle_advance control signal
its important to note that this project STILL does not concern itself with reality too much
even though it is more detailed than HLL behavioral implementation, there is no concern with real gate delay, synthesis and putting it to anything physical
though I tried to use something that at least could synthesise, I have absolutely no experience in it and so cannot make good judgments


PIPELINE

omitting description of atrocities mentioned above on a larger scale

the pipeline is very shallow, has only 3 stages
fetch/decode - execute/memory - writeback
compared to classical 5 stage RISC pipeline, stages are simply merged together
thanks to this simplicity, data hazards are significantly reduced and each can be solved by forwarding from deeper stage

control hazards dealt with by simply not acknowledging their existence
as mentioned before, this project does not concern itself with reality so if branch is detected in execute stage then address to fetch is updated
fetch/decode stage then starts fetching from next address and still has time to decode in 1 cycle
//there is some additional code to deal with cal/ret cases but its nothing noteworthy
//assuming very slow clock and very fast memory, it could work though

the same mechanism used to share memory in simple design is used to protect from structural hazard
whenever memory instruction is detected, pipeline is stalled and currently executing instruction accesses memory
this is slightly suboptimal as instructions could proceed anyway, it is enough to insert pipeline bubble
but since this is the only possible source of stall, no performance is ever lost because of it

external registers, despite being a nice idea, are somewhat annoying to implement correctly in all cases
for example, current instruction might be writing to IP so next instruction is invalid, this requires detecting that instruction is WRX and writes to IP
fortunately for simple pipeline and lack of exceptions/interrupts, they are quite easy to do right
each instruction reading from/writing to external register does so in execute stage /*which kind of makes this stage also writeback*/
this guarantees that even if next instruction needs that register, it can be forwarded from execute to fetch stage

there is 1 more problem, self modifying code does not work
only the most recent OOE implementation deals with this so all non-simple designs from now on have this problem


CACHE V0

the cache is 2 way set associative, write through, write allocate
it is completely invisible to the CPU and is implemented as subcomponent of memory
assuming memory access is 10x as expensive as cache access, entire program runs 4x as fast as without cache

delay can be modeled simply as:
check whether data is in cache
if not => long delay
if yes => short delay

implementation does not concern itself with implementing write through, again through wishful thinking it just works
//Cache V1 has much more interesting description

waveform of cache
Full Image

each cycle is denoted by vertical edge (since clock waits in high state)
it is very easy to see loops, when instructions are cached it takes less time to access them
delays within loops are caused by some other memory access

ASSEMBLER

this description is not about old implementation but rather current state since it is only a tool used for main project part
clearly I do not have a lot of experience in compiler development and so some design choices are at least suboptimal but it does its job
as a way to simplify coding it was necessary to implement macros, it is nowhere near macro system of modern assemblers but it did help write more complicated programs

General Operation

first tokenization is performed for each line
tokenizer detects special characters, start and end of various structures, labels, etc
then each token is added to a vector of tokens which is then to be verified and processed

initial processing evaluates and expands macros (see Macros), processes attributes (see Attributes), finds labels and counts instructions
this step also outputs labels to symbols.txt to allow for easier debugging with simulator

then each label is identified and replaced with correct offset
labels MUST be referencable before they are defined to allow any useful structures so replacement has to happen AFTER all labels are identified

then expressions are processed in order (see Expressions)
each expression is inside {} so it is very easy to find them
this step also has to be last because expressions can use labels to calculate offset
for example {array 2 +} to access 2nd entry of array

finally, output is created and verified
this step is really boring and contains no interesting features

Expressions

I did not want to bother with implementation of calculator but it was important to have expression evaluation
therefore the assembler uses postfix which can be evaluated by walking left to right and pushing/popping data onto/from a stack
for example instead of writing 3 * (4 + 5), one writes {4 5 + 3 *}
all binary operations are implemented (except shift for which divide can be used)
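
a minimal C++ sketch of such an evaluator (a model of the idea, the real assembler also substitutes labels first):

	#include <cctype>
	#include <cstdint>
	#include <cstdlib>
	#include <vector>

	// walks left to right, pushing numbers and applying operators to the stack top
	// eval_postfix("4 5 + 3 *") returns 27
	int64_t eval_postfix(const char* p)
	{
		std::vector<int64_t> stack;
		while(*p != '\0')
		{
			if(std::isdigit((unsigned char)*p))
			{
				char* end;
				stack.push_back(std::strtoll(p, &end, 0)); // handles decimal and 0x hex
				p = end;
			}
			else if(!std::isspace((unsigned char)*p))
			{
				const int64_t b = stack.back(); stack.pop_back();
				const int64_t a = stack.back(); stack.pop_back();
				switch(*p++)
				{
					case '+': stack.push_back(a + b); break;
					case '-': stack.push_back(a - b); break;
					case '*': stack.push_back(a * b); break;
					case '/': stack.push_back(a / b); break;
				}
			}
			else
				p++;
		}
		return stack.back();
	}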

Attributes

this feature is still a stub, currently supports only ALIGN which inserts NOPs to align code

Macros

Overview

macros can be created with example syntax:

@macro bne(x, y, z)
	cmp 	_x, _y
	jmp 	LG, _z
@end

@macro halt
	hlt 	LEG
@end

@def count 10

@macro zero
	0
@end

@macro add_one(x) 
	{_x 1 +}
@end

	mov 	R0, @zero
	mov 	R1, count
loop:
	add 	R0, @add_one(0) 
	@bne(R0, R1, loop)

	@halt

		
main limitation is lack of support for nested macro use, due to how they are implemented

Implementation

only single pass is used to implement initial processing so multiline macros have to be defined before they are used
there is no error for using undefined macros /*yet*/
@def macros are very simple to implement by hijacking label system
this also means they dont have to be defined before they are used because labels have additional pass

whenever @macro is encountered, it is simultaneously parsed, verified and added to special vector with macros
after parameters are parsed, entire body is copied and saved (with small modifications to detect parameters)
then whenever macro is referenced it is checked for argument count
(only one macro can have a given name, even though they could be differentiated by different argument count)
if they match, macro's body is pasted and arguments are inserted, code is later evaluated as if it was written normally
one inconvenience is that line number is lost, this is to be fixed
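
a C++ sketch of how a macro body can be stored and pasted (hypothetical names, not the actual assembler code):

	#include <map>
	#include <string>
	#include <vector>

	// body tokens; param >= 0 marks the slot of a macro parameter like _x
	struct Token { std::string text; int param = -1; };
	struct Macro { std::size_t n_params; std::vector<Token> body; };

	std::map<std::string, Macro> macros;

	// pastes the body with arguments inserted, the result is then
	// evaluated as if it was written normally
	std::vector<std::string> expand(const Macro& m, const std::vector<std::string>& args)
	{
		// caller must have verified args.size() == m.n_params
		std::vector<std::string> out;
		for(const Token& t : m.body)
			out.push_back(t.param >= 0 ? args[t.param] : t.text);
		return out;
	}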


SIMULATOR

Overview

the most important feature of simulator is to test whether particular program is written correctly
this is possible because simulator is the simplest implementation of ISA and so is trivial to get correct
each instruction is simulated according to ISA and the simulator does not concern itself with details of any particular implementation
except for gathering enough data to precisely output cycles taken by simple and pipelined implementations
//the most modern ones, described in sections below

additionally with certain options simulator can be used for debugging and detailed performance evaluation
all options descriptions can be obtained with sim -h

Verification

there are 3 options dedicated for verification
-r is mostly used for manual verification
tests heavily rely on -m output to check whether correct output is the same as implementation's output
(see Tests)

warnings verify a few features, currently checking whether each address is aligned and whether SP is back to 0 at the end

Debugging

there are 3 options dedicated to debugging
it rarely makes sense to use -v or -s without -d
while debugging, certain commands can be used, unfortunately they are stubs of what should be available and this is still to be improved
for this reason they are also not documented
fortunately most of the time it is enough to simply see currently executing instruction

verbose output allows seeing instructions in more detail
eg each arithmetic instruction prints its operands' values and result of operation

symbols option uses output from assembler (symbols.txt) which assigns labels to their addresses
main use of this feature is to simplify following of JMP and CAL where destination IP gets replaced with name used in program

Performance Evaluation

there are 4 options dedicated to performance testing
-i is very straightforward and mainly useful to check what should be improved
for example, if most instructions are MOV then it makes sense to see whether that can be improved

-b outputs info of how different branch predictions would perform

-p outputs generally more useful info
-P is a more precise version of -p
the code to check branch predictor is a mess because it updates a lot of data and decreases readability of code
I am still looking for some solution

simulator performance output
Full Image

TESTS

Overview

for any project this size proper testing is a must to make sure implementation is at least somewhat correct
tests are grouped based on what they test
some of these tests came from hours full of blood, swears, tears, swears and swears

main purpose of tests is to check tricky scenarios like self modifying code or dependency chain
as well as big programs which often expose problems not found elsewhere due to their complexity

Implementation

each test has 3 components: code, binary and correct output
code is used only to generate binary
binary is used as input to tested implementation
tested implementation outputs memory dump which is then compared with correct output

clearly this implementation has major flaw, it requires memory to work which is not a given when implementation is new
before memory works somewhat /*which is not that long a time frame*/ it is enough and not too hard to look at waveform

REWORK

ISA Extension

up to this point, lack of multiply instruction caused significant slowdown of EVERYTHING that required multiplication
because mul had to be implemented in SW and so took significantly longer
for example
with SW multiply factorial takes 156 instructions
with HW multiply it takes 86 instructions
since no one uses this ISA it could have been just modified but I decided to go the proper route and add proper extension flags
this required introducing some ISA change but now new features can be added by simply setting a bit in the CPU Feature Flags Register
this register is readonly external register (writing is possible to encode but has no effect)

for now this is the only extension provided, there are some new ones in the next draft which could speed up common operations
not sure when or if they will be added
//writing this as if it was something more than personal project

Simple

overall architecture did not improve drastically /*because, well, there is not a lot to improve*/
main improvements came in fixing stupidities that arose in first implementation
control signals were bundled together and useless constants were removed
and logic gate duplication was removed /*FINALLY*/

Cache

removed code duplication so now cache became single file that can be easily swapped with RAM and integrated into pretty much any implementation
overall architecture did not change /*still wishful thinking and not implementing write buffer*/

Pipeline

similar improvements were added to pipeline implementation as to the simple one (see Simple Rework)

pipeline now has 4 stages
  1. fetch
  2. decode
  3. execute/memory
  4. writeback
as before, some writeback also happens in execute stage
whenever stage 3 (execute) detects that bubbles should be inserted (see reasons below) then instruction in decode is overwritten with NOPs
every potential hazard that modifies any ER (in particular, IP) has been resolved by now and instructions are fetched from correct IP
//this part is still not quite reasonable unless one assumes that memory and branch detection are fast
//and former happens in second clock cycle half and latter happens in first clock cycle half

bubbles (NOPs) can be inserted for 2 reasons
  1. branch taken, since each branch is predicted not taken (much simpler implementation)
  2. write to external register
branch predict NT is generally speaking a mistake that simplifies implementation at the cost of performance
similarly the second reason should not always cause bubbles, current OOE implementation (see OOEV2) covers the WRX UI case
said implementation features RAS which removes the need to implement LR fast
since pipeline does not have it, each procedure call induces SIGNIFICANT cost
it is usual to see same code take 1.3x as many cycles on pipeline as on simple implementation
(of course, 1.3x the cycles is still much faster if pipeline's clock period is 1/3 of simple's, but it could be better with even simple BP)

Aside: Branch Delay Slot

branch delay slot is not a solution, it is terrible idea that introduces subtle complexities
see https://devblogs.microsoft.com/oldnewthing/20180416-00/?p=98515

it exposes microarchitectural details that may not be (and probably are not) relevant later
MIPS R10000 is out of order processor that has to support branch delay slot even though it is no longer relevant, quote:
In a pipelined scalar processor, this delay slot instruction can be executed for free, while the target instruction is read from the cache.
This technique improved branch efficiency in early RISC microprocessors. 
For a superscalar design, however, it has no performance advantage, but we retained the feature in the R10000 for compatibility

earlier example: R4000 is 8 stage pipeline with branch in 4th stage (counting from 1)
aside within aside: despite first MIPS having execute as 3rd stage, branch target is generated in 2nd half of 2nd stage
branch condition is checked in 1st half of 3rd stage and i$ is accessed in 2nd half of 1st stage
so, in the same cycle, branch is checked and i$ is accessed using previously generated address
//I guess that my design is not so unrealistic then
end of aside within aside
but this processor still supports 1 instruction branch delay slot because that is how the ISA specified things
even though in this microarchitecture it makes no sense

similar but significantly less annoying problem happened to ARM with its IP
essentially at each point the readable IP is the IP of the current instruction + 2 instructions because of the 3 stage pipeline
see https://stackoverflow.com/questions/24091566/why-does-the-arm-pc-register-point-to-the-instruction-after-the-next-one-to-be-e

OUT OF ORDER EXECUTION V0

//note: OOE implementations could really benefit from pipeline but for simplicity it was omitted

First Try

in general it is worthwhile to implement something knowing only how that something is supposed to work
this allows seeing why something is done in some way /*and maybe more importantly, why something is not done in some way*/
at that time I only knew that OOE CPUs use some kind of buffer and dispatch instructions to execution ports
additionally since each instruction takes only 1 cycle ive decided to go superscalar as well

the main idea is to fetch instructions and if an instruction can be executed, execute it
if not then it is put away to buffer
this works quite differently than implementations that rely on ROB where instruction is fetched to buffer either way
whenever only 1 currently fetched instruction executes, get one from buffer that by this time is free to execute
whenever buffer has 2 entries, fetch 2 instructions from buffer
in case they have dependency on each other, fetch only first
constantly check dependency on buffered instructions
I dont remember the exact reason why this approach was scrapped, it was because something could not possibly be implemented using buffer and memory at once
unfortunately it is not documented exactly anywhere and the code is long gone

I definitely want to revisit this idea because it is different than current implementations and sounds interesting
important thing to consider is implementation of precise interrupts/exceptions because there is nothing that can invalidate instructions

diagram I drew then to show how instructions are moved around:
Y means fetch from memory while N indicates that data was fetched from buffer only

Full Image

Second Try

Implementation

this time I decided to see how OOE processors generally work and ive seen the term reorder buffer
to not delve into details and spoil the fun ive implemented /*without knowing terminology for it*/ collapsing queue
the obvious downside, visible on the picture, is HUGE logic that is required to determine dependences
dependency checking is also quite complex and has to check for pretty much every ordering constraint

	each possible execution hazard for X and Y:
		1. Y writes to register used by X  
		2. X/Y is HLT 
		3. X is WRM and Y is WRM/RDM
		4. Y is first and Y is RDM/WRM and X is RDM
			two WRM/RDM instructions cant execute at once 
			case when X is WRM is covered by 3 and 4 
		5. Y is first and Y is MUL and x is MUL 
			two MUL cant execute at once
		6. X/Y is jmp/cal/ret/wrx/rdx
			this is more restrictive than neccessary
			however this simplifies circuit significantly 
			and the rdx/wrx are not frequent enough 
			 for this restriction to have significant (if any) impact on performance
	 	7. X modifies register and Y is WRM and writes that register
		
//above snippet is slightly modified comment from code
comment on frequency of RDX/WRX is wrong, for example g05t05.asm executes 245 WRX compared to only 201 RDM
that is, it is more likely that a given instruction will modify external register than read data from memory
of course this highly depends on tested program, but the point is, WRX has to work fast, at least for UI

the CPU works like follows:
find up to 2 instructions such that nothing depends on them (check for dep_XonY)
//obviously always at least 1 instruction can be found, first instruction depends on nothing
fetch and decode instructions from memory
move 1 or 2 instructions from buffer and simultaneously insert up to 2 new instructions (decoded in previous cycle)
due to the way it works, startup (and branch misprediction) delay is 2 cycles

lack of proper branch prediction and quite high branch misprediction penalty made performance much worse than it could be
some speedup was achieved by starting branch execution while still in buffer but that just reduced the penalty
it works by checking if first instruction is branch and that branch is T
(T is determined by having additional copy of separate early flag register)
if so and branch target is known early (uses immediate as destination) then fetch address is changed
note that there is no danger with WRX because that branch MUST be first in buffer
and really it could also be done without limitation on immediate destination but this way it could be done without extending datapath too much
since registers could only be read by execution units

Performance

fibonacci got WORSE from 54 to 58 cycles
main problem is that there is not a lot of parallelism to exploit and branches are very costly
after aligning main loop, performance improved from 54 to 45

factorial got twice as bad as it was on simple implementation
again due to many calls and branches

matrix multiply (after optimizations) executes in around 0.9 time of simple implementation
thats about 400 cycles, compared to 272 max achievable right now (see Benchmarks)

highly advanced mechanism used to see what should happen when:
//yes it's slightly wrong

Full Image

TOOLING IMPROVEMENT

//this is where second period of working on this project starts
//it was quite short
test framework has been implemented (see Tests) so bugs are easier to spot
assembler and simulator (see Assembler see Simulator) were mostly improved around this time period

OUT OF ORDER EXECUTION V1

Overview

this implementation works quite similarly to typical OOE implementation
instructions are fetched to 8 entry ROB which also serves as storage for in-flight data
if instruction has all dependences satisfied and it is first (or second) that can be executed then it is executed
if oldest instruction(s) have completed then they are retired
up to 2 instructions are fetched, executed and retired per cycle

Dependency Management

main problem of previous implementation was very costly dependency check
in principle ROB and register renaming solve this issue
my implementation, however, was so terrible that it was undebuggable mess with many edge cases
it was never fully finished, instead next implementation chose different approach and dep management became trivial
code fragment that is responsible for proper dependency forwarding:

rob0_src0   := (2 DOWNTO 0 => instr0_r0, OTHERS => '0')        WHEN instr0_cf = '1'
  ELSE i0_val                                                  WHEN RAT(rat00_e).in_rf = '1' AND i0_rd = instr0_r0 AND i0_we = '1'
  ELSE i1_val                                                  WHEN RAT(rat00_e).in_rf = '1' AND i1_rd = instr0_r0 AND i1_we = '1'
  ELSE instr0_r0v                                              WHEN RAT(rat00_e).in_rf
  ELSE ROB(to_integer(unsigned(RAT(rat00_e).rob_entry))).src1  WHEN ROB(to_integer(unsigned(RAT(rat00_e).rob_entry))).complete = '1' AND ROB(to_integer(unsigned(RAT(rat00_e).rob_entry))).controls.sro = '1'
  ELSE ROB(to_integer(unsigned(RAT(rat00_e).rob_entry))).value WHEN ROB(to_integer(unsigned(RAT(rat00_e).rob_entry))).complete = '1'
  --please forgive me for this monster line, very bad
  ELSE mul0_res WHEN exe_entry0p = '1' AND ROB(to_integer(unsigned(exe_entry0))).dest = instr0_r0 AND RAT(to_integer(unsigned(instr0_r0))).rob_entry = exe_entry0 AND ROB(to_integer(unsigned(exe_entry0))).controls.mul = '1'
  ELSE mul1_res WHEN exe_entry1p = '1' AND ROB(to_integer(unsigned(exe_entry1))).dest = instr0_r0 AND RAT(to_integer(unsigned(instr0_r0))).rob_entry = exe_entry1 AND ROB(to_integer(unsigned(exe_entry1))).controls.mul = '1'
  ELSE alu0_res WHEN exe_entry0p = '1' AND ROB(to_integer(unsigned(exe_entry0))).dest = instr0_r0 AND RAT(to_integer(unsigned(instr0_r0))).rob_entry = exe_entry0
  ELSE alu1_res WHEN exe_entry1p = '1' AND ROB(to_integer(unsigned(exe_entry1))).dest = instr0_r0 AND RAT(to_integer(unsigned(instr0_r0))).rob_entry = exe_entry1
  ELSE (2 DOWNTO 0 => RAT(rat00_e).rob_entry, OTHERS => '0');
		
slightly different code existed for each possible source, there were 4 sources for operations in single cycle
details of what, why and where are not really important, whats important is that this is stupidly complex and could not work well
//I am certain that it could be done better and work well
//main benefit of tests is that I know that implementation does not work, but that does not mean I have to fix it
//also, vhdl's verbosity is the first time I have questioned strong typing
//I have solved it by creating additional signals that are all generated automatically, but still

Branch Prediction

a simple 2bc branch predictor (and a not working RAS) were implemented here
there are not many differences between it and current design except current is more robust and not broken
(see Branch Prediction)

Performance Comparison

despite being broken, some tests passed and so performance could be compared at least somewhat
matrix multiply improved from 485 cycles on simple implementation to 276 on OOE implementation
//I have honestly no clue why it is so fast, I need to delve into that
//seriously, half working implementation has almost better running time than current one? what happened

CHIP DESIGN

//this is where third period of working on this project starts
in previous designs everything was sort of inserted into the CPU, while this time the CPU is part of a system
this provides much better ability to extend it with more memory chips, IO or multiprocessing
//its very unlikely any of these features will be implemented in this project
this allows all components to work independently of each other knowing just BUS interface and using same clock
this goal is much harder to achieve than I thought
cache implementations do require modifications to how data is sent to/from memory from/to CPU
especially write back
additionally simple and pipeline implementations were rewritten again but there is nothing interesting about that

CACHE V1

Overview

given how the system works, cache must work much more realistically
there are 2 implementations of cache, write-through and write-back

both work with simple and pipelined implementation and could work with Out Of Order one
but the problem is that cache operates on 2B chunks and OoO implementation has 4B wide memory
//I really dont want to make cache more general at this point
//it is not fully obvious how to make it work for bigger sizes too
//consider 2B write (as architecture specifies) to 4B memory location (as implementation works)
//should it be written partially? that requires partial hit support
//should full line be first fetched? that will degrade performance
//should memory operate in 2B chunks? that will SEVERELY degrade performance
//much better memory management is something I want to work on since there are many possible places to extract performance

System Interaction

waiting for data is implemented by chip stalling clock to a given component
it could /*maybe should*/ be implemented with each component having internal wait state but it works quite well

Write Through

Read:
  1. CPU sends request to read address
  2. cache checks said address
  3. if address is contained by cache, request is satisfied
  4. else, request is sent to memory
  5. data comes back, is sent to both CPU and cache
Write:
  1. CPU sends request to write address
  2. request is sent both to cache and to memory
  3. if address is present in cache, it is updated
  4. CPU returns back to execution

Write Back

Read:
  1. CPU sends request to read address
  2. cache checks said address
  3. if address is contained by cache, request is satisfied
  4. else, if something has to be evicted, then it is evicted first
    eviction process looks like regular write from CPU with no cache, except CPU is completely stalled and request is sent from cache
  5. conflicting entry is no longer present in cache and proper one is fetched from memory
  6. data comes back, is sent to both CPU and cache
Write:
  1. CPU sends request to write to address
  2. cache checks said address
  3. if address is contained by cache, request is fully satisfied
  4. else, if something has to be evicted, then it is evicted first
    eviction process looks like regular write from CPU with no cache, except CPU is completely stalled and request is sent from cache
  5. conflicting entry is no longer present in cache and proper one is written to cache
note that read can potentially cause 2 memory accesses (eviction then fetch) while write can cause at most 1 (eviction)

Inner Workings

none of implementation details can be observed by CPU, except by measuring time to satisfy requests

Write Through

the cache is write allocate, DM and has 128 entries, each containing 2B of data

on read, address is compared against stored tag
if they are equal, data is returned
else miss is asserted

on write, data is always written and entry is updated accordingly

Write Back

the cache is write allocate, 2 way SA and has 256 entries, each containing 2B of data

on read, cache compares tags of 2 entries
if hit occurred, return data and set LRU to other entry
if there is a read miss, then entry to evict (denoted by LRU) is sent to memory and removed from cache
after that there is free entry to which data from memory (requested by CPU) is inserted
at the end LRU bit is set to point to OTHER entry

writes work exactly like reads except after a miss and potential write to memory, data is supplied by CPU
note that read causes at most 2 memory accesses while write causes at most 1

obvious and common optimization is to include dirty bit to evict entry only when necessary
in fact, the implementation does exactly this
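
a C++ sketch of the read path described above (a model of the idea, not the VHDL; 2B entries, aligned 16 bit addresses):

	#include <cstdint>

	uint8_t  ram[1 << 16];
	uint16_t mem_read16 (uint16_t a)             { return ram[a] | (ram[a + 1] << 8); }
	void     mem_write16(uint16_t a, uint16_t v) { ram[a] = v; ram[a + 1] = v >> 8; }

	struct Entry { uint16_t tag; uint16_t data; bool valid; bool dirty; };
	struct Set   { Entry way[2]; int lru; };      // lru = way to evict next
	Set sets[128];                                // 128 sets * 2 ways = 256 entries

	uint16_t cache_read(uint16_t addr)
	{
		const int      idx = (addr >> 1) & 0x7F;  // addr bits 7..1
		const uint16_t tag =  addr >> 8;          // addr bits 15..8
		Set& s = sets[idx];
		for(int w = 0; w < 2; w++)
			if(s.way[w].valid && s.way[w].tag == tag)
			{
				s.lru = 1 - w;                    // hit: the OTHER way becomes LRU
				return s.way[w].data;
			}
		Entry& v = s.way[s.lru];                  // miss: maybe evict, then fill
		if(v.valid && v.dirty)                    // dirty eviction = 1st memory access
			mem_write16((v.tag << 8) | (idx << 1), v.data);
		v = {tag, mem_read16(addr), true, false}; // fill = 2nd memory access
		s.lru = 1 - s.lru;
		return v.data;
	}

a write would differ only in the last step, the entry is filled with CPU data and marked dirty instead of being fetched from memory, hence at most 1 access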

one problem came up because tests (see Tests) work by dumping memory contents
with write back cache, by design, memory does not contain most recent copy
therefore there is no guarantee /*and indeed I wasted some time debugging code that worked*/ that on halt correct value is dumped
the solution copies memory content to a buffer (simulation only)
then checks cache contents and changes values of that buffer wherever cache contains newer copy

//by the way, the vhdl support sucks
//feature required to implement this check is from VHDL-2008 and (during 2023-12) ghdl did not support it in backend that I was using
//this by no means criticizes ghdl, I switched backend and it worked, I have encountered few compiler crashes on the way and they were fixed very quickly
//this issue is with pretty much every tool, almost every recent-ish post on SO talks about VHDL-2008 even though the most recent version is VHDL-2019
//VHDL-2019 support is pretty much non existent

Performance

write through cache waveform is mostly the same as previously

write back cache waveform
Full Image
notice that:
  1. around 0ns there are no cache misses and so cpu is stalled only for single cycle when accessing memory
    this is because initially instructions are loaded into cache
  2. around 100ns cache misses occur but they do not cause eviction
    therefore theres a single memory access
  3. around 200ns misses that cause eviction appear
    first they do some internal work, then access memory TWICE

with write back cache, mat mult takes 879 cycles
without mat mult takes 4363 cycles
in other words, cache provides near 5x speedup

OUT OF ORDER EXECUTION V2

Overview

out of order 2 way superscalar CPU with internal cache, branch prediction, and instruction elimination

Physical Registers

in this implementation data is not stored in ROB but instead ROB holds pointers to data
managing dependences now is as simple as managing Physical Register pointers
additionally, less data has to be moved each time an operation is to be made, but this is not big concern for simulation
entire dependency management now (for one execution port):

  FOR i IN 0 TO par_rob_size - 1 LOOP 
    rob(i).prfs0_p <= '1' WHEN rob(i).present = '1' AND rob(i).prfs0_id = eu0.rd AND eu0.signals.rwr = '1' ELSE UNAFFECTED;
    rob(i).prfs1_p <= '1' WHEN rob(i).present = '1' AND rob(i).prfs1_id = eu0.rd AND eu0.signals.rwr = '1' ELSE UNAFFECTED;
  END LOOP; 
  prf_present(to_integer(unsigned(eu0.rd))) <= '1' WHEN eu0.signals.rwr ELSE UNAFFECTED;
		
inserting data into ROB is as simple as setting correct PR id and present bit

to simplify implementation, 1 PR is provided for each ROB entry but this is not necessary
in fact this is pretty wasteful as it is very unlikely that every instruction writes to register
for example Intel haswell has 192 entry ROB and 168 entry PRF
//haswell has mov elimination which makes it need even less PRs

however PRF is much less convenient to debug, after all the entire point is that at one point values from 1 AR can be in many PRs
since now there is no ARF and it is really useful to check actual values, CPU contains simulation only signals that read current mapping
these mappings are then shown in waveform in addition to PRF so it is straightforward to see current mapping and check whether something goes wrong

PR sharing allows to reuse same data without copying it to another register
each PR has associated counter that keeps track how many instructions point to it
see Mov Elimination for more details

Instruction Elimination

NOP

just not add it to ROB

UI

Upper Immediate is absolutely crucial to handle fast, as anytime data requires more than 8 bits UI is used
for example when there are more than 256 instructions, jump dest will require upper immediate
//lack of relative jumps is major shortcoming of ISA
most of the time wrx UI, imm8 instructions fall into category of ignored ones so they dont need to execute at all
whenever second operand is not immediate but register, execution is serialized
I have not yet encountered a single real program use for the register variant
//it could replace mov R0, R1 ; sll R0, 8 ; orr R0, imm8 with wrx UI, R1 ; mov R0, imm8
//but this is too unlikely to justify higher complexity

there are two cases
1. wrx UI, imm8 is aligned
UI value is directly accessible for next instruction to use so this immediate is merged with next one
2. wrx UI is not aligned
value has to wait until next instruction is fetched so data is simply stored for some number of cycles
either way, wrx UI is not even inserted into ROB
this optimization is possible since UI is cleared after the consecutive instruction anyway (as defined by ISA)
so under no circumstances does wrx UI have observable effects after the following instruction
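
a C++ sketch of the aligned case (hypothetical names, a model of the idea):

	#include <cstdint>

	// the wrx UI, imm8 itself never enters the ROB,
	// its immediate is merged into the following instruction instead
	uint8_t ui_upper   = 0;
	bool    ui_pending = false;

	void capture_ui(uint8_t imm8)     // called instead of executing wrx UI, imm8
	{
		ui_upper   = imm8;
		ui_pending = true;
	}

	uint16_t merge_imm(uint8_t imm8)  // called when decoding the following instruction
	{
		const uint16_t imm = ui_pending ? (uint16_t(ui_upper) << 8) | imm8
		                                : imm8;
		ui_pending = false;           // ISA clears UI after the next instruction anyway
		return imm;
	}

	// capture_ui(0xFF); merge_imm(0x00) yields 0xFF00, matching the UI usage example in the ISA section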

Mov Override

whenever instructions of the form mov R0, R1 ; op R0, R2 are fetched aligned
they are internally replaced with op R0, R1, R2
this mechanism skips mov elimination (see below)
//I am honestly surprised that it worked first time with mov elimination, I thought that it will be major pain

there is no need to consider only aligned instructions but fetch buffer is only 2 instructions wide
even then, in simulator there are 492 cases of mov R, R, around 150 of which can use this optimization
//quite significant drawback of 2 address ISA

Mov Elimination

instructions mov Rd, Rs are executed in frontend by the renamer and are NOT sent to any execution units
this is slightly more tricky than it initially appears but most of mechanism revolves around making sure proper registers are allocated/freed
especially in superscalar implementation where many instructions can be fetched/retired at once

each PR has associated counter that keeps track how many ROB and RAT entries point to it
each cycle between 0 and 4 instructions need to allocate/deallocate between 0 and 4 registers
//it is possible that 4 need to access 1 counter
this is managed by 4 separate adders, nth adder checks how many instructions access nth register
if many instructions write to same PR then they are simply redundant
then writes are prioritized from 0th to 3rd adder to not write from 2 sources to same destination, even though the result would be the same

for example, assume that PR0 has to be deallocated and PR0 and PR1 have to be allocated
adder0 will see that PR0 has to be deallocated 1 time and allocated 1 so it will not change the value of PR0's counter
adder1 will see nothing since only 1 PR is deallocated
adder2 will see the same thing as adder0 but its result does not matter since adder0 gets priority
adder3 will see that PR1 has to be allocated 1 time so it will increase PR1's counter by 1

Register Free List operation changes slightly
instead of new instruction => allocate and pr overwritten => deallocate
whenever value of counter changes FROM 0, register is taken from RFL
whenever value of counter changes TO 0, register is put back onto RFL
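
a C++ sketch of the counter/RFL interaction (one reference at a time; the real design batches up to 4 changes per cycle with the adders described above):

	#include <cstdint>
	#include <vector>

	uint8_t              refcount[16]; // how many RAT/ROB entries point at each PR
	std::vector<uint8_t> rfl;          // register free list, initially holds every PR

	uint8_t alloc_pr()                 // counter goes FROM 0: take from RFL
	{
		const uint8_t pr = rfl.back();
		rfl.pop_back();
		refcount[pr] = 1;
		return pr;
	}

	void add_ref(uint8_t pr)           // eg an eliminated mov shares the PR
	{
		refcount[pr]++;
	}

	void drop_ref(uint8_t pr)          // counter goes TO 0: put back on RFL
	{
		if(--refcount[pr] == 0)
			rfl.push_back(pr);
	}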

reference counting makes flushes more expensive since without it, it is enough to restore RFL pointer and committed RAT
here, it is necessary to checkpoint state that is also updated on commit (or walk through all instructions in RAT but that is slow)
otherwise, before that checkpoint state is used, some instruction may have committed and decremented counter
then on checkpoint restore, that counter is brought back to previous value, overriding mentioned decrement and putting that PR into unfreeable state

this can be solved by keeping separate counters for deallocate and allocate, checkpointing only allocate
instead, since flush can be caused only by retiring instruction
I have decided to simply keep additional committed copy of every PR counter that gets restored on flush
//quite expensive but simple and works
//full ref counting is expensive either way, real CPUs use separate sets for keeping track of ref counted PRs
//and most PRs do not get such set, because there is no need to

other than that, it is of course necessary to copy RAT mapping
//most of the stuff just works really, as long as RFL and counters behave properly

overall the speedup is not as major as I thought it would be
simulator despite having 492 mov R, R, gets only 12 cycle speedup (1935 => 1923)
on the other hand, execution ports' utilization dropped significantly
port0 from 1418 downto 1238 (cycles active)
port1 from 974 downto 874
and the execution goes fast enough to make 16 entry ROB full, increasing the size further does not improve performance

//Lack of speedup is caused simply by execution resources not being a bottleneck
//small improvement to external register management gained about twice as much
//not described anywhere else because its boring, simple check whether RDX can be executed cycle earlier
//for this particular test, later improvement in indirect branch handling (see Branch Prediction) improved performance by nearly 500 cycles
//not a typo

Flush Mechanism

each time a branch is mispredicted, a write to an instruction occurs or a UI value could not be retrieved, a flush occurs
all instructions are removed from the ROB and execution restarts at first known point, that is after the instruction that caused the flush
or branch destination for mispredicted branches
flush can only happen because of retiring instruction
after flush, RAT is restored to committed state and ROB pointers are restored to initial entry
note that PRF is NOT changed, there is no need to as architectural state is held by RAT
there is also no need to restore register free list as it contains only registers that are free to use

this is (or should be) very well tested mechanism without any flaws
because if something is flushed when it shouldnt be or not flushed when it should, illusion of sequential execution is broken and all bets are off

//previous iteration of this webpage contained quite detailed explanation of why something went wrong
//that explanation was somewhat correct but the root issue was different
//it is available on webarchive
//the actual problem was caused by SMC detection too but I honestly have no clue why it broke in that particular way
//in short, the issue was that IP was not properly hashed into bloom filter
//and this somehow caused instruction to reexecute instead of fetching next one

this works but obviously has a flaw that any flush can happen only during retirement, and moreover it has to be detectable during retirement
with current implementation it is fully possible to do so but in general it is not enough and flush (exception) info should be stored within rob entry
for example, SMC detection can be triggered by any memory write so if memory writes are in any way speculative and executed before retirement then it wont work
//writes can be sent to separate write buffer before commitment to memory so early execution is possible
much more significant problem is lack of ability to resolve branch before it is retiring
for 8 entry ROB this is not a big deal but for larger ones it starts to be problematic
this requires to do much better management of PRs to be able to unwind translation
or store checkpoint state like MIPS R10000 or BOB on skylake
//how many references to same paper are too many?

Self Modifying Code Detection

the very annoying yet necessary thing to protect execution against
each time an instruction is fetched, its address is stored into bloom filter
specifically: entries corresponding to {low 8bits ; high 8bits ; middle 8bits} are all marked with '1'
then each time any sort of write is performed, it is checked whether bloom filter contains that particular address
if it does then machine flush is performed and currently problematic instruction reexecutes

importantly if there IS write to instruction then it IS detected
converse is not true, its relatively easy to construct addresses that are different but map to the same entries
for example, with current hashing x0500 and x0050 map to the same entry
as long as it happens rarely enough, it's fine because of significant space reduction
64kib (1 bit for each address) is compressed down to 256b
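
a C++ sketch of the scheme (a model of the idea, not the VHDL):

	#include <cstdint>

	// 256 entries, 3 hashes per 16 bit address => 64kib of presence bits in 256b
	bool bf[256];

	void hash3(uint16_t addr, uint8_t h[3])
	{
		h[0] =  addr       & 0xFF; // low    8 bits
		h[1] =  addr >> 8;         // high   8 bits
		h[2] = (addr >> 4) & 0xFF; // middle 8 bits
	}

	void mark_fetched(uint16_t addr)      // on every instruction fetch
	{
		uint8_t h[3]; hash3(addr, h);
		bf[h[0]] = bf[h[1]] = bf[h[2]] = true;
	}

	bool maybe_instruction(uint16_t addr) // on every memory write
	{
		uint8_t h[3]; hash3(addr, h);
		return bf[h[0]] && bf[h[1]] && bf[h[2]]; // true => flush, possibly a false positive
	}

	// x0500 hashes to {x00, x05, x50} and x0050 to {x50, x00, x05}: same 3 entries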

//this feature does not quite work right now since flush is specific to that feature instead of reusing branch flush,
//this has happened so many times that I have to rewrite entire flush mechanism

another solution is to have 2 separate bit vectors, each with 256b
first stores whether high part of address appeared somewhere and second stores whether low part of address appeared somewhere
this way false positive requires write whose high and low parts both appeared somewhere but not at once, this is less likely
especially because code is usually at x00XX and data is usually stored at higher addresses
(either because of stack or simply convenience)
//I have not tested precisely how well it works but it seems to be better

Internal Instruction Cache

Overview

external cache requires a lot more time to access, even if it is much faster than memory
internal cache can be wired independently of memory bus so code being supplied from cache makes memory bus free
so instructions that use memory can do so whenever they need without fighting with processor fetching next instructions

Implementation

I$ is implemented simply as DM cache with space for up to 64 instructions (assuming ideal distribution and alignment)
each time instructions are fetched they are added to that cache
since this cache is assumed read only (if that assumption is violated, see below) there is no problem of writethrough/writeback

Self Modifying Code

such cache only makes SMC worse (see Self Modifying Code Detection) as not only ROB but every instruction that was ever fetched has to be flushed too
it is possible to augment SMC protection such that cache flush is not necessary
if write to instruction is detected from BF then flush that particular line from cache (if it is present)
since it is possible that write happens to instruction that was fetched before BF flush
there should also be additional check against cache content
for example, instruction at index i caused flush of BF, instruction at index i + 2 is still in cache but instruction i + 1 writes to i + 2
this should improve performance against false positive SMC detection
//this smarter handling is not implemented so this is only speculation
//but PLEASE do not write SMC

Performance

matrix multiply without i$ takes 321 cycles
with i$ it takes 313 cycles
//truly incredible!
i$ waveform
Full Image
main gain comes initially, notice that a bus is often yellow, that indicates it is unused

Memory Operations

Motivation

//for any multiprocessor design there should be some mechanism to keep memory consistent
//memory can be defined to not have to be consistent but thats such a boring answer to the problem
having to execute memory operations strictly in order needlessly slows down execution, especially for loads
main problem comes from long dependency chains where even though load can execute safely it has to wait for previous mem operations
additional problem with this particular implementation is that when memory operations are in order they have to be executed as first-to-retire
this slows down retirement as retirement can happen only the following cycle so constant retirement cannot be sustained
however rob is usually not full and so at some point that 1 cycle loss usually disappears

Speculation

quite simple and effective method utilizing bloom filter similarly to SMC protection (see Self Modifying Code Detection)
it works as follows:
execute loads and stores as soon as possible, marking written addresses in bloom filter
whenever operation hits in BF then it is quite likely that ordering violation happened, so flush

of course this is suboptimal when data is often read back from memory soon after being written
but for compute programs that rarely use memory for data, this is great solution
see Performance below for evaluation
//numbers there may be slightly wrong as bloom filter implementation had to be ported from older version of code
//but it shows the potential, especially on memory heavy programs

Load Store Queue

a different approach is to keep track of which loads can execute out of order
to do this, it is necessary to keep track of all memory operations as there may be a preceding store
there is still speculation as loads are executed out of order and it may happen that this particular address was bogus
it is not a problem as in this particular implementation memory reads do not leave architectural side effects

when instructions get added into ROB and they use memory they are also added into LSQ
//this introduces another stall condition - full LSQ - but it is usually not significant
write is just added but read has to check for dependences
this is achieved by keeping bit array for each entry, if instruction X depends on Y then Yth entry in X's bit array is 1, otherwise it is 0
this is not speculative so if Y is store and its address is unknown, X depends on Y
when store address is known then its dependents are updated and dependence is either kept or removed, depending on address
similarly if X is load and its address is unknown, X depends on Y
if X has no more dependences, it can be executed by execution unit
//my implementation introduced another execution unit to not have too big slowdown, but it is not strictly necessary
loads never depend on loads and stores never depend on stores

so memory operation goes through execution as follows:
fetch - insert into ROB - executed by putting address (possibly data) into LSQ - executed by sending to memory
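
a C++ sketch of the dependence bit arrays (a model of the idea; entries indexed in program order, retirement/removal omitted):

	#include <cstdint>

	struct LsqEntry
	{
		bool     is_store;
		bool     addr_known;
		uint16_t addr;
		uint32_t deps; // bit y set => this entry still waits on entry y
	};
	LsqEntry lsq[8];
	int      lsq_count;

	void insert_load(int x)            // conservatively depend on every older store
	{
		for(int y = 0; y < x; y++)
			if(lsq[y].is_store)
				lsq[x].deps |= 1u << y;
	}

	void store_addr_resolved(int y, uint16_t addr)
	{
		lsq[y].addr       = addr;
		lsq[y].addr_known = true;
		for(int x = y + 1; x < lsq_count; x++) // younger loads with a known, different
			if(!lsq[x].is_store && lsq[x].addr_known && lsq[x].addr != addr)
				lsq[x].deps &= ~(1u << y);     // address drop the dependence
	}

	bool can_execute(int x) { return lsq[x].deps == 0; } // no deps left => send to memory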

small optimization happens for pop and psh since their address is almost always known early
whenever SP is modified with wrx, flush occurs since it invalidates speculation on stack
internally CPU keeps its own SP that is updated and used in the front end, this allows filling LSQ entries for stack ops on allocation
this reduces conservative dependences and for psh removes the need to be executed twice
//x86 CPUs do this too and are much smarter, they can handle modifying SP with 1 stack sync uop

it is possible to separate filling of address and data of store, this may be beneficial when address is known much earlier
it may not be that beneficial here because it would make stores much more time consuming, hard to guess, so it is not implemented
//thats what modern x86 CPUs do by splitting stores into 2 uops, 1 for address generation and 1 for data

in a case when load depends on store and store data is known, then data is forwarded from store to load
this makes said load not have to go to memory which reduces contention on bus
see Performance below for evaluation
//"evaluation" also known as "disappointment"

LSQ itself can also speculate on dependences like previous approach but flushes only on definite violation
unlike the above approach that flushes when violation is merely likely
this is implemented to some degree
each read goes as soon as it can and if it is speculative it just fills its own data field
if that speculation is successful then on load finish memory does not have to be accessed
if it was incorrect then some write will overwrite data retrieved from memory
in other words, there is no need to flush
see Performance below for evaluation
//actually it is somewhat broken but the potential gain is very small so I did not bother too much with it

another optimization is store->load forwarding which works by supplying store data to dependent load
and yet another one is to coalesce reads that access same 4B block since memory supplies entire such chunk

Performance

simulator (in asm):
baseline : 2009 cycles
speculation: 1903 cycles
lsq : 1947 cycles

matrix multiply:
baseline : 313 cycles
speculation: 272 cycles
lsq : 271 cycles

clearly each method provides speedup over lack of it however it appears that speculation is generally better than LSQ
massive downside of LSQ implementation is 2 cycles/execution of most memory operations
which is probably the reason why g05t05 benefits more from prediction
but for memory heavy program like matrix multiply, LSQ is slightly better which is great

for LSQ it seems that store load forwarding, speculation and read coalescing (see above) do not improve performance meaningfully
mostly because there is simply not many occasions when they can be activated
coalescing however probably fails because as soon as load gets its address into LSQ then it is usually executed straight away
so there is no way to satisfy more than 1 request at once

these benchmarks are not a lower bound for performance of memory bound programs, a write cache can be implemented
essentially, store last N committed writes and forward data from them to reads
for example, 4 entry cache reduced execution time of simulator from 1947 down to 1935 cycles and for mat mult from 271 down to 266
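
a C++ sketch of such a write cache (a model of the idea, hypothetical names):

	#include <cstdint>

	// remembers the last N retired stores, newest entry wins on overlap
	const int N = 4;
	struct WEntry { uint16_t addr; uint16_t data; bool valid; };
	WEntry wc[N];
	int    next_slot;

	void on_store_retire(uint16_t addr, uint16_t data)
	{
		wc[next_slot] = {addr, data, true}; // simple FIFO replacement
		next_slot = (next_slot + 1) % N;
	}

	bool forward_to_load(uint16_t addr, uint16_t& data)
	{
		for(int i = 0; i < N; i++)          // scan from newest to oldest
		{
			const int s = (next_slot - 1 - i + N) % N;
			if(wc[s].valid && wc[s].addr == addr)
			{
				data = wc[s].data;          // hit: memory bus is not touched
				return true;
			}
		}
		return false;                       // miss: read memory as usual
	}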

I dont know the definitive reason why write cache improves performance but SL forwarding does not
my hypothesis is that the execution window is simply too small to see reuse of data

Performance Bugs

I-Cache Slowdown

I$ caused performance of matrix multiplication to be worse than without it
it turned out SMC protection (see Self Modifying Code Detection) caused this degradation
since I$ allows instructions to be loaded faster, more addresses are hashed into BF
pure chance caused them to conflict more often with current writes
since there were more conflicts, there were more flushes which are quite expensive
with SMC protection disabled, performance improved by few cycles
//underwhelming but better
//it turned out cache was implemented wrongly and later it was fixed

Bigger ROB But Worse Execution

bigger ROB should allow for more instructions to be in flight
and thus improve performance because there should be more instructions to extract parallelism from
especially with i$ which should reduce the need to have better memory bandwidth
it turned out however that after extending ROB from 8 to 16 entries, performance was worse

the problem was execution priority
arithmetic instructions had higher priority than memory operations
(even if memory operation was to-be-retired instruction that blocked retirement)
it is less of a problem with 8 entry ROB where it can cause at most 4 cycles (2 execution units, 2 instr/cycle)
but with 16 entries there can be twice as large slowdown
in practice such slowdown is VERY unlikely but smaller one really did happen
especially in the larger programs that execute for more time

(tested on some program, don't remember which)
originally performance (with I$) degraded from 2105 => 2132
after the fix, baseline improved to 1980 and the bigger ROB did not improve performance
//lack of improvement is still annoying but it highly depends on the particular program being run
//for matrix multiplication a larger ROB makes the program run in 0.9 * the time of the 8 entry ROB

Bigger ROB But Worse Execution, Again

//identifying these bugs is annoying because my gtkwave setup has only 8 entries of ROB
//and with more entries there are MANY waves and it's getting hard to scroll
//also I almost ran out of memory because debugging requires having 2 gtkwave windows open at once
//then tracing each point to identify when the bigger ROB gets slower

this one was particularly annoying because there was no obvious mistake, and indeed there was no mistake at all in the implementation
the particular program tested was g05t05 and exec time rose from 1947 to 1962 cycles
other programs showed similar results on different scales (sometimes none)

as it turns out, due to incredibly bad luck, memory operations get scheduled EXACTLY on an indirect branch which ALWAYS misses
then in some steady state this code executes over and over again, causing the situation to repeat
//a BTB would not really solve the issue here as it's very unlikely that the same direction repeats in this case
now the flush has a 2 or 3 cycle penalty depending on whether the memory bus is currently occupied
if the bus is occupied then the proper address cannot be put on the bus and is put there the next cycle
if the bus is free then the proper address is put there right away and fetching resumes the next cycle
so if the performance really was worse because of this, then stopping memory from using the bus would improve it

this hypothesis was quickly tested with

	FOR i IN 0 TO par_lsq_size - 1 LOOP
	--checks whether to-be-retired entry is branch
		IF rob(rob_SE_i(0)).branch THEN 
			EXIT;
		END IF;
	[...]
  		
and later improved to

	FOR i IN 0 TO par_lsq_size - 1 LOOP
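	--block memory operations only in the cycle an actual flush happens,
	--instead of whenever any branch is about to retire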
		IF flush THEN 
			EXIT;
		END IF;
	[...]
  		
initially performance IMPROVED by 4 cycles even though EVERY branch blocked memory operations
with the latter fix, performance is the same regardless of ROB size

still underwhelming, but if the ROB is not the bottleneck then not much can be gained by increasing its size

BRANCH PREDICTION

//I'm considering turning part of this section into an aside

a well working BP is essential for any meaningful performance gain
//check for example https://danluu.com/branch-prediction/ to see how many options there are
//that list is by no means exhaustive
determining the best BP to use is quite a complicated problem and my use of small programs does not fully address the complexities
the simulator implements a few branch predictors but they occupy most of the code and drastically reduce readability (see Simulator)
//also I'm fairly sure that the tournament predictor is broken but I never debugged it
//debugging BP is tricky since a properly implemented BP is always correct
//in the sense that even if it has 0% accuracy it can never cause wrong execution
//AFAIK debugging it means either single stepping to make sure it works as intended
//or generating some test that should perform with a certain accuracy
out of the ones implemented, usually 2BC start T performs the best
therefore that's what the CPU implements for branch prediction
although with quicksort before optimizations, 2bc history performed better
//the current framework also has the problem that all history predictors have the same history length
//but global history requires more bits to operate as well as local history
//that is, more bits per history register, but fewer bits in total
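
as I read it, 2BC start T is a table of 2 bit saturating counters initialized to taken; a c++ sketch (table size and indexing are assumptions):

	#include <cstdint>
	#include <array>

	struct TwoBitPredictor {
		std::array<uint8_t, 64> ctr; //0,1 = not taken; 2,3 = taken

		TwoBitPredictor() { ctr.fill(2); } //"start T": begin weakly taken

		bool predict(uint16_t ip) const { return ctr[ip % 64] >= 2; }

		void update(uint16_t ip, bool taken) {
			uint8_t& c = ctr[ip % 64];
			if (taken  && c < 3) ++c; //saturate at strongly taken
			if (!taken && c > 0) --c; //saturate at strongly not taken
		}
	};

the point of the 2 bits is that a single anomalous branch direction does not immediately flip the prediction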

CALs and RETs are always predicted taken and operate using the RAS
this prediction scheme makes a fast LR unnecessary as it is implemented via a microarchitectural feature
typically that's all there is to it, CALs and RETs are always taken so there is no problem
however this ISA allows them to be executed conditionally, which comes mostly from how easy it was to encode
//in retrospect, it may have been smarter to use the extra bits for a greater address range and provide a separate conditional relative jump
as long as the not taken path is very rare it is fine to use this, for anything else it is bad for performance
unconditional JMPs, CALs and RETs are always predicted taken
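
the RAS itself is tiny; a c++ sketch (the depth and the wrap around on overflow are assumptions):

	#include <cstdint>
	#include <cstddef>
	#include <array>

	struct Ras {
		std::array<uint16_t, 8> stack{}; //8 entries is an assumed depth
		std::size_t top = 0;

		//CAL: push the address the matching RET should resume at
		void on_cal(uint16_t return_ip) {
			stack[top] = return_ip;
			top = (top + 1) % stack.size(); //overflow silently overwrites
		}

		//RET: predicted target is whatever the last CAL pushed
		uint16_t on_ret() {
			top = (top + stack.size() - 1) % stack.size();
			return stack[top];
		}
	};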

indirect branches require knowing the target address early enough, so the CPU just waits until that address is available
and when it is, it knows where to jump
this introduced a massive speedup for g05t05.asm which improved by almost 200 cycles
with improved code it is possible to reduce that even further, in total bringing execution time down from 1923 to 1441 cycles
improved code means more time between when the destination is available and the actual jump
//I am genuinely shocked by how much it gains
//in real processors the need for a BTB comes mostly from the fact that the branch target is not known early enough

verification happens at retire, when the branch direction is compared against the flag register (also updated at retire)
if the branch direction does not agree, a flush is performed
one exception: when a branch is correctly predicted it can be retired while being second-to-retire, not first
if specific conditions allow it to be known by then, it is retired
//the specific conditions are not interesting, they just make sure nothing breaks
after a misprediction is detected, the Register Allocation Table is restored to its committed state
then execution begins from the last known good instruction (see Flush Mechanism)
such recovery is fine for a small ROB, but for more aggressive CPUs with 100s of ROB entries this is a terrible approach
the MIPS R10000 implements 4 such "commit" states, so up to 4 branches can be in flight at once
then if a misprediction is detected the state of the machine is restored to the corresponding branch's "commit" state
//there is also the approach of unwinding register mappings but, honestly, I don't know how that works in detail
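
the recovery reduced to a c++ sketch (register counts are made up; the R10000 variant would keep several committed copies instead of one):

	#include <cstdint>
	#include <array>

	struct Rat {
		std::array<uint8_t, 8> spec;      //arch reg -> physical reg, speculative
		std::array<uint8_t, 8> committed; //mapping as of the last retired instr

		void on_rename(uint8_t arch, uint8_t phys) { spec[arch] = phys; }
		void on_retire(uint8_t arch, uint8_t phys) { committed[arch] = phys; }

		//one committed state => the whole speculative window is thrown
		//away, fine for a small ROB, terrible with 100s of entries
		void on_mispredict() { spec = committed; }
	};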

all in all, a particular program /*I really should've written down what was what back then*/ originally had 256 misses and took 1562 cycles
with the 2BC branch predictor it has 3 misses and takes 1055 cycles


PERFORMANCE EVALUATION

//as a performance freak, this is THE MOST satisfying part of the entire project

Performance Counters

despite what the simulator provides, it is inconvenient to try to replicate implementations, especially complex ones, in another simulator
//that said, the simulator in c++ runs significantly faster than the implementation in vhdl so replication is not a bad idea
therefore the CPU has some counters that record how many cycles there were with a specific event
this is very useful to identify bottlenecks and find what could potentially be optimized
current list:
some of these counters are more useful than others and there is always a need for more and more detailed counters
unfortunately they are still usually not able to capture the real bottleneck
//or maybe more likely, I am not able to properly identify one
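
each counter is conceptually trivial, one register incremented whenever its event is active that cycle; a c++ model sketch:

	#include <cstdint>

	//the event is whatever condition the counter watches
	struct EventCounter {
		uint64_t cycles_with_event = 0;

		void tick(bool event_this_cycle) {
			if (event_this_cycle)
				++cycles_with_event;
		}
	};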

Benchmarks

quicksort:
instructions: 1437
simple : 1809 cycles
OOE : 1272 cycles

IPC : 1.13
OOE/simple : 0.70

matrix multiply:
instructions: 322
simple : 437 cycles
OOE : 250 cycles

IPC : 1.29
OOE/simple : 0.57

simulator:
instructions: 2332
simple : 2662 cycles
OOE : 1570 cycles

IPC : 1.49
OOE/simple : 0.59

after slight source code modifications, the simulator can be run in 1441 cycles, making it run in ~0.54 of the simple one's time with IPC ~ 1.62
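(for reference, IPC here is instructions / OOE cycles, e.g. 2332 / 1441 ≈ 1.62, and OOE/simple is the ratio of cycle counts, e.g. 1441 / 2662 ≈ 0.54)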

one problem with benchmarks is that it is usually not possible to make the same program run well on all implementations
for example, additional alignment is potentially very useful for OOE as it may make branches effectively free, while simple will be punished by additional NOPs to execute
that said, usually OOE is so much faster than the simple implementation that the slowdown is within noise range
//despite there being no noise

THE GOOD

the absolute best thing is that the speedup is very close to 2x, which is about the maximum for 2 way superscalar execution
in other words, the machinery does what it is supposed to do

many mechanisms are very reliable and have not failed even once after initial implementation
mostly: dependency management, execution units, PR reference counting, branch verification

I learned a lot about how features really work and not just high level overview

THE BAD

the only major flaw is the project organization which really, really sucks
it improved somewhat but most features are in a single file
most cannot be turned off (though not all) or debugged on their own

THE UGLY

performance extraction is somewhat fragile and it is hard to reason about why something breaks
a significant portion of this comes from how sensitive fetch is to alignment

some mechanisms are ugly and barely work and I do not trust them
I do not want to add more tests to discover bugs caused by them or spend any more time on them

there should be many more tests but the simulation is so slow
that tests should be split into 2 groups: all tests and often failing tests

