Linter running on my bootloader result Code https://github.com/enathang/riscv-linter

Introduction

Recently, I’ve started writing an Operating System. I have been following Sarah Lewis’s tutorial and MIT’s xv6 OS textbook. Both encourage the use of assembly for parts of the OS (bootloader, certain trap handling, etc.) This was my first time writing in assembly. As I was writing and researching more about RISC-V assembly, I learned that the assembly should follow RISC-V’s ABI. The Application Binary Interface standardizes how assembly should be written, such as which registers can be overwritten, which registers should not, etc. This standardization allows interoperability between RISC-V assembly written by different sources. In my case, the interoperability between my bootloader assembly and my kernel C code compiled (by clang) to assembly.

Overview of RISC-V ABI

The RISC-V ABI (defined here) defines what assumptions a programmer/compiler can make when writing assembly by defining a contract between the state of the CPU and all assembly code.

Specifically, during execution, the CPU maintains the state of all registers. Say we have an assembly function ¹function1 which stores intermediate values in register x1 and x2, then calls function2. Function2 does not know which registers function1 is using and therefore overwrites x1 and x2 with new values, destroying data of function1. When function2 returns, function1 will continue operating with incorrect values. It’s true function1 could store the data in the stack before calling the other function (not a bad idea). However, pushing to the stack for every function call is non-performant and clutters the code.

The natural solution is to separate the registers into two categories: volatile and non-volatile registers. Volatile registers are registers that cannot be assumed to be safe unless the assembly knows it locally defines the value in the register. Non-volatile registers are registers that all assembly functions promise to restore before returning, so we can assume values in these registers are save even after function calls (but must ourselves also restore them before returning from any of our own functions).

The ABI also defines which registers are used to pass return values between functions, useful aliases for registers (such as zero for the register hard-wired to 0 or t0...t6 for temporary registers), etc.

Example of non ABI-compliant assembly

Here’s an (non ABI-compliant) assembly function

# a0 = value to increment
_increment:
    # Load 1 into t0
    li t0, 1
    # Add a0 and t0, store in a2
    add a2, a0, t0

    ret

and here’s the C code that invokes it

extern _increment
int x = 5;
int y = _increment(5);

In the above example, we would expect y = 6. However, because the C code is compiled according to the ABI, it expects the return value from _increment to be in a0. However, our assembly returns the value in a2. You might wonder, what fool would try to use a2 as the assembly return register? Perhaps someone writing an OS for the first time. Whoops :) In my defense, a0 is also used for the first argument value passed in, so I was scared to overwrite it.

These types of ABI failures can be frustrating because assembly, being low level, will never inform you that you’ve done something wrong. In fact, assembly could not inform you of these issues if it wanted to, because it has no concept of them.

Another issue I’ve run into a few times (and is a pain to debug) is below.

Example of stylistic issue

_increment:
    li t0, 1
    add a2, a0, t0

_decrement:
    li t0, -1
    add a2, a0, t0

...

If you compare the example with the _increment definition before, you may notice something’s missing. The CPU will continue to execute instructions in order until it encounters a jump, branch, call, or return instruction. Barring those, if we call this definition of _increment from another part of our code, it will execute _increment, then _decrement, then whatever’s after it until it encounters a return statement. And the CPU would be correct in doing this, because that’s technically correct. The issue is that’s not what we intend the CPU to do, because we create a higher-level concept of a function that the assembler has no idea about.

It’s a difficult mistake to detect and debug because your registers suddenly have unexpected values or another part of your code unexpected executes and you have no idea why. Especially considering the bug changes based on the location of code in the file, something we might not even pay attention to.

As you might be starting to deduce, our issues are arising between the mismatch of the higher-level we are writing our assembly and the low-level in which the assembly actually executes. In fact, the ABI suffers from the same problem, as it defines things such as “calling conventions” which do not exist without higher-level concepts such as functions and closures. We want to define a higher-level abstraction to enforce these definitions, so that we can fix our current OS assembly and ensure future assembly is also compliant.

We need a linter.

Definitions

Per Wikipedia

“Lint is the computer science term for a static code analysis tool used to flag programming errors, bugs, stylistic errors and suspicious constructs.”

Notably bugs and suspicious constructs stand out as respectively describing our non-ABI assembly and missing jump/return statements.

It’s worth noting that suspicious constructs do not mean incorrect constructs. It’s perfectly valid assembly, and even sometimes quite useful, to allow parts of code to fall into each other. However, in my code, I want to always explicitly define where each parts of the code steps next through jump or return, even if that section is the next section in assembly.

Example of explicit vs implicit jumping

_func_entry:
    li t0, 1
    j _func_main

_func_main:
    li t1, 1
    add a2, t0, t1

    ret

seems more explicit and less dangerous than

_func_entry:
    li t0, 1

_func_main:
    li t1, 1
    add a2, t0, t1

    ret

even though they have the same behavior (assuming you don’t refactor the code and move things around, which is a questionable assumption).

We include some useful definitions below:

clobber: a technical term meaning to overwrite a register value with another value, presumably unknowingly
volating/non-volatile registers: my personal term referring to whether registers can be considered ‘safe from modification’ (s0 is safe) or not (t0, a0, etc.) Volatile means unsafe from modification, and non-volatile means safe, as defined in the RiscV ABI.

Of course, a more experienced assembly programmer would not run into these issues. But we probably shouldn’t rely on the mantra of “just write it right the first time” and anyway, I have to migrate the existing codebase to be ABI compliant. Therefore, it makes sense to explore using a RiscV linter to identify existing issues and pre-emptively identify new ones.

Existing tools

As far as my preliminary searching goes, there is not an existing RiscV linter (or really any assembly linter) publicly available. There’s probably a few reasons for the absence:

few people write in assembly
the people that do write assembly already know how to write good assembly
if assembly linter tools exist they’re probably hidden in search results because assembly is a niche topic (see 1).

In addition, because an AST generates assembly and not the other way around, there is no parser for RiscV assembly. There’s not even a Treesitter grammar for assembly for syntax highlighting. Wild.

There is a straigntforward parser written by Austin Henley here. It’s a lookahead(1) parser, where it only keeps track of a state and peeking ahead one token for context. Good to know if we can’t find anything more applicable.

I know there are tools like YACC (yet another compiler-compiler), which can generate parsers/compilers from language specifications. If there’s YACC, is there YALC (Yet another linter compiler) or even TFLC (The first linter compiler)? It looks like Coala exists as a language-agnostic linter framework. However, it does not seem to be actively maintained and the documentation to onboard new languages is sparse.

Writing a linter

So, we need (okay, want) a linter and one doesn’t exist already. Let’s just write one!

Intuition for the linter

Thus far we’ve been pretty hand-wavey about what we mean by a “function” and other higher-level definitions. That’s all well and good until we actually try to turn it into code.

Inspired by LLVM, we define the following terms:

A code block is a label followed by one or more instructions. A code block continues (has a closure) until the next label is found, regardless if it’s a code block or not.
A function call is denoted by the call instruction
A control flow statement is defined by the j or ret instruction. Note beq is not considered a control flow statement (even though it is one) because it’s a conditional jump and may let the code continue through. Therefore, we find it useful to not consider it as one for our code.

Notably, even though in certain parts of my code I define a function using multiple code blocks, for convenience the linter will consider each code block a separate function. This decision has the benefit of making the linter much simpler to write (since we don’t need to keep track of which code blocks call other code blocks to create function definitions), it has the drawback of throwing warnings when the assembly construction may be perfectly fine.

Linter specification

Use case 1

Given a set of assembly symbols, check that each symbol has a jump or return statement as the last statement before the next symbol.

Code example

# a0 = number to add
_increment:
    li t0, 1
    add a0, a0, t1

    # Note we forgot to add a ret instruction here

_decrement:
    li t0, -1
    add a0, a0, t1

    ret

_func:
    li a0, 5
    call _increment

    # we expect a0=6, but from our code a0=5
    ret

The bug

Because we forgot to add a return statement to the _increment function, the CPU will naively (though technically correctly) step through to the _decrement function. When the function returns, a0 will not have been incremented.

How the linter should fix it

The linter should throw an error saying _increment has no explicit control-flow jump at the end (or perhaps just a warning).

Linter rule

We define a code block as a label followed by one or more instructions before another label. This definition distinguishes from a label used as a text segment, such as _stack. We define an explicit control flow jump as one of [j, jr, ret]. We do not define call or jalr as an explicit control flow jump because ending a code block with one of those implies a return to the end of the code block.

Use case 2

Given a set of volatile and non-volatile registers (as defined explicitly by the ABI), check that any volatile registers are not read from until they have been considered “safe” and non-volatile registers are restored before exiting.

Code example 1

_func:
    li s0, 1
    add a0, a0, s0 
   
    # Note we do not restore the value of s0 before returning, breaking the ABI
    ret

The bug

Any code that calls _func will have s0 clobbered, even though the ABI guarantees s0 will be safe from being overwritten.

How the linter should fix it

The linter should throw an error saying s0 was not restored before the end of the block.

Linter rule

The linter will need to maintain a state of all non-volatile register values. As the linter lints instructions, it will update the values of the non-volatile registers. If the values are not equal to the initial values in the code block, the linter will throw an error for each non-volatile value that is changed.

Code example 2

_func: 
    # Note the ABI considers t0 unsafe because it is a volatile register, until we have written a "safe" value to it in this code block
    add a0, a0, t0

    ret

The bug

Our _func reads from t0, but the ABI cannot guarantee what value t0 will have. So this function cannot have “expected” or “well-defined” behavior.

How the linter should fix it

The linter should throw an error saying t0 is unsafe to read from.

Linter rule

The linter will need to maintain a list of all “safe” and “unsafe” registers. At the beginning of each code block, the register status is reset. As the linter lints instructions, it updates the status of the registers. If an unsafe register is ever read from, the linter will throw a warning/error.

Use case(s) not covered by linter

_add:
    mv t0, a0
    push_onto_stack t0
    mv a0, t0
    call _print_register
    pop_off_stack t0

    ret

.macro push_onto_stack reg
li t0, 8
add sp, sp, t0
sd \reg, (sp)
.endm

...

Here is a somewhat insidious bug I’ve run into multiple times during OS development. Because the macro abstracts away the fact that t0 gets clobbered between the push_onto_stack and pop_off_stack the _print_register call will always print 0 despite any initial value of t0. There’s no easy way to lint this, because the macro is substituted during pre-processing and we can’t really make the same general claims to macro usage as we can to call/ret. The solution here would probably be to create a new macro grumble grumble.

Turning it into code

+----------------------+                 +-----------+
| RISC-V documentation | --------------> | RiscvDict |
+----------------------+                 +-----------+
                                               |
                                               V
+------+      +-------+     +--------+     +--------+      +------------------+
| Code |  --> | Lexer | --> | Parser | --> | Linter | <--> | Code Block state |
+------+      +-------+     +--------+     +--------+      +------------------+

Ideally, we would like to parse the assembly into our high-level constructs, such as an AST, and then define rules to lint the AST. That way, we could easily define new rules based on a common interface. However, as previously mentioned in the existing tools section, I couldn’t find a tool to convert assembly to AST. Such a project shouldn’t be hard, especially with some of the existing parser tools, but sometimes it’s better to have a passable tool that works for your specific use case than a semi-finished optimal tool that nobody will use anyway. So, let’s hack someething together to see what we’re dealing with first and we can always abstract things later.

Parser and CodeBlock state

(2 t. ago) (prev token) (token) (next token)                                        (2 t. ago)
                           |  ,------'                                             (prev token)
                           V  V                             update(opcode,t0)         (token)
                      +--------+    (token)     +--------+ ------------------> +------------------+
                      | Parser | -------------> | Linter |                     | Code Block state |
                      +--------+                +--------+     checkValid()    |------------------|
                                                     ^  |--------------------> | t0: unsafe       |
                                                     |         response        | t1: safe         |
                                                     |------------------------ | ...              |
                                                                               | hasSeenRet = 0   |
                                                                               +------------------+

Austin’s parser (thanks Austin!) only looks ahead one token at a time to determine context. This approach is fine for a simple grammar but can’t scale to large context grammars like assembly. Specifically we have to look waaaay into the future to see if there’s a return statement or not for a function, etc.

To circumvent the limitations of the parser, we create our own custom stack to maintain our own state and update accordingly. Basically we turn the parser from lookahead(1) to lookahead(untilISeeTheEndOfTheFunction). This allows us to expand the context of our grammar without expanding the parser itself.

We define a Code Block state. This state contains the list of all safe/unsafe registers, whether we’ve seen a jump statement yet, etc. The linter reads from this state to determine if an instruction is valid (if all the read registers are “safe”) and writes to this state (updates registers as safe/unsafe based on instructions seen). If the linter sees a label, it reads the Code Block state and validates that all non-volatile registers were restored properly etc.

RiscV Dict

In order to lint RiscV, we need to know the registers read from and written to for each instruction definition. RiscV has A LOT of instructions, so I decided to grab the existing instruction definitions from the RiscV documentation and parse them programatically. This approach allows the linter to seamlessly update with new instruction definitions as they’re published. Unfortunately, the documentation I’m using isn’t perfect and doesn’t include aliased instructions (such as call instead of jalr). I’ve hardcoded the common ones into the dictionary. I realize I could just as easily add them to the documentation, but this approach leaves space for an alias document to be added neatly in the future.

You can read the code yourself at https://github.com/enathang/riscv-linter

Annotations

One high-level aspect not covered is pushing to/popping off the stack. As our linter stands, it assumes all stack pushes/pops are safe and well-mannered. One could add annotations to the linter to be explicit when values are pushed/popped, which would allow for more information for linting.

Example code annotation

.macro push_ra
#:+ ra
sd ra, 0(sp)
addi sp, sp, 8
.endm

.macro pop_ra
#:- ra
addi, sp, sp, -8
ld ra, 0(sp)
.endm

In fact, the parser and linter starts to implement these annotations, but as of this post does not finish or use them due to time constraints.

Conclusion

The linter code is contained at https://github.com/enathang/riscv-linter . Feel free to take a look, modify it, roast it, or toast it. I may continue to work on it, but art is never truly finished, merely abandoned.

My OS

Since we wrote this tool for a purpose, it’s worth seeing the payoff.

Linter running on my bootloader result

Pretty cool!

Is any of this necessary?

By now, you should understand we are enforcing a set of style contraints on the RISC-V assembly to create a layer of abstraction. This layer basically creates the explicit enforcement of a call stack in the assembly. The question remains: why bother? If you’re enforcing most or all of the requirements of a higher level programming language like C (amusingly in this scenario C is considered to be “high level”), why not just use C?

I have no real response to that. Especially considering, barring corner cases, the C compiler will optimize the assembly to be more efficient than anything I would write. Perhaps certain programs are more ergonimic to write in assembly (OS bootloader comes to mind) and need to be verified that they still comply with the RISC-V ABI. Still, sometimes you need to journey up a mountain before you can say “eh, it was better in the photos " ;)

Glossary of questions that arose during this project

Do assembly labels need to be uniquely named?

Yes. Each need to be uniquely named within the global scope of the assembler/linker (linker may depend on if symbols are statically or dynamically linked). One way to circumvent this is a context stack (https://www.nasm.us/xdoc/2.15.05/html/nasmdoc4.html#section-4.7) but this solution simply offers a higher level way of specifying names that will be switched to unique names in compilation/assembly.

Put another way, the requirement still exists, but context stacks allow us to push it down the abstraction layer stack to the point where we don’t care about it, as we do with most other things such as labels allow us to do with physical addresses.

What does ‘good’ RiscV assembly look like?

I’m not sure if this linter is indicative of good abstraction for assembly or simply that I’ve been trained in “high-level programming languages” so I’d be interested to see what professional RiscV assembly looks like and if they follow a similar abstracted structure.

A recurring issue with nomenclature is assembly doesn’t explicitly have a native “function” type. Rather, we define a function as a set of assembly code that is designed to be jumped to (with a linked return address), perform some computation, and then jump back to the linked return address. A function therefore is defined by the properties it has, rather than a technical term. Hopefully it’ll make sense by the 700th reference to this distinction. ↩︎