Wednesday, January 31, 2018

Two Pass Versus One-and-a-half Pass Assembly

The label problem discussed last time is the reason that early assemblers were two pass assemblers. The first pass of assembly simply figures out what address each statement will be on while created a table of labels indicating what address each label referred to. The second pass generates the machine language instructions and replaces the labels with the addresses that were worked out during the first pass. The downside of this approach is you need to process the file twice or generate some type of intermediate assembly language file.

As we process an assembly language file, we know which statements need an address and which label they require. This means that we know where in the machine language output address information needs to be added. Keeping this information in memory is not that hard to do, especially for today’s machines which hold millions of times more memory than the resulting machine language file that we are trying to generate. This approach will require only a single pass over the source assembly language file but will require two passes over the machine language file that is being output.

This means that we would need some type of label class which would hold the label, what the actual address for the label is, and a list of links that use that label. As relative links need to be handled different from a regular link so this starts becoming a bit complicated. If we consider variables, a feature I would like in the future, then things get even more complex. The following enumeration covers the different types of labels that I have come up with.

enum class AssemblerLabelTypes {
    TARGET_VALUE,      // Holds the address or value that the label represents
    HIGH_BYTE,         // treat label as variable and use high byte of it
    LOW_BYTE,          // treat label as variable and use low byte of it
    RELATIVE_ADDRESS,  // for relative addressing mode so distance from target
    ZERO_PAGE_ADDRESS, // low byte of address as there should be no high byte
    ADDRESS            // treat as absolute address
}

This got me thinking that if I stored the label in a hash map then all I would need to do is have a list with a simple structure holding the information. Something like the following:

data class AssemblerLabel(val labelName:String,
                          val typeOfLabel:AssemblerLabelTypes,
                          val addressOrValue: Int,
                          val bank:Int = 0)

This, of course, has a redundant label which is not really necessary and is really just a waste of memory but I just couldn’t bring myself around to removing it at this point. I suppose it is something that should be done, but a label that doesn’t have a name does not seem right. If I refactor this code in the future, I would probably rename this as AssemblerLabelNode so the lack of a label string won’t bother me. That is something for the future, however.

Adding a label now simply becomes finding the map entry with the label name in it, creating an entry if one does not exist and adding the AssemblerLabel onto the list associated with that entry. I am using an ArrayList for my list though a linked list would probably be the more correct structure from a data structures viewpoint.  While ArrayLists seem to perform better, for the usage that the assembler does, the linked list would be as efficient

   fun addLabel(label:AssemblerLabel) {
        // see if already a label with this name
        if ( !labelList.containsKey(label.labelName)) {
            val resultList = ArrayList<AssemblerLabel>();
            resultList.add(label)
            labelList.put(label.labelName, resultList)
        } else {
            val resultList:ArrayList<AssemblerLabel> =  labelList[label.labelName]!!
            resultList.add(label)
        }
    }

This stores the labels during the pass over the source assembly file but still requires a second pass over the list, which will be covered next time.

Wednesday, January 24, 2018

Generating Machine Language and the Label Problem

Once you know the mnemonic,  value or address to use, and the addressing mode then generating the machine language for a line is trivial. You simply find the mnemonic and look up the addressing mode being used which gives you the op code. The op code is the first number to write followed by the value or address. The value that is written depends on the address mode, but there is an interesting factor here.

Implied and accumulator addressing modes do not have any additional bytes added as the value that the CPU uses is already in the register(s) being manipulated. I do consider the flags to be a register as well as the stack pointer being a register.

Immediate mode, relative mode, and the zero page variants all use a single byte. For relative mode you subtract the next instruction address from the target address to find the offset.  For the other modes, the address is used being stored with the low byte first followed by the high byte. This is known as little endian while storing the bytes in reading order is big endian. 

So, essentially you just need to know the length of the instruction to determine the bytes that get written out to the byte stream. This is what my machine language generator does.

    fun createAssemblyInstruction(opString:String, mode:AddressMode, target:Int):Array<Int> {
        val opCode = getOpcodeWithAddressMode(opString, mode);
        if (opCode == -1)
            return arrayOf()
        val instructionSize = m6502.commands[opCode].size
        if (instructionSize == 1)
            return arrayOf(opCode)
        else if (instructionSize == 2) {
            return arrayOf(opCode, target and 255)
        } else {
            val targetHigh:Int = (target / 256) and 255
            val targetLow = target and 255
            return arrayOf(opCode, targetLow, targetHigh)
        }
    }

This is fine for when you do know the address that a a branch or jump is going to be accessing, but what about people who want to use labels so they don’t have to manually figure out the address of everything? We already said that the labels at the start of a line get the current address of the machine language instructions being written so we simply need to use that. The problem is that we don’t always know the address before it is used, as the following example demonstrates:

           LDX #10
loop:      JSR doSomething
           DEX
           BNE loop
...
doSomething:
...
           RTS

This code clearly does something ten times. However, while the branch for the loop knows where the loop address is, the jump to subroutine (JSR) instruction does not know where it is supposed  to branch to. You could put the subroutine earlier in the code, but you don’t want to run it early so you would need to jump over that code so it doesn’t run and jumping forward would still require knowledge of an address that has not yet been discovered.


So, while we can generate code for known addresses, we have a problem with any code that has forward branches in it. The traditional approach is a two pass approach, but for my assembler I am going with a one and a half pass approach which will be explained next time.

Wednesday, January 17, 2018

Parsing 6502 Assembly

The tokens that make up the line being processed need to be converted into machine language. There are essentially 5 valid lines that can be in an assembly language file which are blank lines, labelled lines, labelled statements, directives, and statements.  Blank lines can simply be ignored. Labelled lines simply add the current address being processed to the list of links (more on that later). Labelled statement lines simply act like the label, then remove the label and act like a statement. Directives are assembler commands which are processed separately and will be covered in a later article. Finally, statements are what the assembler turns into machine language.

Assembly language has a very simple structure for the statements making it much easier than parsing a higher-level language. The format for a 6502 assembly language statement is: mnemonic [value]. The value is either a number that the command will act on immediately or is an address and the mode in which the address is going to be used in. If you know the number or addresses being used then this is simply the process of taking the base mnemonic and then looking at how the address was specified to determine the address mode to use. The following table demonstrates this.

MNEMONIC
EOL
IMPLIED
MNEMONIC
A
EOL
ACCUMULATOR
MNEMONIC
NUMBER
EOL
ABSOLUTE or RELATIVE or ZERO_PAGE
MNEMONIC
NUMBER
,X
EOL
ABSOLUTE_X or ZERO_PAGE_X
MNEMONIC
NUMBER
,Y
EOL
ABSOLUTE_Y or ZERO_PAGE_Y
MNEMONIC
#
NUMBER
EOL
IMMEDIATE
MNEMONIC
(
NUMBER
)
EOL
INDIRECT
MNEMONIC
(
NUMBER
)
,Y
EOL
INDIRECT_Y
MNEMONIC
(
NUMBER
,X
)
EOL
INDIRECT_X

As you can see, the different combinations of tokens form different address modes so simply finding which sequence of tokens is used but you will notice that there are three cases where different address modes use the same syntax. The distinction between absolute addressing (with or without X and Y) and zero page addressing is simply the address used. If the address will fit into a byte then the more compact zero page instructions can be used.

Determining if relative mode should be used is a bit trickier as you need to look up the mnemonic and determine if it has a relative address mode in which case you use that mode. We know this is correct as relative address modes are only used for branching instructions which only support the relative address mode.

We now have enough information that we can parse the statement and generate some machine language, which we will cover next week. 

Wednesday, January 10, 2018

Tokenizing

I finally got around to putting up my source on GitHub. The code is in a very rough state as it is more prototyping than production and it is a bit ahead of what I am posting but anybody who is interested in it can find it at github.com/BillySpelchan/VM2600 . My original plans were to self-host it on 2600Dragons, but that just seemed like too much work. Once I get to the point of getting the emulator running in JavaScript I may post it to 2600Dragons but that may be a few months or so.

As explained last week, tokens are broken down into a number of types which are stored in the AssemberTokenTypes enumeration. Tokens are effectively just a simple data structure so I take advantage of Kotlin’s data class support to define the class which is all that is needed.

data class AssemblerToken(val type:AssemblerTokenTypes, val contents:String, val num:Int)

The contents holds the actual string that forms this token while the value is the numeric representation of the token if appropriate or 0 otherwise. In the case of label declarations, the contents does not include the postpending colon.

The tokenize function is quite long but is simple. Essentially it simply loops through each character in the string that is about to be processed basing what to do based on the character encountered. Spaces, tabs, and semicolons as well as everything after them get converted into whitespace. Commas are matched with an x or a y. Brackets form the indirect start and indirect end tokens. The hash becomes an immediate token. The period becomes a directive token. The percent indicates the start of a binary number so the ones and zeros after it are converted into the appropriate base 2 number. Likewise, the dollar sign signifies hexadecimal numbers so the hex characters after it are taken and used to form a base 16 number. Numbers get converted into a decimal number. I could have taken the route of making an octal number if the first number starts with a 0 but don’t really use octal so never bothered.

Strings of letters form labels with a colon indicating the label is a link label. If not a link label, the text is checked against the list of mnemonics and if it matches becomes a mnemonic token.

When I was planning my tokenizer, I created the following diagram which explains it. This was probably not necessary as the logic is very straight forward, but this may help make the process easier to understand.



The tokens are stored in an array list of assembler tokens, with whitespace tokens not being added to the list unless the function is told to include it by passing false to the ignoreWhiteSpace parameter. Testing is simply verifying that passed strings produce the appropriate list of tokens.


Next we need to take the tokens and convert them into 6502 assembly which will start being covered next week.

Wednesday, January 3, 2018

Assembling the Tokens

Assemblers are easier to write than a complier would be as the structure of the code is a lot simpler. There are still a lot of similarities between a compiler and an assembler that some of the common techniques used to write a compiler are also useful for writing an assembler. With a compiler you first tokenized the source code (Lexical Analysis), parse the tokens into something meaningful (syntax and semantic analysis), generate some intermediate code which then finally generate the machine code. With the assembler, I am taking a similar approach. I am tokenizing the source code, parsing it into intermediate machine language, then generating the final machine language.

Java does have some tokenizer classes that will do part of what I want done but while it is possible to use Java classes in Kotlin I am not sure how well those will work if I try to compile my Kotlin code to JavaScript, which is something I do want to do eventually. For this reason, and my NIH syndrome kicking in, I opted to write my own tokenizer for the assembler. I created a simple enumeration to hold the different types of tokens that my assembler will use, though suspect that I will be adding some new ones later.

enum class AssemblerTokenTypes {
    DIRECTIVE, ERROR, IMMEDIATE,
    INDEX_X, INDEX_Y,
    INDIRECT_START, INDIRECT_END
    LABEL_DECLARATION, LABEL_LINK,
    NUMBER, OPCODE, WHITESPACE }

A DIRECTIVE is a command for the assembler. There are a number of different ways that these can be handled but I am going to start my directives with a dot. I have not worked out all the directives that I plan on supporting but at a minimum I will need .ORG, .BANK, .BYTE and .DATA with .INCLUDE and .CONST being nice to have. More on these when I actually get to the directives portion of my assembler.

ERROR is an indication of a tokenization error which kind of breaks the assembling of the file. Invalid characters used would be the likely culprit.

Some of the 6502 instructions have an immediate mode that lets the programmer specify a constant value to use in the next byte. This is indicated by prefacing the constant with the hash (#) symbol. The tokenizer simply indicates that an immediate mode value is going to be next by having an IMMEDIATE token.

The 6502 supports offsets of an address using “,X” or “,Y” so the tokens indicate such an index is being used. These indexes are used for zero page indexing, indirect indexing, as well as your normal indexing which is officially called absolute indexing. The particular type of indexing address mode that will be used will be determined by the parser which will be covered later.

Indirect addressing modes prefix the address with an open bracket and postfix the address with a closed bracket. To indicate this the INDIRECT_START, and INDIRECT_END tokens are used.

It is certainly possible to write an assembler that does not track the addresses of locations for you but requires you to know all the addresses that you are using but one of the reasons that assemblers were invented was to remove this busywork. This means that we need to have some type of support for labels in our assembler. Most 6502 assemblers will indicate the location within the code by having a label that is followed by a colon at the beginning of the line. This  is indicated by the  LABEL_DECLARATION token with LABEL_LINK tokens being used for links within the code.
As assembly language revolves around numbers, we obviously need a NUMBER token. This is a special token for processing as I am supporting binary, decimal, and hexadecimal formats for numbers. My Machine Architecture teacher will probably be upset that I am not including support for octal numbers but I never use that format in code so didn’t see the point in adding that. I am using the pretty standard 6502 convention of representing hex numbers by prefixing them with a $ and by prefixing binary numbers with a % symbol. Supporting binary is not vital but very handy to have, especially for a machine like the 2600 where you are doing a lot of bit manipulation.

While I probably should have used the term MNEMONIC instead of OPCODE for the enumeration, I often call the mnemonic an op code even though technically the op code is the actual numeric value that the assembler ultimately converts the mnemonic into. Should I change this in my code, probably. Will I?

Finally, WHITESPACE is the spaces, tabs, and comments. In most assemblers comments are designated with a ; so that works fine for me. Most the time the whitespace characters will be ignored so I could arguably not have a token for whitespace and simply ignore it.

Now that we have the tokens out of the way, we need to write the tokenizer.