Assemblers are easier to write than a compiler because the structure of the
source code is a lot simpler. There are still enough similarities between a
compiler and an assembler that some of the common techniques used to write a
compiler are also useful for writing an assembler. With a compiler you first
tokenize the source code (lexical analysis), parse the tokens into something
meaningful (syntax and semantic analysis), and generate some intermediate code,
from which the machine code is finally generated. With the assembler, I am
taking a similar approach. I am tokenizing the source code, parsing it into
intermediate machine language, then generating the final machine language.

Java does have some tokenizer classes that will do part of what I want done,
but while it is possible to use Java classes in Kotlin, I am not sure how well
those will work if I try to compile my Kotlin code to JavaScript, which is
something I do want to do eventually. For this reason, and my NIH syndrome
kicking in, I opted to write my own tokenizer for the assembler. I created a
simple enumeration to hold the different types of tokens that my assembler will
use, though I suspect that I will be adding some new ones later.

enum class AssemblerTokenTypes {
    DIRECTIVE, ERROR, IMMEDIATE,
    INDEX_X, INDEX_Y,
    INDIRECT_START, INDIRECT_END,
    LABEL_DECLARATION, LABEL_LINK,
    NUMBER, OPCODE, WHITESPACE
}

A DIRECTIVE is a command for the assembler. There are a number of different
ways that these can be handled, but I am going to start my directives with a
dot. I have not worked out all the directives that I plan on supporting, but at
a minimum I will need .ORG, .BANK, .BYTE and .DATA, with .INCLUDE and .CONST
being nice to have. More on these when I actually get to the directives portion
of my assembler.

ERROR is an indication of a tokenization error, which stops the file from being
assembled. Invalid characters in the source would be the likely culprit.

Some of the 6502 instructions have an immediate mode that lets the programmer
specify a constant value, which is stored in the byte following the
instruction's op code. This is indicated by prefacing the constant with the
hash (#) symbol. The tokenizer simply emits an IMMEDIATE token to indicate that
an immediate mode value comes next.

The 6502 supports offsetting an address by the X or Y register using “,X” or
“,Y”, so the INDEX_X and INDEX_Y tokens indicate that such an index is being
used. These indexes are used for zero page indexing, indirect indexing, and
normal indexing, which is officially called absolute indexing. The particular
indexed addressing mode will be determined by the parser, which will be covered
later.

Indirect addressing modes prefix the address with an opening bracket and
postfix the address with a closing bracket. To indicate this, the
INDIRECT_START and INDIRECT_END tokens are used.

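To make the immediate, index, and indirect markers concrete, here is a minimal sketch of how an operand could be scanned for them. It is not the tokenizer from this project: `OperandMarker` and `scanMarkers` are hypothetical names, the enum is a cut-down copy of the relevant token types so the sketch stands alone, and it assumes the usual 6502 convention of round brackets for indirect addressing.

```kotlin
// Hypothetical subset of the token types above, repeated so the sketch compiles on its own.
enum class OperandMarker { IMMEDIATE, INDEX_X, INDEX_Y, INDIRECT_START, INDIRECT_END }

// Walks an operand such as "($20),Y" and reports the addressing markers in order.
// The number or label text between the markers is skipped in this sketch;
// a full tokenizer would emit NUMBER or LABEL_LINK tokens for it.
fun scanMarkers(operand: String): List<OperandMarker> {
    val markers = mutableListOf<OperandMarker>()
    var i = 0
    while (i < operand.length) {
        val c = operand[i]
        // Look one character ahead so ",X" and ",Y" can be recognised (case-insensitive).
        val next = operand.getOrNull(i + 1)?.uppercaseChar()
        when {
            c == '#' -> markers += OperandMarker.IMMEDIATE
            c == '(' -> markers += OperandMarker.INDIRECT_START   // assumed bracket style
            c == ')' -> markers += OperandMarker.INDIRECT_END
            c == ',' && next == 'X' -> { markers += OperandMarker.INDEX_X; i++ }
            c == ',' && next == 'Y' -> { markers += OperandMarker.INDEX_Y; i++ }
        }
        i++
    }
    return markers
}
```

Under these assumptions, the operand `($20),Y` produces INDIRECT_START, INDIRECT_END, INDEX_Y, while `($20,X)` produces INDIRECT_START, INDEX_X, INDIRECT_END. The order of the markers is what lets the parser tell the two indirect indexed modes apart.
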
It is certainly possible to write an assembler that does not track the
addresses of locations for you and instead requires you to know all the
addresses that you are using, but one of the reasons that assemblers were
invented was to remove this busywork. This means that we need some type of
support for labels in our assembler. Most 6502 assemblers mark a location
within the code with a label followed by a colon at the beginning of the line.
This is indicated by the LABEL_DECLARATION token, with LABEL_LINK tokens being
used for references to a label from elsewhere in the code.

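As a rough illustration of spotting a label declaration, the sketch below checks whether a line begins with a name followed by a colon. The helper name `declaredLabel` is hypothetical, and the rule for what counts as a valid label name (a letter or underscore, then letters, digits, or underscores) is my assumption rather than something this assembler has settled on.

```kotlin
// Assumption: label names start with a letter or underscore and continue
// with letters, digits, or underscores. Leading spaces or tabs are allowed.
val labelDeclaration = Regex("""^\s*([A-Za-z_][A-Za-z0-9_]*):""")

// Hypothetical helper: returns the declared label name,
// or null when the line does not start with a label declaration.
fun declaredLabel(line: String): String? =
    labelDeclaration.find(line)?.groupValues?.get(1)
```

With this rule, `loop: DEX` declares the label `loop`, while `BNE loop` declares nothing, so the `loop` in that line would be tokenized as a LABEL_LINK instead.
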
As assembly language revolves around numbers, we obviously need a NUMBER token.
This token needs special processing, as I am supporting binary, decimal, and
hexadecimal formats for numbers. My Machine Architecture teacher will probably
be upset that I am not including support for octal numbers, but I never use
that format in code, so I didn’t see the point in adding it. I am using the
pretty standard 6502 convention of prefixing hex numbers with a $ and binary
numbers with a % symbol. Supporting binary is not vital but is very handy to
have, especially for a machine like the 2600 where you are doing a lot of bit
manipulation.

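The three number formats can be handled by looking at the first character and picking the radix from it. The following is only a sketch of that idea; `parseNumber` is a hypothetical helper name, and returning null for a malformed literal is my assumption about how an error would be signalled.

```kotlin
// Hypothetical helper: converts a numeric literal to its value,
// or returns null if the digits are not valid for the chosen base.
fun parseNumber(text: String): Int? = when {
    text.startsWith('$') -> text.drop(1).toIntOrNull(16)  // hexadecimal
    text.startsWith('%') -> text.drop(1).toIntOrNull(2)   // binary
    else -> text.toIntOrNull()                            // decimal
}
```

With this sketch, `$FF` yields 255, `%00001010` yields 10, `42` yields 42, and a malformed literal such as `$XYZ` yields null, which the tokenizer could turn into an ERROR token. Note that `toIntOrNull` also accepts a leading minus sign after the prefix, which a stricter tokenizer may want to reject.
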
While I probably should have used the term MNEMONIC instead of OPCODE for the
enumeration, I often call the mnemonic an op code even though technically the
op code is the actual numeric value that the assembler ultimately converts the
mnemonic into. Should I change this in my code? Probably. Will I?

Finally, WHITESPACE covers the spaces, tabs, and comments. In most assemblers
comments are designated with a ; so that works fine for me. Most of the time
the whitespace characters will be ignored, so I could arguably not have a token
for whitespace and simply ignore it.

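Treating a comment as part of the whitespace can be sketched as a single scan: consume spaces and tabs, and if a ; follows, consume the rest of the line too. `whitespaceLength` is a hypothetical helper name, and the sketch assumes a ; never appears inside anything other than a comment.

```kotlin
// Hypothetical helper: returns how many characters of whitespace start at `start`.
// Spaces and tabs are counted, and a ';' comment directly after them
// swallows everything up to the end of the line.
fun whitespaceLength(line: String, start: Int): Int {
    var i = start
    while (i < line.length && (line[i] == ' ' || line[i] == '\t')) i++
    if (i < line.length && line[i] == ';') i = line.length
    return i - start
}
```

A result of 0 means there is no whitespace at that position, so the tokenizer would move on to trying the other token types. For the line `LDA #1 ; load`, calling it at index 6 returns 7, covering the space and the whole comment.
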