Saturday, May 11, 2019

Building a Randogram Emulator

As I stated in last year's review of the PCG paper, Randograms are an informal way of looking at random number generators. They are generated by creating a 256x256 grid and then taking 32768 pairs of numbers,  plotting them on the grid. The code for doing this was given in that previous article and it is not too difficult so I will omit the drawing code. Since the numbers for the random number generator will be coming from assembly language code, the RNG component that provides the randogram it's data will need to be written. We will simply have the RNG have an array of 65536 elements that get looped through. Our 6502 interpreter will fill out this array when it is ran. This is pretty trivial code:

open class RNG {
val randomBytes = arrayListOf<Int>()
var indx = 0

init {
for (i in 0..65535) {
randomBytes.add(0)
}
}

open fun nextUByte():Int {
val temp = randomBytes[indx]
++indx
if (indx > 65535)
indx = 0
return temp and 255
}

}

We are now ready to build our Randogram Emulator's main display. This version is being done using JavaFX but could have been done in JavaScript. It would be nice if there was a standard cross-platform GUI library to use, but there are some third party attempts at such a beasts. Until there is a clear winner in this area, I will use JavaFX and plan on porting any GUI code to JavaScript when I am ready to get the js version of the emulator running. The GUI code simply sets up a canvas to draw the randogram on and a button for loading an assembly language file to be processed.

    override fun start(primaryStage: Stage?) {
randogram.generate()
randogram.drawToCanvas(canvas, 2.0)

val btn = Button("Test Assembly")
btn.setOnAction {handleLoadAsm()}
val root = VBox(10.0, canvas, btn)
val scene = Scene(root, 800.0, 600.0)

primaryStage!!.title = "JavaFX test"
primaryStage.scene = scene
primaryStage.show()
}

The button handler is where all the real work takes place. We will simply set up the emulator environment, then bring up a file dialog for obtaining the assembly language file that should be processed. As this is an internal tool, we are not doing any sanity checking on the contents of the assembly language file but instead are simply calling our assembler class and getting it to assemble the file. A proper program would check the results of the assembler to make sure the file assembled properly, but this is a quick and dirty test so this will not be done.

The code for running the machine language is very simple as we just run instructions until we hit a break instruction. The twist, and where we are cheating big time, is that when we run into a NOP instruction, we handle that instruction by taking the value of the accumulator and adding it to the RNG instances list of numbers. Once we have written a proper assembly testing framework, this will result in 65536 numbers being generated which will fill out the list.

Once the assembly language code has ran, we generate the randogram and update the display. We now have a way of testing our assembly language random number generators.

    private fun handleLoadAsm() {
var memoryManager = Cartridge()
var m6502 = M6502( memoryManager )
var byteData =  ByteArray(4096)

val fileChooser = FileChooser()
var assemblyFile = fileChooser.showOpenDialog(null)
if (assemblyFile.isFile) {
var assemblyList:ArrayList<String> = ArrayList(assemblyFile.readLines())

var assembler = Assembler(m6502)
assembler.assembleProgram(assemblyList)
for (cntrRom in 0..assembler.currentBank.size-1) {
byteData[cntrRom] = assembler.currentBank.readBankAddress(cntrRom).toByte()
memoryManager.write(cntrRom, assembler.currentBank.readBankAddress(cntrRom))
}

m6502.state.ip = 0
var ipAddress = m6502.state.ip
var indx = 0
while (memoryManager.read( ipAddress ) != 0) {
if (memoryManager.read( ipAddress )==0xEA) {
rng.randomBytes[indx] = m6502.state.acc
++indx
}
m6502.step()
ipAddress = m6502.state.ip
}
}

randogram.clear()
randogram.generate()
randogram.drawToCanvas(canvas, 2.0)
}

The problem we now have is that we need to write a testing framework for actually testing our random number generators. This will be done next fortnight.

Saturday, April 27, 2019

Building a 16-bit Linear Congruential Generator

The 16 bit seed version of the Linear Congruential Generator (LCG) is very similar to the 8 bit version in that every time it is called it returns a byte. The big difference is that the seed is 16 bits instead of 8 bits so we will have a longer time before numbers start repeating and a bit more randomness. To get the longer cycle (and more randomness) we need to use the upper byte of the results.

For those of you who haven't read the earlier blog entries in this series or who want a quick refresher, LCG uses the formula $S_n = A*S_{n-1}+C$ for getting the next seed so we are simply multiplying the seed by a constant A and then adding a constant C to the result. We will use 141
for A and 3 for C though other carefully chosen values could be used.

We already have the multiplication which we wrote last fortnight, so now all we have to do is use that multiplication routine and then add a constant to that value. This is very simple code as you can see.

; set up our constants for A and C
.EQU LCG_A 141
.EQU LCG_C 3
; create placeholder for RNG seed
seedLo: .BYTE 0
seedHi: .BYTE 0
; ...
; multiply16x8 setup and code would be put here
; ...

lcg16bit:
; load in the current value of the seed into the multiply registers
LDA seedLo
LDY seedHi
; now multiply by the A constant
LDX #LCG_A
JSR multiply16x8
; add the C constant to the reult
CLC
ADC #LCG_C
BCC lcg16bit_update_seed
INY
lcg16bit_update_seed:
; save the results in the seed
STA seedLo
STY seedHi
; put results in accumulator for return value
TYA
RTS

To test this simply call the random number generator 257 times and look at the sequence of numbers that are generated.

test:
JSR lcg16bit
; now lets do this 256 more times
LDX #0
test_loop:
TXA
PHA
JSR lcg16bit
PLA
TAX
DEX
BNE test_loop
NOP ; using this as easy to find separator
BRK

While this does demonstrate that it is working and gives us a longer sequence than the first method, it is not a very good test. For our needs we don't need statistically sound random number generation, but a simple test like generating randograms would be sufficient for getting a rough idea of how the generator was working. Randograms were covered in an earlier blog post but are essentially a 256 * 256 grid that you plot by taking two random numbers and plotting the point on the graph. This is a good idea but not really implementable on a 64k machine as the minimum size for holding the grid is 65536 bytes and we would still need some way to save the data. However, we have our own emulator so could cheat a bit to get the result. More on this next fortnight.

Saturday, April 13, 2019

Multi-byte Multiplication 8x16

When you get right down to it, there is an interesting property about the number of bits involved in a multiplication. The maximum value for a multiplication is the number of bits in the multiplicand plus the number of bits in the multiplier. This means that our 8x8 multiplier should produce a 16 bit number. Likewise a 16x16 bit multiply would properly need 32 bits to store the results. For my needs we don't care about overflow situations as for the random number generator but if you wanted proper accuracy chaining more bytes to have larger results is certainly something that could easily be done. What we need to have a 16 bit seed is to multiply a 16 bit number by an 8 bit number.

This is actually not that much different than the 8x8 multiplier that we created earlier. There are two big differences. First, we are going to need to shift a 16 bit number. This can be done simply enough using the carry flag just as was done with the multi-byte additions. The shift left would then look like:

CLC
ROL LOW
ROL HIGH

This would work just fine, but we can save a byte and a few cycles by getting rid of the CLC instruction and instead using the ASL instruction in the place of the first ROL instruction. This always rotates in a 0 while still putting the shifted out bit into the carry flag.

The second issue is how to deal with a 16-bit result. There are two solutions to this problem. We could store the results in some reserved memory (ideally in page 0) which if we were going to output a 24 bit number would be a requirement. With a little bit of register shifting, we can actually keep the results of the multiplication in the registers winch gives us a bit of a speed boost as well as reducing the bytes needed for the program and memory storage. The new multiplication routine now looks like this:

JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
multiplicandHi: 
.BYTE 0
multiplierLo: 
.BYTE 0
;
; Multiply method 
; A = multiplicand lo, Y = multiplicand Hi, X = multiplier
; result returned in A (lo) and Y (hi)
multiply16x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
STY multiplicandHi
; start result at zero
LDA #0
TAY
multiply16x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply16x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply16x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
TAX ; save low result in X register 
TYA ; transfer high result to Acc for second half of add
ADC multiplicandHi
TAY ; return high result to Y
TXA ; return low result to Acc
multiply16x8_adjust:
; Set up next power of 2
ASL multiplicandLo
ROL multiplicandHi
JMP multiply16x8_loop
multiply16x8_done:
RTS

test:

; 300 x 5 test
LDA #$2C
LDY #$1
LDX #$5
JSR multiply16x8

; fourty-two test
LDA #7
LDY #0
LDX #6
JSR multiply16x8

; LCG Mulitply step test
LDA #42
LDY #0
LDX #141
JSR multiply16x8

NOP
BRK

Here is a trace of the first test.

(0) 0:JMP $31
(3) 31:LDA #$2C ; A:$0 -> $2c
(5) 33:LDY #$1 ; Y:0 -> $1
(7) 35:LDX #$5 ; X:0 -> $5
(9) 37:JSR $6
(15) 6:STX $5 ; [5]:-> $5
(19) 9:STA $3 ; [3]:-> $2c
(23) c:STY $4 ; [4]:-> 1
(27) f:LDA #$0  ; A:$44 -> $0, Flags:32 -> $22
(29) 11:TAY ; Y:1 -> $0
(31) 12:LDX $5 ; X:5 -> 5
(35) 15:BEQ $30
(37) 17:LSR $5 ; [5]:$5 -> $2, C=1
(43) 1a:BCC $27
(45) 1c:CLC ; C=0
(47) 1d:ADC $3 ; A:$0 -> $2c
(51) 20:TAX ; X:5 -> $2c
(53) 21:TYA  ; A:$2C -> $0
(55) 22:ADC $4 ; A:$0 -> $1
(59) 25:TAY ; Y:0 -> $1
(61) 26:TXA  ; A:$1 -> $2c
(63) 27:ASL $3 ; [3]:$2C -> $58
(69) 2a:ROL $4 ; [4]:$1 -> $2
(75) 2d:JMP $12
(78) 12:LDX $5 ; X:$2c -> $2
(82) 15:BEQ $30
(84) 17:LSR $5 ; [4]:$2 -> $1
(90) 1a:BCC $27
(93) 27:ASL $3 ; [3]:$58 -> $B0
(99) 2a:ROL $4 ; [4]:$2 -> 4
(105) 2d:JMP $12
(108) 12:LDX $5, X:$2 -> $1
(112) 15:BEQ $30
(114) 17:LSR $5 ; [5]:1 -> 0, C=1
(120) 1a:BCC $27
(122) 1c:CLC ; C=0
(124) 1d:ADC $3 ; A:$2c -> $dc
(128) 20:TAX ; X:1 -> $dc
(130) 21:TYA  ; A:$dc -> $1
(132) 22:ADC $4 ; A:$1 -> $5
(136) 25:TAY ; Y:1 -> $5
(138) 26:TXA  ; A:$5 -> $dc
(140) 27:ASL $3
(146) 2a:ROL $4
(152) 2d:JMP $12
(155) 12:LDX $5, X:$dc -> $0, Z=1
(159) 15:BEQ $30
(162) 30:RTS 

As you can see that this works very similar to the original version with only a few changes. A better 8x8 multiply that returns a 16 bit result would be easy to do by simply using this version but always setting Y to 0 when calling. We now have a multiply routine that will allow us to implement both the 16-bit seed version of the LCG algorithm as well as the PCG algorithm. We will start implementing those next fortnight!

Saturday, March 30, 2019

Multi-byte Numbers

Just because the 6502 is an 8-bit processor does not mean that it is limited to dealing with numbers that are larger than 8-bits. You can have as many bits as your memory will support though dealing with larger numbers does add a lot of work that the program has to do. The purpose of the carry flag is for dealing with multi-byte numbers. The big issue is that you can only work with 8 bits at a time of the number. While you can use the x and y registers as temporary holders for parts of the larger number, you most likely will want to store that number somewhere in memory.

There are many ways of storing a multi-byte number but the two most common formats are big-endian and little-endian in reference to  Gulliver's Travels where there were groups of people who fought over whether it was proper to crack a soft-boiled egg from the big end (the big-endians) or the little end (little-endians). CPU makers have been having a similar fight with the decision being should the high order byte be stored in the lower address followed by the lower order byte(s) which we call big-endian or should we do this with the lowest order byte first followed by the higher bytes.


The 6502 is considered to be a little-endian architecture due to the way memory addresses are stored. This is not necessarily how you need to store multi-byte numbers in memory, but this tends to be the convention for 6502 programmers. However, since the registers are only 8 bits, it really does not matter what convention you use as long as you are consistent. Here is a sample of doing a 16-bit addition.

JMP test

; Big Endian variables
BE_NumberAHi: .BYTE $3
BE_NumberALo: .BYTE $E8
BE_NumberBHi: .BYTE $3
BE_NumberBLo: .BYTE $E8

; Little Endian Variables
LE_NumberALo: .BYTE $E8
LE_NumberAHi: .BYTE $3
LE_NumberBLo: .BYTE $E8
LE_NumberBHi: .BYTE $3

test:
; Big endian A = A + B
LDA BE_NumberALo
CLC
ADC BE_NumberBLo ; add low btye
STA BE_NumberALo
LDA BE_NumberAHi ; add high byte
ADC BE_NumberBHi ; using carry from prev add.
STA BE_NumberAHi
; Little endian A = A + B
LDA LE_NumberALo
CLC
ADC LE_NumberBLo ; add low btye
STA LE_NumberALo
LDA LE_NumberAHi 
ADC LE_NumberBHi ; add high byte
STA LE_NumberAHi ; using carry from prev add.
NOP
BRK

As you can see there is no real difference which order you store the data in but this is not the case with processors that have larger registers. Multi-byte math is done by taking advantage of the carry flag as the ADC instruction will add an additional 1 to the result of the add if the carry flag is set. This is why you clear the carry flag before adding the low byte, you then add the higher bytes. Note that you can have as many bytes as you want in your number as long as you continue to chain the bytes.

The same procedure for multi-byte addition works for rotation and subtraction, though with subtraction you need to set the carry flag (SEC) not clear it. We will be seeing a demo of multi-byte rotation next fortnight when I implement my 8x16-bit multiplier.

Saturday, March 16, 2019

Shifted Multiply

Adding numbers together in a loop works for small numbers but the execution time quickly increases as the multiplier increases. Another way of multiplying numbers is by shifting them. Each shift to the left is the equivalent of multiplying by 2. This means that if we shift to the lefft 3 times we have effectively multiplied by 8. The problem is that this only works with powers of two. When you think about it, though, 5x3 is the same as 5x2+5x1. Likewise, 7x6 is just 7x4+7x2.

This means that we cam drastically improve multiplication by simply looping through the bits shifting the multiplier to the right and seeing if it is set. If it is we add the current value of the shifted multiplicand to the result. We then shifting the multiplicand to the left to multiply it by the next power. This is repeated until the multiplier becomes 0. The code that does this is as follows:

; brute force 8-bit multiplication
JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
multiplierLo: 
.BYTE 0
;
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply8x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
; start result at zero
LDA #0
multiply8x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply8x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply8x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
multiply8x8_adjust:
; Set up next power of 2
ASL multiplicandLo
JMP multiply8x8_loop
multiply8x8_done:
RTS

test:
; fourty-two test
LDA #7
LDX #6
JSR multiply8x8
; 10*20 test
LDA #10
LDX #20
JSR multiply8x8
; LCG Mulitply step test
LDA #42
LDX #141
JSR multiply8x8
NOP
BRK

When ran, we get the following trace for the first test. The second and third tests are pretty similar.

(0) 0:JMP $22
(3) 22:LDA #$7 ; A:0 -> 7
(5) 24:LDX #$6 ; X:0 -> 6
(7) 26:JSR $5
(13) 5:STX $4 ; [4]:0 -> 6
(17) 8:STA $3 ; [3]:0 -> 7
(21) b:LDA #$0 ; A:7 -> 0
(23) d:LDX $4 ; X:6 -> 6
(27) 10:BEQ $21
(29) 12:LSR $4 ; [4]:6 -> 3
(35) 15:BCC $1B
(38) 1b:ASL $3 ; [3]:7 -> $E
(44) 1e:JMP $D
(47) d:LDX $4 ; X:6 -> 3
(51) 10:BEQ $21
(53) 12:LSR $4 ; [4]:3 -> 1, C=1
(59) 15:BCC $1B
(61) 17:CLC ; C=0
(63) 18:ADC $3 ; A:0 -> $E
(67) 1b:ASL $3 ; [3]:$E -> 1C
(73) 1e:JMP $D
(76) d:LDX $4 ; X:3 -> 1
(80) 10:BEQ $21
(82) 12:LSR $4 ; [4]:1->0, C=1
(88) 15:BCC $1B
(90) 17:CLC ; C=0
(92) 18:ADC $3 ; A:$E -> 2A
(96) 1b:ASL $3 ; [3]:$1C -> $38
(102) 1e:JMP $D
(105) d:LDX $4 ; X:1 -> 0, Z=1
(109) 10:BEQ $21
(112) 21:RTS 

The call starts at 3 and runs until 118 for a 115 cycle time. This is slightly worse than the brute force way, but if we continued with the other two tests we would find that the 10x20 test takes 163 cycles which is significantly faster than the 272 cycles for the brute force multiply. And the 42x141 test only took 245 cycles compared to 1845 cycles that the brute force method required.

We now have everything necessary to create a LCG random number generator, which we will be doing next fortnight.

Saturday, March 2, 2019

Brute-forcing 8-bit multiplication

As the 6502 does not have instructions for multiplication, if we want to do any multiplication then we need to write our own routine to multiply numbers. There is a very easy way to do this, but it is not efficient. Still, lets take a look at the obvious brute force method before working out a more efficient way of doing this.

In the discrete math courses I took, multiplication was defined as adding a number which is called the multiplicand by itself the number of times specified by another number which is called the multiplier. We have an add instruction and we have looping constructs to repeat the addition a number of times. To make this a proper function we need to have a way of getting information into the function and returning the results. In this case, we have two bytes so we will have the multiplicand in the accumulator and the multiplier in x returning the result in the accumulator.

The following code is a very rough way of doing this but is not what I would consider production code which would check for the 0 case, and would have a nicer loop. As I am planning on writing a better version, this will do for demonstration purposes.

; brute force 8-bit multiplication
JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
;
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply:
STA multiplicandLo
multiplyLoop:
DEX
BEQ multiplyDone
CLC
ADC multiplicand
JMP multiplyLoop
multiplyDone:
RTS

test:
; fourty-two test
LDA #7
LDX #6
JSR multiply
; 10*20 test
LDA #10
LDX #20
JSR multiply
BRK


The trace results shows us pretty much what we would expect. The first test does take a while to run as it starts at cycle 3 and returns at cycle 93 (the instruction after the RTS) giving us a 90 cycle run time.

(0) 0:JMP $12
(3) 12:LDA #$7 ' A: 0 -> 7
(5) 14:LDX #$6 ' X: 0 -> 6
(7) 16:JSR $4# 
(13) 4:STA $3# ' [3]: 0 -> 7
(17) 7:DEX ' X:6 -> 5
(19) 8:BEQ $11
(21) a:CLC 
(23) b:ADC $3 ' A: 7 -> $E
(27) e:JMP $7
(30) 7:DEX ' X: 5 -> 4
(32) 8:BEQ $11
(34) a:CLC
(36) b:ADC $3 ' A: $E -> $15
(40) e:JMP $7
...
Additional loops removed as you get the idea!
...
(75) b:ADC $3 ' A: $23 -> $2A
(79) e:JMP $7
(82) 7:DEX ' X: 1 -> 0; Z set
(84) 8:BEQ $11
(87) 11:RTS

A slightly larger 10 * 20 multiply, however, does not give us that good of results.

(93) 19:LDA #$A ' A: $2A -> $A 
(95) 1b:LDX #$14 ' X: 0 -> $14
(97) 1d:JSR $4
(103) 4:STA $3 ' [3] 7 -> $A
(107) 7:DEX ' X: $14 -> $13
(109) 8:BEQ $11
(111) a:CLC 
(113) b:ADC $3 ' A: $A -> $14
...
Additional loops removed as you get the idea!
...
(345) a:CLC
(347) b:ADC $3 ' A: $BE -> $C8
(351) e:JMP $7
(354) 7:DEX ' X: 1 -> 0; Z set
(356) 8:BEQ $11
(359) 11:RTS
(365) BRK

The time for this was 272 cycles which is tolerable but getting bad. For our random number generator we need to multiply by 141. This would result in 1845 cycles to perform the multiplication. Not a good result at all. Obviously we need to come up with a better way of handling multiplication. This will be covered next fortnight.

Saturday, February 16, 2019

Building a Simple trace utility

One of the early development tools for assembly language programmers was a trace tool. This was a simple tool that you ran and it would print out what the program was running at every step of the process. This type of tool could be considered the predecessor to the debugger. My eventual plans are to turn this tool into a debugger eventually. For developing the random number generator libraries I don't need that much of a utility. All that is really needed is the ability to load an assembly language file, assemble it, then run the file printing the command being executed and the state of the registers.
Trace utilities can be thought of as an automated single-step through the program that runs until a break statement is encountered. It is handy for seeing what is happening in non-linear programs such as the following demo:

    LDA #0
LDX #3
loop:
CLC
ADC #3
DEX
BNE loop
BRK

While the above example may be fairly simply to manually figure out, seeing what is happening line-by-line can still be useful for debugging as it is very easy to overlook something obvious, especially if it is a mistake that you have made yourself. It is also very handy for demonstrating what a piece of code is doing, which is also very important for explaining the mathematical functions that I will be writing in the remainder of this chapter.

People who have been following this series from the beginning will remember that I said that an emulator was essentially a disassembler that also had an interpreter attached to it. As we already have a dissasembler, and we have an emulator that can run just an individual instruction while storing the state of the registers in an accessible data class, the actual trace part of the trace utility is simply calling the dissasembler to get the assembly instructions, then running that step, then grabbing the register state and printing the results.

// run until break
m6502.state.ip = 0
var traceString = ""
var ipAddress = m6502.state.ip
while (memoryManager.read( ipAddress ) != 0) {
traceString = "(" + m6502.state.tick +") " +
ipAddress.toString(16) + ":" +
m6502.disassembleStep(m6502.state.ip)
m6502.step()
ipAddress = m6502.state.ip
traceString = traceString + "# A:" + m6502.state.acc.toString(16) +
", X:" + m6502.state.x.toString(16) +
", Y:" + m6502.state.y.toString(16) +
", Flags:" + m6502.state.flags.toString(16)
println(traceString)
}

When this is run on our sample program, we get the following results.

(0) 0:LDA #$0# A:0, X:0, Y:0, Flags:22
(2) 2:LDX #$3# A:0, X:3, Y:0, Flags:20
(4) 4:CLC # A:0, X:3, Y:0, Flags:20
(6) 5:ADC #$3# A:3, X:3, Y:0, Flags:20
(8) 7:DEX # A:3, X:2, Y:0, Flags:20
(10) 8:BNE $4# A:3, X:2, Y:0, Flags:20
(13) 4:CLC # A:3, X:2, Y:0, Flags:20
(15) 5:ADC #$3# A:6, X:2, Y:0, Flags:20
(17) 7:DEX # A:6, X:1, Y:0, Flags:20
(19) 8:BNE $4# A:6, X:1, Y:0, Flags:20
(22) 4:CLC # A:6, X:1, Y:0, Flags:20
(24) 5:ADC #$3# A:9, X:1, Y:0, Flags:20
(26) 7:DEX # A:9, X:0, Y:0, Flags:22
(28) 8:BNE $4# A:9, X:0, Y:0, Flags:22

Not the prettiest of output, but that can be easily tweaked. So with a very tiny amount of work, we now have a useful tool for writing and testing chunks of assembly language. We are now ready to start building the math libraries that we will be needing for writing our random number generators. We will start covering multiplication next fortnight.