Saturday, March 30, 2019

Multi-byte Numbers

Just because the 6502 is an 8-bit processor does not mean that it is limited to dealing with numbers that are larger than 8-bits. You can have as many bits as your memory will support though dealing with larger numbers does add a lot of work that the program has to do. The purpose of the carry flag is for dealing with multi-byte numbers. The big issue is that you can only work with 8 bits at a time of the number. While you can use the x and y registers as temporary holders for parts of the larger number, you most likely will want to store that number somewhere in memory.

There are many ways of storing a multi-byte number but the two most common formats are big-endian and little-endian in reference to  Gulliver's Travels where there were groups of people who fought over whether it was proper to crack a soft-boiled egg from the big end (the big-endians) or the little end (little-endians). CPU makers have been having a similar fight with the decision being should the high order byte be stored in the lower address followed by the lower order byte(s) which we call big-endian or should we do this with the lowest order byte first followed by the higher bytes.


The 6502 is considered to be a little-endian architecture due to the way memory addresses are stored. This is not necessarily how you need to store multi-byte numbers in memory, but this tends to be the convention for 6502 programmers. However, since the registers are only 8 bits, it really does not matter what convention you use as long as you are consistent. Here is a sample of doing a 16-bit addition.

JMP test

; Big Endian variables
BE_NumberAHi: .BYTE $3
BE_NumberALo: .BYTE $E8
BE_NumberBHi: .BYTE $3
BE_NumberBLo: .BYTE $E8

; Little Endian Variables
LE_NumberALo: .BYTE $E8
LE_NumberAHi: .BYTE $3
LE_NumberBLo: .BYTE $E8
LE_NumberBHi: .BYTE $3

test:
; Big endian A = A + B
LDA BE_NumberALo
CLC
ADC BE_NumberBLo ; add low btye
STA BE_NumberALo
LDA BE_NumberAHi ; add high byte
ADC BE_NumberBHi ; using carry from prev add.
STA BE_NumberAHi
; Little endian A = A + B
LDA LE_NumberALo
CLC
ADC LE_NumberBLo ; add low btye
STA LE_NumberALo
LDA LE_NumberAHi 
ADC LE_NumberBHi ; add high byte
STA LE_NumberAHi ; using carry from prev add.
NOP
BRK

As you can see there is no real difference which order you store the data in but this is not the case with processors that have larger registers. Multi-byte math is done by taking advantage of the carry flag as the ADC instruction will add an additional 1 to the result of the add if the carry flag is set. This is why you clear the carry flag before adding the low byte, you then add the higher bytes. Note that you can have as many bytes as you want in your number as long as you continue to chain the bytes.

The same procedure for multi-byte addition works for rotation and subtraction, though with subtraction you need to set the carry flag (SEC) not clear it. We will be seeing a demo of multi-byte rotation next fortnight when I implement my 8x16-bit multiplier.

Saturday, March 16, 2019

Shifted Multiply

Adding numbers together in a loop works for small numbers but the execution time quickly increases as the multiplier increases. Another way of multiplying numbers is by shifting them. Each shift to the left is the equivalent of multiplying by 2. This means that if we shift to the lefft 3 times we have effectively multiplied by 8. The problem is that this only works with powers of two. When you think about it, though, 5x3 is the same as 5x2+5x1. Likewise, 7x6 is just 7x4+7x2.

This means that we cam drastically improve multiplication by simply looping through the bits shifting the multiplier to the right and seeing if it is set. If it is we add the current value of the shifted multiplicand to the result. We then shifting the multiplicand to the left to multiply it by the next power. This is repeated until the multiplier becomes 0. The code that does this is as follows:

; brute force 8-bit multiplication
JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
multiplierLo: 
.BYTE 0
;
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply8x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
; start result at zero
LDA #0
multiply8x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply8x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply8x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
multiply8x8_adjust:
; Set up next power of 2
ASL multiplicandLo
JMP multiply8x8_loop
multiply8x8_done:
RTS

test:
; fourty-two test
LDA #7
LDX #6
JSR multiply8x8
; 10*20 test
LDA #10
LDX #20
JSR multiply8x8
; LCG Mulitply step test
LDA #42
LDX #141
JSR multiply8x8
NOP
BRK

When ran, we get the following trace for the first test. The second and third tests are pretty similar.

(0) 0:JMP $22
(3) 22:LDA #$7 ; A:0 -> 7
(5) 24:LDX #$6 ; X:0 -> 6
(7) 26:JSR $5
(13) 5:STX $4 ; [4]:0 -> 6
(17) 8:STA $3 ; [3]:0 -> 7
(21) b:LDA #$0 ; A:7 -> 0
(23) d:LDX $4 ; X:6 -> 6
(27) 10:BEQ $21
(29) 12:LSR $4 ; [4]:6 -> 3
(35) 15:BCC $1B
(38) 1b:ASL $3 ; [3]:7 -> $E
(44) 1e:JMP $D
(47) d:LDX $4 ; X:6 -> 3
(51) 10:BEQ $21
(53) 12:LSR $4 ; [4]:3 -> 1, C=1
(59) 15:BCC $1B
(61) 17:CLC ; C=0
(63) 18:ADC $3 ; A:0 -> $E
(67) 1b:ASL $3 ; [3]:$E -> 1C
(73) 1e:JMP $D
(76) d:LDX $4 ; X:3 -> 1
(80) 10:BEQ $21
(82) 12:LSR $4 ; [4]:1->0, C=1
(88) 15:BCC $1B
(90) 17:CLC ; C=0
(92) 18:ADC $3 ; A:$E -> 2A
(96) 1b:ASL $3 ; [3]:$1C -> $38
(102) 1e:JMP $D
(105) d:LDX $4 ; X:1 -> 0, Z=1
(109) 10:BEQ $21
(112) 21:RTS 

The call starts at 3 and runs until 118 for a 115 cycle time. This is slightly worse than the brute force way, but if we continued with the other two tests we would find that the 10x20 test takes 163 cycles which is significantly faster than the 272 cycles for the brute force multiply. And the 42x141 test only took 245 cycles compared to 1845 cycles that the brute force method required.

We now have everything necessary to create a LCG random number generator, which we will be doing next fortnight.

Saturday, March 2, 2019

Brute-forcing 8-bit multiplication

As the 6502 does not have instructions for multiplication, if we want to do any multiplication then we need to write our own routine to multiply numbers. There is a very easy way to do this, but it is not efficient. Still, lets take a look at the obvious brute force method before working out a more efficient way of doing this.

In the discrete math courses I took, multiplication was defined as adding a number which is called the multiplicand by itself the number of times specified by another number which is called the multiplier. We have an add instruction and we have looping constructs to repeat the addition a number of times. To make this a proper function we need to have a way of getting information into the function and returning the results. In this case, we have two bytes so we will have the multiplicand in the accumulator and the multiplier in x returning the result in the accumulator.

The following code is a very rough way of doing this but is not what I would consider production code which would check for the 0 case, and would have a nicer loop. As I am planning on writing a better version, this will do for demonstration purposes.

; brute force 8-bit multiplication
JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
;
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply:
STA multiplicandLo
multiplyLoop:
DEX
BEQ multiplyDone
CLC
ADC multiplicand
JMP multiplyLoop
multiplyDone:
RTS

test:
; fourty-two test
LDA #7
LDX #6
JSR multiply
; 10*20 test
LDA #10
LDX #20
JSR multiply
BRK


The trace results shows us pretty much what we would expect. The first test does take a while to run as it starts at cycle 3 and returns at cycle 93 (the instruction after the RTS) giving us a 90 cycle run time.

(0) 0:JMP $12
(3) 12:LDA #$7 ' A: 0 -> 7
(5) 14:LDX #$6 ' X: 0 -> 6
(7) 16:JSR $4# 
(13) 4:STA $3# ' [3]: 0 -> 7
(17) 7:DEX ' X:6 -> 5
(19) 8:BEQ $11
(21) a:CLC 
(23) b:ADC $3 ' A: 7 -> $E
(27) e:JMP $7
(30) 7:DEX ' X: 5 -> 4
(32) 8:BEQ $11
(34) a:CLC
(36) b:ADC $3 ' A: $E -> $15
(40) e:JMP $7
...
Additional loops removed as you get the idea!
...
(75) b:ADC $3 ' A: $23 -> $2A
(79) e:JMP $7
(82) 7:DEX ' X: 1 -> 0; Z set
(84) 8:BEQ $11
(87) 11:RTS

A slightly larger 10 * 20 multiply, however, does not give us that good of results.

(93) 19:LDA #$A ' A: $2A -> $A 
(95) 1b:LDX #$14 ' X: 0 -> $14
(97) 1d:JSR $4
(103) 4:STA $3 ' [3] 7 -> $A
(107) 7:DEX ' X: $14 -> $13
(109) 8:BEQ $11
(111) a:CLC 
(113) b:ADC $3 ' A: $A -> $14
...
Additional loops removed as you get the idea!
...
(345) a:CLC
(347) b:ADC $3 ' A: $BE -> $C8
(351) e:JMP $7
(354) 7:DEX ' X: 1 -> 0; Z set
(356) 8:BEQ $11
(359) 11:RTS
(365) BRK

The time for this was 272 cycles which is tolerable but getting bad. For our random number generator we need to multiply by 141. This would result in 1845 cycles to perform the multiplication. Not a good result at all. Obviously we need to come up with a better way of handling multiplication. This will be covered next fortnight.