Saturday, April 27, 2019

Building a 16-bit Linear Congruential Generator

The 16 bit seed version of the Linear Congruential Generator (LCG) is very similar to the 8 bit version in that every time it is called it returns a byte. The big difference is that the seed is 16 bits instead of 8 bits so we will have a longer time before numbers start repeating and a bit more randomness. To get the longer cycle (and more randomness) we need to use the upper byte of the results.

For those of you who haven't read the earlier blog entries in this series or who want a quick refresher, LCG uses the formula $S_n = A*S_{n-1}+C$ for getting the next seed so we are simply multiplying the seed by a constant A and then adding a constant C to the result. We will use 141
for A and 3 for C though other carefully chosen values could be used.

We already have the multiplication which we wrote last fortnight, so now all we have to do is use that multiplication routine and then add a constant to that value. This is very simple code as you can see.

; set up our constants for A and C
.EQU LCG_A 141
.EQU LCG_C 3
; create placeholder for RNG seed
seedLo: .BYTE 0
seedHi: .BYTE 0
; ...
; multiply16x8 setup and code would be put here
; ...

lcg16bit:
; load in the current value of the seed into the multiply registers
LDA seedLo
LDY seedHi
; now multiply by the A constant
LDX #LCG_A
JSR multiply16x8
; add the C constant to the reult
CLC
ADC #LCG_C
BCC lcg16bit_update_seed
INY
lcg16bit_update_seed:
; save the results in the seed
STA seedLo
STY seedHi
; put results in accumulator for return value
TYA
RTS

To test this simply call the random number generator 257 times and look at the sequence of numbers that are generated.

test:
JSR lcg16bit
; now lets do this 256 more times
LDX #0
test_loop:
TXA
PHA
JSR lcg16bit
PLA
TAX
DEX
BNE test_loop
NOP ; using this as easy to find separator
BRK

While this does demonstrate that it is working and gives us a longer sequence than the first method, it is not a very good test. For our needs we don't need statistically sound random number generation, but a simple test like generating randograms would be sufficient for getting a rough idea of how the generator was working. Randograms were covered in an earlier blog post but are essentially a 256 * 256 grid that you plot by taking two random numbers and plotting the point on the graph. This is a good idea but not really implementable on a 64k machine as the minimum size for holding the grid is 65536 bytes and we would still need some way to save the data. However, we have our own emulator so could cheat a bit to get the result. More on this next fortnight.

Saturday, April 13, 2019

Multi-byte Multiplication 8x16

When you get right down to it, there is an interesting property about the number of bits involved in a multiplication. The maximum value for a multiplication is the number of bits in the multiplicand plus the number of bits in the multiplier. This means that our 8x8 multiplier should produce a 16 bit number. Likewise a 16x16 bit multiply would properly need 32 bits to store the results. For my needs we don't care about overflow situations as for the random number generator but if you wanted proper accuracy chaining more bytes to have larger results is certainly something that could easily be done. What we need to have a 16 bit seed is to multiply a 16 bit number by an 8 bit number.

This is actually not that much different than the 8x8 multiplier that we created earlier. There are two big differences. First, we are going to need to shift a 16 bit number. This can be done simply enough using the carry flag just as was done with the multi-byte additions. The shift left would then look like:

CLC
ROL LOW
ROL HIGH

This would work just fine, but we can save a byte and a few cycles by getting rid of the CLC instruction and instead using the ASL instruction in the place of the first ROL instruction. This always rotates in a 0 while still putting the shifted out bit into the carry flag.

The second issue is how to deal with a 16-bit result. There are two solutions to this problem. We could store the results in some reserved memory (ideally in page 0) which if we were going to output a 24 bit number would be a requirement. With a little bit of register shifting, we can actually keep the results of the multiplication in the registers winch gives us a bit of a speed boost as well as reducing the bytes needed for the program and memory storage. The new multiplication routine now looks like this:

JMP test
; create a placeholder for the multiplicand
multiplicandLo: 
.BYTE 0
multiplicandHi: 
.BYTE 0
multiplierLo: 
.BYTE 0
;
; Multiply method 
; A = multiplicand lo, Y = multiplicand Hi, X = multiplier
; result returned in A (lo) and Y (hi)
multiply16x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
STY multiplicandHi
; start result at zero
LDA #0
TAY
multiply16x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply16x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply16x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
TAX ; save low result in X register 
TYA ; transfer high result to Acc for second half of add
ADC multiplicandHi
TAY ; return high result to Y
TXA ; return low result to Acc
multiply16x8_adjust:
; Set up next power of 2
ASL multiplicandLo
ROL multiplicandHi
JMP multiply16x8_loop
multiply16x8_done:
RTS

test:

; 300 x 5 test
LDA #$2C
LDY #$1
LDX #$5
JSR multiply16x8

; fourty-two test
LDA #7
LDY #0
LDX #6
JSR multiply16x8

; LCG Mulitply step test
LDA #42
LDY #0
LDX #141
JSR multiply16x8

NOP
BRK

Here is a trace of the first test.

(0) 0:JMP $31
(3) 31:LDA #$2C ; A:$0 -> $2c
(5) 33:LDY #$1 ; Y:0 -> $1
(7) 35:LDX #$5 ; X:0 -> $5
(9) 37:JSR $6
(15) 6:STX $5 ; [5]:-> $5
(19) 9:STA $3 ; [3]:-> $2c
(23) c:STY $4 ; [4]:-> 1
(27) f:LDA #$0  ; A:$44 -> $0, Flags:32 -> $22
(29) 11:TAY ; Y:1 -> $0
(31) 12:LDX $5 ; X:5 -> 5
(35) 15:BEQ $30
(37) 17:LSR $5 ; [5]:$5 -> $2, C=1
(43) 1a:BCC $27
(45) 1c:CLC ; C=0
(47) 1d:ADC $3 ; A:$0 -> $2c
(51) 20:TAX ; X:5 -> $2c
(53) 21:TYA  ; A:$2C -> $0
(55) 22:ADC $4 ; A:$0 -> $1
(59) 25:TAY ; Y:0 -> $1
(61) 26:TXA  ; A:$1 -> $2c
(63) 27:ASL $3 ; [3]:$2C -> $58
(69) 2a:ROL $4 ; [4]:$1 -> $2
(75) 2d:JMP $12
(78) 12:LDX $5 ; X:$2c -> $2
(82) 15:BEQ $30
(84) 17:LSR $5 ; [4]:$2 -> $1
(90) 1a:BCC $27
(93) 27:ASL $3 ; [3]:$58 -> $B0
(99) 2a:ROL $4 ; [4]:$2 -> 4
(105) 2d:JMP $12
(108) 12:LDX $5, X:$2 -> $1
(112) 15:BEQ $30
(114) 17:LSR $5 ; [5]:1 -> 0, C=1
(120) 1a:BCC $27
(122) 1c:CLC ; C=0
(124) 1d:ADC $3 ; A:$2c -> $dc
(128) 20:TAX ; X:1 -> $dc
(130) 21:TYA  ; A:$dc -> $1
(132) 22:ADC $4 ; A:$1 -> $5
(136) 25:TAY ; Y:1 -> $5
(138) 26:TXA  ; A:$5 -> $dc
(140) 27:ASL $3
(146) 2a:ROL $4
(152) 2d:JMP $12
(155) 12:LDX $5, X:$dc -> $0, Z=1
(159) 15:BEQ $30
(162) 30:RTS 

As you can see that this works very similar to the original version with only a few changes. A better 8x8 multiply that returns a 16 bit result would be easy to do by simply using this version but always setting Y to 0 when calling. We now have a multiply routine that will allow us to implement both the 16-bit seed version of the LCG algorithm as well as the PCG algorithm. We will start implementing those next fortnight!