As I stated in last year's review of the PCG paper, randograms are an informal way of visualizing random number generators: you create a 256x256 grid and plot 32768 pairs of numbers as points on it. The drawing code was given in that previous article and is not difficult, so I will omit it here. Since the numbers will be coming from assembly language code, we need an RNG component that supplies the randogram with its data. The RNG will simply hold an array of 65536 elements that gets looped through, and our 6502 interpreter will fill out this array when it is run. This is pretty trivial code:

```kotlin
open class RNG {
    val randomBytes = arrayListOf<Int>()
    var indx = 0

    init {
        for (i in 0..65535) {
            randomBytes.add(0)
        }
    }

    open fun nextUByte(): Int {
        val temp = randomBytes[indx]
        ++indx
        if (indx > 65535)
            indx = 0
        return temp and 255
    }
}
```

We are now ready to build our randogram emulator's main display. This version is written using JavaFX but could have been done in JavaScript. It would be nice if there were a standard cross-platform GUI library to use, and there are some third-party attempts at such a beast. Until there is a clear winner in this area, I will use JavaFX and plan on porting any GUI code to JavaScript when I am ready to get the JS version of the emulator running. The GUI code simply sets up a canvas to draw the randogram on and a button for loading an assembly language file to be processed.

```kotlin
override fun start(primaryStage: Stage?) {
    randogram.generate()
    randogram.drawToCanvas(canvas, 2.0)
    val btn = Button("Test Assembly")
    btn.setOnAction { handleLoadAsm() }
    val root = VBox(10.0, canvas, btn)
    val scene = Scene(root, 800.0, 600.0)
    primaryStage!!.title = "JavaFX test"
    primaryStage.scene = scene
    primaryStage.show()
}
```

The button handler is where the real work takes place. We set up the emulator environment, then bring up a file dialog for choosing the assembly language file to process. As this is an internal tool, we do no sanity checking on the contents of the file; we simply call our assembler class and have it assemble the file. A proper program would check the assembler's results to make sure the file assembled correctly, but this is a quick and dirty test so that will not be done.

The code for running the machine language is very simple: we just run instructions until we hit a break instruction. The twist, and where we are cheating big time, is that when we run into a NOP instruction, we handle it by taking the value of the accumulator and adding it to the RNG instance's list of numbers. Once we have written a proper assembly testing framework, this will result in 65536 numbers being generated, which will fill out the list.

Once the assembly language code has run, we generate the randogram and update the display. We now have a way of testing our assembly language random number generators.

```kotlin
private fun handleLoadAsm() {
    val memoryManager = Cartridge()
    val m6502 = M6502(memoryManager)
    val byteData = ByteArray(4096)
    val fileChooser = FileChooser()
    val assemblyFile = fileChooser.showOpenDialog(null)
    if (assemblyFile != null && assemblyFile.isFile) {
        val assemblyList: ArrayList<String> = ArrayList(assemblyFile.readLines())
        val assembler = Assembler(m6502)
        assembler.assembleProgram(assemblyList)
        for (cntrRom in 0 until assembler.currentBank.size) {
            byteData[cntrRom] = assembler.currentBank.readBankAddress(cntrRom).toByte()
            memoryManager.write(cntrRom, assembler.currentBank.readBankAddress(cntrRom))
        }
        m6502.state.ip = 0
        var ipAddress = m6502.state.ip
        var indx = 0
        while (memoryManager.read(ipAddress) != 0) {
            // cheat: a NOP (0xEA) means the accumulator holds a new random byte
            if (memoryManager.read(ipAddress) == 0xEA) {
                rng.randomBytes[indx] = m6502.state.acc
                ++indx
            }
            m6502.step()
            ipAddress = m6502.state.ip
        }
    }
    randogram.clear()
    randogram.generate()
    randogram.drawToCanvas(canvas, 2.0)
}
```

The problem we now have is that we need to write a testing framework for actually testing our random number generators. This will be done next fortnight.

## Saturday, May 11, 2019

## Saturday, April 27, 2019

### Building a 16-bit Linear Congruential Generator

The 16-bit seed version of the Linear Congruential Generator (LCG) is very similar to the 8-bit version in that every time it is called it returns a byte. The big difference is that the seed is 16 bits instead of 8, so it takes longer before numbers start repeating and we get a bit more randomness. To get the longer cycle (and more randomness) we need to return the upper byte of the result.

For those of you who haven't read the earlier blog entries in this series or who want a quick refresher, LCG uses the formula $S_n = A*S_{n-1}+C$ for getting the next seed, so we are simply multiplying the seed by a constant A and then adding a constant C to the result. We will use 141 for A and 3 for C, though other carefully chosen values could be used.
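To make the update step concrete, here is a small Kotlin sketch of the same formula (a hypothetical model for checking results, not part of the emulator), using the A = 141 and C = 3 constants and returning the upper byte as described above:

```kotlin
// Hypothetical Kotlin model of the 16-bit LCG step; not emulator code.
const val LCG_A = 141
const val LCG_C = 3

var seed = 0  // 16-bit seed kept in an Int, masked after each step

// Advance the seed and return its upper byte as the random value.
fun lcg16(): Int {
    seed = (LCG_A * seed + LCG_C) and 0xFFFF
    return (seed shr 8) and 0xFF
}
```

These particular constants satisfy the Hull-Dobell conditions (C is odd, and A-1 = 140 is divisible by 4), so the seed sequence visits all 65536 values before repeating.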

We already have the multiplication routine, which we wrote last fortnight, so now all we have to do is call it and then add a constant to the result. This is very simple code, as you can see.

```asm
; set up our constants for A and C
.EQU LCG_A 141
.EQU LCG_C 3

; create placeholder for RNG seed
seedLo: .BYTE 0
seedHi: .BYTE 0

; ...
; multiply16x8 setup and code would be put here
; ...

lcg16bit:
    ; load the current value of the seed into the multiply registers
    LDA seedLo
    LDY seedHi
    ; now multiply by the A constant
    LDX #LCG_A
    JSR multiply16x8
    ; add the C constant to the result
    CLC
    ADC #LCG_C
    BCC lcg16bit_update_seed
    INY
lcg16bit_update_seed:
    ; save the results in the seed
    STA seedLo
    STY seedHi
    ; put results in accumulator for return value
    TYA
    RTS
```

To test this, simply call the random number generator 257 times and look at the sequence of numbers that is generated.

```asm
test:
    JSR lcg16bit
    ; now let's do this 256 more times
    LDX #0
test_loop:
    TXA
    PHA
    JSR lcg16bit
    PLA
    TAX
    DEX
    BNE test_loop
    NOP     ; using this as an easy-to-find separator
    BRK
```

While this does demonstrate that it is working and gives us a longer sequence than the first method, it is not a very good test. For our needs we don't require statistically sound random number generation; a simple test like generating randograms would be sufficient for getting a rough idea of how the generator is working. Randograms were covered in an earlier blog post but are essentially a 256x256 grid that you plot by taking pairs of random numbers and plotting them as points. This is a good idea but not really implementable on a 64K machine, as the minimum size for holding the grid is 65536 bytes and we would still need some way to save the data. However, we have our own emulator, so we can cheat a bit to get the result. More on this next fortnight.
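As a sketch of what the randogram plotting amounts to, here is the idea in Kotlin (names are illustrative, not from the emulator; `nextByte` stands in for whatever generator is being tested):

```kotlin
// Illustrative sketch: plot 32768 pairs of bytes onto a 256x256 grid.
fun generateRandogram(nextByte: () -> Int): Array<BooleanArray> {
    val grid = Array(256) { BooleanArray(256) }
    repeat(32768) {
        val x = nextByte() and 0xFF   // first byte of the pair
        val y = nextByte() and 0xFF   // second byte of the pair
        grid[y][x] = true             // mark the point
    }
    return grid
}
```

The grid alone needs 65536 cells, which is why building it on the 64K machine itself is impractical and cheating through the emulator makes sense.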


## Saturday, April 13, 2019

### Multi-byte Multiplication 8x16

When you get right down to it, there is an interesting property about the number of bits involved in a multiplication: the product can need as many bits as the number of bits in the multiplicand plus the number of bits in the multiplier. This means that our 8x8 multiply can produce a 16-bit number (for example, 255 x 255 = 65025, which just fits in 16 bits). Likewise, a 16x16 multiply would properly need 32 bits to store the result. For the random number generator we don't care about overflow situations, but if you wanted proper accuracy, chaining more bytes to hold larger results could easily be done. What we need for a 16-bit seed is to multiply a 16-bit number by an 8-bit number.

This is actually not that much different from the 8x8 multiplier that we created earlier. There are two big differences. First, we are going to need to shift a 16-bit number. This can be done simply enough using the carry flag, just as was done with the multi-byte additions. The shift left would then look like:

```asm
CLC
ROL LOW
ROL HIGH
```

This would work just fine, but we can save a byte and a few cycles by getting rid of the CLC instruction and using the ASL instruction in place of the first ROL. ASL always shifts in a 0 while still putting the shifted-out bit into the carry flag.
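To make the carry behavior concrete, here is a small Kotlin model of the ASL/ROL pair (an illustration, not emulator code):

```kotlin
// Model of ASL low + ROL high doubling a 16-bit value held in two bytes.
fun shiftLeft16(lo: Int, hi: Int): Pair<Int, Int> {
    val carry = (lo shr 7) and 1                 // bit 7 of the low byte goes to carry
    val newLo = (lo shl 1) and 0xFF              // ASL always shifts a 0 into bit 0
    val newHi = ((hi shl 1) or carry) and 0xFF   // ROL shifts the carry into bit 0
    return Pair(newLo, newHi)
}
```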

The second issue is how to deal with a 16-bit result. There are two solutions to this problem. We could store the result in some reserved memory (ideally in page 0), which would be a requirement if we were going to output a 24-bit number. But with a little bit of register shuffling, we can actually keep the result of the multiplication in the registers, which gives us a bit of a speed boost as well as reducing the bytes needed for the program and memory storage. The new multiplication routine now looks like this:

```asm
JMP test

; create a placeholder for the multiplicand
multiplicandLo:
    .BYTE 0
multiplicandHi:
    .BYTE 0
multiplierLo:
    .BYTE 0

;
; Multiply method
; A = multiplicand lo, Y = multiplicand hi, X = multiplier
; result returned in A (lo) and Y (hi)
multiply16x8:
    ; init by setting zero page vars
    STX multiplierLo
    STA multiplicandLo
    STY multiplicandHi
    ; start result at zero
    LDA #0
    TAY
multiply16x8_loop:
    ; see if still bits in multiplier
    LDX multiplierLo
    BEQ multiply16x8_done
    ; shift multiplier right. C set indicates need to add
    LSR multiplierLo
    BCC multiply16x8_adjust
    ; add current power of 2 multiplicand
    CLC
    ADC multiplicandLo
    TAX     ; save low result in X register
    TYA     ; transfer high result to Acc for second half of add
    ADC multiplicandHi
    TAY     ; return high result to Y
    TXA     ; return low result to Acc
multiply16x8_adjust:
    ; set up next power of 2
    ASL multiplicandLo
    ROL multiplicandHi
    JMP multiply16x8_loop
multiply16x8_done:
    RTS

test:
    ; 300 x 5 test
    LDA #$2C
    LDY #$1
    LDX #$5
    JSR multiply16x8
    ; forty-two test
    LDA #7
    LDY #0
    LDX #6
    JSR multiply16x8
    ; LCG multiply step test
    LDA #42
    LDY #0
    LDX #141
    JSR multiply16x8
    NOP
    BRK
```

Here is a trace of the first test.

```
(0) 0:JMP $31
(3) 31:LDA #$2C ; A:$0 -> $2c
(5) 33:LDY #$1 ; Y:0 -> $1
(7) 35:LDX #$5 ; X:0 -> $5
(9) 37:JSR $6
(15) 6:STX $5 ; [5]:-> $5
(19) 9:STA $3 ; [3]:-> $2c
(23) c:STY $4 ; [4]:-> 1
(27) f:LDA #$0 ; A:$44 -> $0, Flags:32 -> $22
(29) 11:TAY ; Y:1 -> $0
(31) 12:LDX $5 ; X:5 -> 5
(35) 15:BEQ $30
(37) 17:LSR $5 ; [5]:$5 -> $2, C=1
(43) 1a:BCC $27
(45) 1c:CLC ; C=0
(47) 1d:ADC $3 ; A:$0 -> $2c
(51) 20:TAX ; X:5 -> $2c
(53) 21:TYA ; A:$2C -> $0
(55) 22:ADC $4 ; A:$0 -> $1
(59) 25:TAY ; Y:0 -> $1
(61) 26:TXA ; A:$1 -> $2c
(63) 27:ASL $3 ; [3]:$2C -> $58
(69) 2a:ROL $4 ; [4]:$1 -> $2
(75) 2d:JMP $12
(78) 12:LDX $5 ; X:$2c -> $2
(82) 15:BEQ $30
(84) 17:LSR $5 ; [5]:$2 -> $1
(90) 1a:BCC $27
(93) 27:ASL $3 ; [3]:$58 -> $B0
(99) 2a:ROL $4 ; [4]:$2 -> 4
(105) 2d:JMP $12
(108) 12:LDX $5 ; X:$2 -> $1
(112) 15:BEQ $30
(114) 17:LSR $5 ; [5]:1 -> 0, C=1
(120) 1a:BCC $27
(122) 1c:CLC ; C=0
(124) 1d:ADC $3 ; A:$2c -> $dc
(128) 20:TAX ; X:1 -> $dc
(130) 21:TYA ; A:$dc -> $1
(132) 22:ADC $4 ; A:$1 -> $5
(136) 25:TAY ; Y:1 -> $5
(138) 26:TXA ; A:$5 -> $dc
(140) 27:ASL $3
(146) 2a:ROL $4
(152) 2d:JMP $12
(155) 12:LDX $5 ; X:$dc -> $0, Z=1
(159) 15:BEQ $30
(162) 30:RTS
```

As you can see, this works very similarly to the original version, with only a few changes. A better 8x8 multiply that returns a 16-bit result would be easy to create by using this version and always setting Y to 0 when calling it. We now have a multiply routine that will allow us to implement both the 16-bit seed version of the LCG algorithm and the PCG algorithm. We will start implementing those next fortnight!
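For reference, the shift-and-add loop above behaves like this Kotlin sketch (a model for checking results against the assembly, with the same truncation of the result to 16 bits):

```kotlin
// Model of multiply16x8: 16-bit multiplicand times 8-bit multiplier,
// result truncated to 16 bits and returned as (lo, hi) bytes.
fun multiply16x8(lo: Int, hi: Int, mult: Int): Pair<Int, Int> {
    var multiplicand = ((hi shl 8) or lo) and 0xFFFF
    var multiplier = mult and 0xFF
    var result = 0
    while (multiplier != 0) {
        if (multiplier and 1 == 1)                      // carry set after LSR means add
            result = (result + multiplicand) and 0xFFFF
        multiplier = multiplier shr 1                   // LSR multiplierLo
        multiplicand = (multiplicand shl 1) and 0xFFFF  // the ASL/ROL pair
    }
    return Pair(result and 0xFF, (result shr 8) and 0xFF)
}
```

Running the 300 x 5 test through this model gives lo $DC and hi $5, matching the A and Y values at the end of the trace.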


## Saturday, March 30, 2019

### Multi-byte Numbers

Just because the 6502 is an 8-bit processor does not mean that it is limited to numbers that fit in 8 bits. You can have as many bits as your memory will support, though dealing with larger numbers does add a lot of work for the program. The purpose of the carry flag is dealing with multi-byte numbers. The big issue is that you can only work with 8 bits of the number at a time. While you can use the X and Y registers as temporary holders for parts of the larger number, you most likely will want to store that number somewhere in memory.

There are many ways of storing a multi-byte number, but the two most common formats are big-endian and little-endian, named in reference to Gulliver's Travels, where groups of people fought over whether it was proper to crack a soft-boiled egg from the big end (the big-endians) or the little end (the little-endians). CPU makers have been having a similar fight over whether the high-order byte should be stored at the lower address followed by the lower-order byte(s), which we call big-endian, or the lowest-order byte first followed by the higher bytes, which we call little-endian.

The 6502 is considered to be a little-endian architecture due to the way memory addresses are stored. This is not necessarily how you need to store multi-byte numbers in memory, but this tends to be the convention for 6502 programmers. However, since the registers are only 8 bits, it really does not matter what convention you use as long as you are consistent. Here is a sample of doing a 16-bit addition.

```asm
JMP test

; Big Endian variables
BE_NumberAHi: .BYTE $3
BE_NumberALo: .BYTE $E8
BE_NumberBHi: .BYTE $3
BE_NumberBLo: .BYTE $E8

; Little Endian variables
LE_NumberALo: .BYTE $E8
LE_NumberAHi: .BYTE $3
LE_NumberBLo: .BYTE $E8
LE_NumberBHi: .BYTE $3

test:
    ; Big endian A = A + B
    LDA BE_NumberALo
    CLC
    ADC BE_NumberBLo    ; add low byte
    STA BE_NumberALo
    LDA BE_NumberAHi    ; add high byte
    ADC BE_NumberBHi    ; using carry from prev add
    STA BE_NumberAHi

    ; Little endian A = A + B
    LDA LE_NumberALo
    CLC
    ADC LE_NumberBLo    ; add low byte
    STA LE_NumberALo
    LDA LE_NumberAHi    ; add high byte
    ADC LE_NumberBHi    ; using carry from prev add
    STA LE_NumberAHi

    NOP
    BRK
```

As you can see, there is no real difference in which order you store the data, though this is not the case with processors that have larger registers. Multi-byte math takes advantage of the carry flag: the ADC instruction adds an additional 1 to the result if the carry flag is set. This is why you clear the carry flag before adding the low byte and then add the higher bytes in sequence. Note that you can have as many bytes as you want in your number as long as you continue to chain the bytes.
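The chaining rule generalizes to any width. Here is a Kotlin sketch of the idea (illustrative names, little-endian byte arrays, one Int per byte):

```kotlin
// Add two multi-byte numbers stored as little-endian byte arrays,
// carrying between bytes exactly as the CLC/ADC chain does.
fun addBytes(a: IntArray, b: IntArray): IntArray {
    val out = IntArray(a.size)
    var carry = 0                          // CLC before the first ADC
    for (i in a.indices) {                 // low byte first
        val sum = a[i] + b[i] + carry
        out[i] = sum and 0xFF
        carry = if (sum > 0xFF) 1 else 0   // the next ADC adds this back in
    }
    return out
}
```

With the $3E8 values from the sample above, adding 1000 + 1000 produces the bytes $D0, $07, i.e. 2000.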

The same procedure for multi-byte addition works for rotation and subtraction, though with subtraction you need to set the carry flag (SEC) not clear it. We will be seeing a demo of multi-byte rotation next fortnight when I implement my 8x16-bit multiplier.


## Saturday, March 16, 2019

### Shifted Multiply

Adding numbers together in a loop works for small numbers, but the execution time quickly increases as the multiplier grows. Another way of multiplying numbers is by shifting them. Each shift to the left is the equivalent of multiplying by 2, so if we shift to the left 3 times we have effectively multiplied by 8. The problem is that this only works with powers of two. When you think about it, though, 5x3 is the same as 5x2+5x1. Likewise, 7x6 is just 7x4+7x2.

This means that we can drastically improve multiplication by looping through the bits, shifting the multiplier to the right and seeing if the shifted-out bit is set. If it is, we add the current value of the shifted multiplicand to the result. We then shift the multiplicand to the left to multiply it by the next power of two. This is repeated until the multiplier becomes 0. The code that does this is as follows:

```asm
; shift-and-add 8-bit multiplication
JMP test

; create a placeholder for the multiplicand
multiplicandLo:
    .BYTE 0
multiplierLo:
    .BYTE 0

;
; Multiply method: A = multiplicand, X = multiplier, result returned in A
multiply8x8:
    ; init by setting zero page vars
    STX multiplierLo
    STA multiplicandLo
    ; start result at zero
    LDA #0
multiply8x8_loop:
    ; see if still bits in multiplier
    LDX multiplierLo
    BEQ multiply8x8_done
    ; shift multiplier right. C set indicates need to add
    LSR multiplierLo
    BCC multiply8x8_adjust
    ; add current power of 2 multiplicand
    CLC
    ADC multiplicandLo
multiply8x8_adjust:
    ; set up next power of 2
    ASL multiplicandLo
    JMP multiply8x8_loop
multiply8x8_done:
    RTS

test:
    ; forty-two test
    LDA #7
    LDX #6
    JSR multiply8x8
    ; 10*20 test
    LDA #10
    LDX #20
    JSR multiply8x8
    ; LCG multiply step test
    LDA #42
    LDX #141
    JSR multiply8x8
    NOP
    BRK
```

When run, we get the following trace for the first test. The second and third tests are pretty similar.

```
(0) 0:JMP $22
(3) 22:LDA #$7 ; A:0 -> 7
(5) 24:LDX #$6 ; X:0 -> 6
(7) 26:JSR $5
(13) 5:STX $4 ; [4]:0 -> 6
(17) 8:STA $3 ; [3]:0 -> 7
(21) b:LDA #$0 ; A:7 -> 0
(23) d:LDX $4 ; X:6 -> 6
(27) 10:BEQ $21
(29) 12:LSR $4 ; [4]:6 -> 3
(35) 15:BCC $1B
(38) 1b:ASL $3 ; [3]:7 -> $E
(44) 1e:JMP $D
(47) d:LDX $4 ; X:6 -> 3
(51) 10:BEQ $21
(53) 12:LSR $4 ; [4]:3 -> 1, C=1
(59) 15:BCC $1B
(61) 17:CLC ; C=0
(63) 18:ADC $3 ; A:0 -> $E
(67) 1b:ASL $3 ; [3]:$E -> 1C
(73) 1e:JMP $D
(76) d:LDX $4 ; X:3 -> 1
(80) 10:BEQ $21
(82) 12:LSR $4 ; [4]:1->0, C=1
(88) 15:BCC $1B
(90) 17:CLC ; C=0
(92) 18:ADC $3 ; A:$E -> 2A
(96) 1b:ASL $3 ; [3]:$1C -> $38
(102) 1e:JMP $D
(105) d:LDX $4 ; X:1 -> 0, Z=1
(109) 10:BEQ $21
(112) 21:RTS
```

The call starts at cycle 3 and runs until cycle 118, for a 115-cycle time. This is slightly worse than the brute-force way, but if we continued with the other two tests we would find that the 10x20 test takes 163 cycles, which is significantly faster than the 272 cycles for the brute-force multiply. And the 42x141 test takes only 245 cycles, compared to the 1845 cycles that the brute-force method required.

We now have everything necessary to create an LCG random number generator, which we will do next fortnight.


## Saturday, March 2, 2019

### Brute-forcing 8-bit multiplication

As the 6502 does not have instructions for multiplication, if we want to multiply numbers then we need to write our own routine. There is a very easy way to do this, but it is not efficient. Still, let's take a look at the obvious brute-force method before working out a more efficient way of doing this.

In the discrete math courses I took, multiplication was defined as adding a number, called the multiplicand, to itself the number of times specified by another number, called the multiplier. We have an add instruction, and we have looping constructs to repeat the addition a number of times. To make this a proper function we need a way of getting information into the function and returning the results. In this case we have two bytes, so we will pass the multiplicand in the accumulator and the multiplier in X, returning the result in the accumulator.

The following code is a very rough way of doing this and is not what I would consider production code, which would check for the 0 case and have a nicer loop. As I am planning on writing a better version, this will do for demonstration purposes.

```asm
; brute force 8-bit multiplication
JMP test

; create a placeholder for the multiplicand
multiplicandLo:
    .BYTE 0

;
; Multiply method: A = multiplicand, X = multiplier, result returned in A
multiply:
    STA multiplicandLo
multiplyLoop:
    DEX
    BEQ multiplyDone
    CLC
    ADC multiplicandLo
    JMP multiplyLoop
multiplyDone:
    RTS

test:
    ; forty-two test
    LDA #7
    LDX #6
    JSR multiply
    ; 10*20 test
    LDA #10
    LDX #20
    JSR multiply
    BRK
```

The trace results show us pretty much what we would expect. The first test does take a while to run, as it starts at cycle 3 and returns at cycle 93 (the instruction after the RTS), giving us a 90-cycle run time.

```
(0) 0:JMP $12
(3) 12:LDA #$7 ; A: 0 -> 7
(5) 14:LDX #$6 ; X: 0 -> 6
(7) 16:JSR $4
(13) 4:STA $3 ; [3]: 0 -> 7
(17) 7:DEX ; X:6 -> 5
(19) 8:BEQ $11
(21) a:CLC
(23) b:ADC $3 ; A: 7 -> $E
(27) e:JMP $7
(30) 7:DEX ; X: 5 -> 4
(32) 8:BEQ $11
(34) a:CLC
(36) b:ADC $3 ; A: $E -> $15
(40) e:JMP $7
...
Additional loops removed as you get the idea!
...
(75) b:ADC $3 ; A: $23 -> $2A
(79) e:JMP $7
(82) 7:DEX ; X: 1 -> 0; Z set
(84) 8:BEQ $11
(87) 11:RTS
```

A slightly larger 10x20 multiply, however, does not give us such good results.

```
(93) 19:LDA #$A ; A: $2A -> $A
(95) 1b:LDX #$14 ; X: 0 -> $14
(97) 1d:JSR $4
(103) 4:STA $3 ; [3]: 7 -> $A
(107) 7:DEX ; X: $14 -> $13
(109) 8:BEQ $11
(111) a:CLC
(113) b:ADC $3 ; A: $A -> $14
...
Additional loops removed as you get the idea!
...
(345) a:CLC
(347) b:ADC $3 ; A: $BE -> $C8
(351) e:JMP $7
(354) 7:DEX ; X: 1 -> 0; Z set
(356) 8:BEQ $11
(359) 11:RTS
(365) BRK
```

The time for this was 272 cycles, which is tolerable but getting bad. For our random number generator we need to multiply by 141, which would take roughly 1845 cycles. Not a good result at all. Obviously we need to come up with a better way of handling multiplication. This will be covered next fortnight.
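As a preview of where that better way is heading: the standard trick is shift-and-add multiplication, which examines the 8 bits of the multiplier instead of counting it down, so multiplying by 141 costs the same as multiplying by 2. A rough Kotlin sketch of the idea (my own model, not the assembly we will be writing):

```kotlin
// Shift-and-add 8-bit multiply: examine each bit of the multiplier,
// adding a shifted copy of the multiplicand whenever the bit is set.
// Always 8 iterations, regardless of the multiplier's value.
fun shiftAddMultiply(multiplicand: Int, multiplier: Int): Int {
    var product = 0
    var shifted = multiplicand and 255
    var bits = multiplier and 255
    repeat(8) {
        if (bits and 1 == 1) product += shifted
        shifted = shifted shl 1   // like ASL on the 6502
        bits = bits shr 1         // like LSR on the 6502
    }
    return product and 255        // keep the low byte, as our routine does
}

fun main() {
    println(shiftAddMultiply(7, 6))    // 42
    println(shiftAddMultiply(10, 20))  // 200
}
```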

## Saturday, February 16, 2019

### Building a Simple Trace Utility

One of the early development tools for assembly language programmers was the trace tool: a simple tool that you ran and it would print out what the program was doing at every step of the process. This type of tool could be considered the predecessor of the debugger, and my eventual plan is to turn this tool into one. For developing the random number generator libraries I don't need that much of a utility. All that is really needed is the ability to load an assembly language file, assemble it, then run it, printing each instruction being executed and the state of the registers.

Trace utilities can be thought of as an automated single-step through the program that runs until a break instruction is encountered. This is handy for seeing what is happening in non-linear programs such as the following demo:

LDA #0
LDX #3
loop:
CLC
ADC #3
DEX
BNE loop
BRK

While the above example may be fairly simple to figure out manually, seeing what is happening line by line can still be useful for debugging, as it is very easy to overlook something obvious, especially if it is a mistake that you have made yourself. It is also very handy for demonstrating what a piece of code is doing, which will be important for explaining the mathematical functions that I will be writing in the remainder of this chapter.

People who have been following this series from the beginning will remember that I said an emulator was essentially a disassembler that also had an interpreter attached to it. As we already have a disassembler, and we have an emulator that can run an individual instruction while storing the state of the registers in an accessible data class, the actual trace part of the trace utility is simply a matter of calling the disassembler to get the assembly instruction, running that step, then grabbing the register state and printing the results.

// run until break
m6502.state.ip = 0
var traceString = ""
var ipAddress = m6502.state.ip
while (memoryManager.read(ipAddress) != 0) {
    traceString = "(" + m6502.state.tick + ") " +
            ipAddress.toString(16) + ":" +
            m6502.disassembleStep(m6502.state.ip)
    m6502.step()
    ipAddress = m6502.state.ip
    traceString = traceString + "# A:" + m6502.state.acc.toString(16) +
            ", X:" + m6502.state.x.toString(16) +
            ", Y:" + m6502.state.y.toString(16) +
            ", Flags:" + m6502.state.flags.toString(16)
    println(traceString)
}

When this is run on our sample program, we get the following results.

(0) 0:LDA #$0# A:0, X:0, Y:0, Flags:22
(2) 2:LDX #$3# A:0, X:3, Y:0, Flags:20
(4) 4:CLC # A:0, X:3, Y:0, Flags:20
(6) 5:ADC #$3# A:3, X:3, Y:0, Flags:20
(8) 7:DEX # A:3, X:2, Y:0, Flags:20
(10) 8:BNE $4# A:3, X:2, Y:0, Flags:20
(13) 4:CLC # A:3, X:2, Y:0, Flags:20
(15) 5:ADC #$3# A:6, X:2, Y:0, Flags:20
(17) 7:DEX # A:6, X:1, Y:0, Flags:20
(19) 8:BNE $4# A:6, X:1, Y:0, Flags:20
(22) 4:CLC # A:6, X:1, Y:0, Flags:20
(24) 5:ADC #$3# A:9, X:1, Y:0, Flags:20
(26) 7:DEX # A:9, X:0, Y:0, Flags:22
(28) 8:BNE $4# A:9, X:0, Y:0, Flags:22

Not the prettiest of output, but that can easily be tweaked. So with a very tiny amount of work, we now have a useful tool for writing and testing chunks of assembly language. We are now ready to start building the math libraries that we will need for writing our random number generators. We will start covering multiplication next fortnight.
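As an example of such a tweak, the fields could be padded into columns with String.format. This formatter is purely hypothetical (the parameter names just mirror the state fields used above), but it shows the idea:

```kotlin
// Hypothetical formatter for one trace line: pads the cycle count,
// address, and disassembly into fixed-width columns and zero-pads
// the hex register values.
fun formatTraceLine(tick: Int, address: Int, disassembly: String,
                    acc: Int, x: Int, y: Int, flags: Int): String =
    String.format("(%5d) %04X: %-12s A:%02X X:%02X Y:%02X Flags:%02X",
        tick, address, disassembly, acc, x, y, flags)

fun main() {
    println(formatTraceLine(17, 7, "DEX", 7, 5, 0, 0x20))
}
```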

## Saturday, February 2, 2019

### Preparing for Random Assembly

In the last post we covered the basics of a linear congruential generator and the simplest form of the permuted congruential generator. Now I am going to write generic 6502 implementations of these generators. This will require covering a few topics that have not been covered yet, such as writing a simple trace utility so the results can be seen. This fortnight I will break down the topics that will be covered over the next few months.

The first step is to create a simple trace utility that will let you run some assembly language code, showing the affected registers and memory as the program runs. This is the equivalent of using a debugger to single-step through the code. One alternative would be to use an existing emulator that contains a debugger, but there are three reasons why I am not going this route. First, a trace utility would be fun to write. Second, I ultimately want my emulator to also be an IDE, so development tools like the trace utility would be great to have. Finally, going with an existing emulator would require writing for a specific platform, which at a minimum means a bunch of boilerplate code obfuscating the interesting code. Writing a trace utility should not take too long, probably only one or two posts.

For the pure 8-bit version of the LCG we need an 8-bit multiply, which is something the 6502 does not have. Software multiplication is possible, so we will look at a couple of ways of doing this before implementing our 8-bit LCG.

Before getting to the 16-bit version of the LCG we will have to take a look at how to perform multi-byte math, then expand the multiply routine to 16 bits, which is a bit of work.
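The key idea in multi-byte math is that the carry flag chains the byte-sized operations together. A quick Kotlin model of a 16-bit add built from two 8-bit adds (my own illustration of the technique, not the assembly to come):

```kotlin
// 16-bit addition performed as two 8-bit additions, the way the 6502's
// ADC chains them: the carry out of the low bytes feeds the high bytes.
fun add16(a: Int, b: Int): Int {
    val lowSum = (a and 255) + (b and 255)                       // CLC; ADC low bytes
    val carry = if (lowSum > 255) 1 else 0                       // carry flag set?
    val highSum = (a shr 8 and 255) + (b shr 8 and 255) + carry  // ADC high bytes
    return (highSum and 255 shl 8) or (lowSum and 255)
}

fun main() {
    println(add16(0x12FF, 0x0001).toString(16))  // 1300
}
```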

Finally, we will implement the 16-bit version of the LCG and the most basic version of PCG.

## Saturday, January 19, 2019

### Building PCG

An LCG basically uses the following calculation: X[n] = (A * X[n-1] + C) % M. X[n] is the next number in the sequence, A is a specially selected multiplier, C is a value to add to the generator, and M is the modulus or range of the generator. We will use Rotenberg's values of 141 for A and 3 for C, and since we are just using a byte, 256 for the modulus. Let's generate a randogram and see what we get.
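As a reference model, the 8-bit generator with these constants takes only a few lines of Kotlin (a sketch of the formula above, not the 6502 implementation we will eventually write):

```kotlin
// 8-bit LCG with Rotenberg's constants: next = (141 * state + 3) mod 256.
// "and 255" is the mod 256, which the 6502 gets for free by simply
// ignoring the high byte of the product.
class Lcg8(var state: Int = 1) {
    fun next(): Int {
        state = (141 * state + 3) and 255
        return state
    }
}

fun main() {
    val rng = Lcg8()
    // first few values of the sequence starting from state 1
    println(List(5) { rng.next() })
}
```

Since 141 is congruent to 1 mod 4 and 3 is odd, this generator has the full period of 256: every byte value appears exactly once before the sequence repeats.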

While this is not as good as we hoped, it is not as bad as it seems, since the order in which the dots appear is random. We are using an 8-bit generator, so what if we went with 16 bits? Obviously the low-order byte would remain the same, but what if we took higher bits for the output?

Interestingly enough, each bit you shift becomes more random, with different shifts having different coverage of the randogram. What if we added a permutation to select from some of the better shifts? If only we had a random way of selecting which shift to use. What about the top 2 bits (more for larger generators)? Since they have the most randomness, using them to drive the permutation may gain overall.
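This is easy to experiment with in Kotlin: keep 16 bits of state with the same constants (as noted above, the low byte then follows the 8-bit sequence exactly) and pick which 8 output bits to return with a shift:

```kotlin
// 16-bit LCG using the same constants as the 8-bit version.  The low
// byte follows the 8-bit sequence exactly; shifting the state right
// before masking selects progressively more random output bits.
class Lcg16(var state: Int = 1) {
    fun next(shift: Int): Int {
        state = (141 * state + 3) and 0xFFFF
        return (state shr shift) and 255
    }
}

fun main() {
    val low = Lcg16()
    val high = Lcg16()
    println(List(8) { low.next(0) })   // low byte: same as the 8-bit LCG
    println(List(8) { high.next(8) })  // top byte: better randogram coverage
}
```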

Melissa added additional permutations for rotation, XOR, and XORShift. More permutations are possible if you follow the rules:

- Bits can be thrown away but you can't make new ones (precious).
- Target and control groups of bits must be separate.
- Control bits must not be modified.
- Can do many rounds and different rounds can have different control bits.
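As a toy illustration of these rules (my own example, not one of the permutations from the paper), the top 2 bits of a 16-bit state can serve as unmodified control bits that select a rotation of 8 separate target bits:

```kotlin
// Toy output permutation following the rules above: the top 2 bits of
// a 16-bit state are control bits (kept separate and unmodified) that
// select how far to rotate the 8 target bits below them.
fun permuteOutput(state: Int): Int {
    val control = (state shr 14) and 0b11   // top 2 bits pick the rotation
    val target = (state shr 6) and 255      // 8 target bits, separate from control
    // rotate the target right by the control amount
    return ((target shr control) or (target shl (8 - control))) and 255
}

fun main() {
    // control bits 01 rotate the target byte 0x80 right by one place
    println(permuteOutput(0x6000).toString(16))  // 40
}
```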

By combining different permutations, it was possible to complete BigCrush using only a 36-bit PCG (Permuted Congruential Generator). So, in conclusion, PCG is a family of pseudo-random number generation algorithms. They are relatively simple to implement. They run fast, as the code is simple boolean operations, multiplications, and additions. They are space-efficient, requiring less than double the modulus size. And passing the TestU01 BigCrush tests shows they are statistically good.

## Saturday, January 5, 2019

### Testing Random Number Generators

There are a number of factors that people look for in a random number generator.

- Period - how long a generator can run before the entire sequence repeats.
- Uniformity - all the numbers within the range will appear.
- Predictability - how predictable the next number in the sequence is.
- Repeatability - can the same number repeat itself?
- Security - is the sequence invertible?
- Seekability - can we quickly skip over numbers?
- Streams - how usable is the generator when threading?
- k-dimensional equidistribution - how well distributed the numbers are across dimensions.

This diverse range of usage inevitably meant that there were a lot of different tests. L'Ecuyer and Simard made a huge contribution to testing in 2007 with the release of TestU01. They took a look at the myriad of tests that were available and created a test suite that includes a large number of previously independently published tests. They combined a number of different tests into batteries of tests (SmallCrush, BigCrush). Lots of the random number generators that were in common use at the time failed the test suite. The PCG paper covers testing of a large number of random number generators for those of you who are curious.

TestU01 is a C library with tests and generators for the most common PRNGs. This means that you need to write a simple C program to run the tests that you want. Here is the version of Java's random number generator that I wrote to test it. Note that the constants used are based on the API documentation for java.util.Random, which is the minimal generator TestU01 requires.

// Adapted from TestU01 manual, Figure 2.2, Figure 2.4

#include "TestU01.h"

// The LCG from java.util.Random; the (int) cast below keeps bits 16..47,
// the same bits Java's 48-bit seed mask and >>> 16 would produce.
static unsigned long x = 2463534242U;
static unsigned long a = 0x5DEECE66DL;
static unsigned long c = 11U;

unsigned int jvmlcg (void)
{
    x = x * a + c;
    return (int)(x >> 16);
}

int main()
{
    // Create TestU01 PRNG object for our generator
    unif01_Gen* gen = unif01_CreateExternGenBits("JVM LCG", jvmlcg);

    // Run the tests.
    //bbattery_SmallCrush(gen);
    bbattery_BigCrush(gen);

    // Clean up.
    unif01_DeleteExternGenBits(gen);
    return 0;
}

The Java test did not do very well on the BigCrush test. I ran the tests in an emulator running Linux so that I didn't need to set up MinGW on my system, as the libraries are for Unix. If you see a flaw in my program, feel free to email me. You can run the test for yourself, but to save you hours of waiting, here are the final test results:

========= Summary results of BigCrush =========

Version: TestU01 1.2.3
Generator: JVM LCG
Number of statistics: 160
Total CPU time: 08:55:08.77

The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):

Test p-value
----------------------------------------------
2 SerialOver, r = 22 eps
4 CollisionOver, t = 2 1 - eps1
6 CollisionOver, t = 3 1 - eps1
8 CollisionOver, t = 7 eps
10 CollisionOver, t = 14 eps
12 CollisionOver, t = 21 eps
13 BirthdaySpacings, t = 2 eps
14 BirthdaySpacings, t = 3 eps
15 BirthdaySpacings, t = 4 eps
16 BirthdaySpacings, t = 7 eps
17 BirthdaySpacings, t = 7 eps
18 BirthdaySpacings, t = 8 eps
19 BirthdaySpacings, t = 8 eps
20 BirthdaySpacings, t = 16 eps
21 BirthdaySpacings, t = 16 eps
22 ClosePairs mNP, t = 3 1.6e-26
22 ClosePairs mNP1, t = 3 1.3e-48
22 ClosePairs mNP2S, t = 3 eps
23 ClosePairs mNP, t = 5 3.4e-4
23 ClosePairs mNP1, t = 5 3.4e-4
23 ClosePairs mNP2S, t = 5 eps
24 ClosePairs mNP2S, t = 9 5.7e-46
25 ClosePairs mNP, t = 16 8.0e-30
25 ClosePairs mNP1, t = 16 3.7e-29
25 ClosePairs mNP2S, t = 16 1.1e-283
27 SimpPoker, r = 27 eps
29 SimpPoker, r = 25 eps
32 CouponCollector, r = 20 eps
33 CouponCollector, r = 27 eps
35 Gap, r = 25 eps
37 Gap, r = 20 eps
43 Permutation, t = 10 0.9997
58 AppearanceSpacings, r = 27 eps
60 WeightDistrib, r = 20 eps
61 WeightDistrib, r = 28 eps
64 WeightDistrib, r = 26 eps
67 MatrixRank, L=30, r=26 7.9e-10
75 RandomWalk1 H (L=50, r=25) eps
75 RandomWalk1 M (L=50, r=25) eps
75 RandomWalk1 J (L=50, r=25) eps
75 RandomWalk1 R (L=50, r=25) eps
75 RandomWalk1 C (L=50, r=25) eps
85 Fourier3, r = 27 eps
87 LongestHeadRun, r = 27 eps
87 LongestHeadRun, r = 27 1 - eps1
89 PeriodsInStrings, r = 20 1 - eps1
91 HammingWeight2, r = 27 5.0e-13
96 HammingIndep, L=30, r=27 eps
98 HammingIndep, L=300, r=26 eps
100 HammingIndep, L=1200, r=25 eps
102 Run of bits, r = 27 eps
106 AutoCor, d=3, r=27 2.6e-7
----------------------------------------------
All other tests were passed

One interesting observation that Melissa made in her study is that XORShift did poorly in the Crush tests but its companion XORShift* did exceptionally well. This is a surprise when you consider that XORShift and XORShift* use the same shift-and-XOR operations. The only difference is that XORShift* adds a multiplication at the end. Anybody familiar with random number generators will know that linear congruential generators (LCGs) work by multiplying then adding. This means that XORShift* gains its power by essentially running its output through an LCG.
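The relationship is easy to see in code. Here is a Kotlin sketch using the commonly published xorshift64* shift constants and multiplier; the starred generator really is the plain one plus a single multiply on output (the seed just has to be non-zero):

```kotlin
// XORShift and XORShift* side by side: identical state transitions
// (shifts 12, 25, 27), with the starred variant adding one multiply
// on output.  State must be seeded non-zero or it stays stuck at zero.
class XorShift64(var state: ULong = 88172645463325252uL) {
    fun nextRaw(): ULong {                     // plain XORShift
        state = state xor (state shr 12)
        state = state xor (state shl 25)
        state = state xor (state shr 27)
        return state
    }
    fun nextStar(): ULong =                    // XORShift*: same update,
        nextRaw() * 2685821657736338717uL      // then one multiply
}

fun main() {
    val rng = XorShift64()
    println(rng.nextRaw())
    println(rng.nextStar())
}
```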

If an improvement step works for XORShift, would such an improvement step work for a LCG?
