Saturday, May 11, 2019

Building a Randogram Emulator

As I stated in last year's review of the PCG paper, randograms are an informal way of looking at random number generators. They are generated by creating a 256x256 grid, taking 32768 pairs of numbers, and plotting them on the grid. The code for doing this was given in that previous article and is not too difficult, so I will omit the drawing code. Since the numbers for the random number generator will be coming from assembly language code, the RNG component that provides the randogram its data will need to be written. We will simply have the RNG hold an array of 65536 elements that gets looped through. Our 6502 interpreter will fill out this array when it is run. This is pretty trivial code:

open class RNG {
    val randomBytes = arrayListOf<Int>()
    var indx = 0

    init {
        for (i in 0..65535)
            randomBytes.add(0)
    }

    open fun nextUByte():Int {
        val temp = randomBytes[indx]
        ++indx
        if (indx > 65535)
            indx = 0
        return temp and 255
    }
}

We are now ready to build our Randogram Emulator's main display. This version is being done using JavaFX but could have been done in JavaScript. It would be nice if there were a standard cross-platform GUI library to use, and there are some third-party attempts at such a beast. Until there is a clear winner in this area, I will use JavaFX and plan on porting any GUI code to JavaScript when I am ready to get the js version of the emulator running. The GUI code simply sets up a canvas to draw the randogram on and a button for loading an assembly language file to be processed.

    override fun start(primaryStage: Stage?) {
        randogram.drawToCanvas(canvas, 2.0)

        val btn = Button("Test Assembly")
        btn.setOnAction {handleLoadAsm()}
        val root = VBox(10.0, canvas, btn)
        val scene = Scene(root, 800.0, 600.0)

        primaryStage!!.title = "JavaFX test"
        primaryStage.scene = scene
        primaryStage.show()
    }

The button handler is where all the real work takes place. We will simply set up the emulator environment, then bring up a file dialog for obtaining the assembly language file that should be processed. As this is an internal tool, we are not doing any sanity checking on the contents of the assembly language file but instead are simply calling our assembler class and getting it to assemble the file. A proper program would check the results of the assembler to make sure the file assembled properly, but this is a quick and dirty test so this will not be done.

The code for running the machine language is very simple as we just run instructions until we hit a break instruction. The twist, and where we are cheating big time, is that when we run into a NOP instruction, we handle that instruction by taking the value of the accumulator and adding it to the RNG instance's list of numbers. Once we have written a proper assembly testing framework, this will result in 65536 numbers being generated, which will fill out the list.

Once the assembly language code has run, we generate the randogram and update the display. We now have a way of testing our assembly language random number generators.

    private fun handleLoadAsm() {
        var memoryManager = Cartridge()
        var m6502 = M6502( memoryManager )
        var byteData =  ByteArray(4096)

        val fileChooser = FileChooser()
        val assemblyFile = fileChooser.showOpenDialog(null) ?: return
        if (assemblyFile.isFile) {
            var assemblyList:ArrayList<String> = ArrayList(assemblyFile.readLines())

            var assembler = Assembler(m6502)
            assembler.assembleProgram(assemblyList)
            // copy the assembled bank into cartridge memory
            for (cntrRom in 0..assembler.currentBank.size-1) {
                byteData[cntrRom] = assembler.currentBank.readBankAddress(cntrRom).toByte()
                memoryManager.write(cntrRom, assembler.currentBank.readBankAddress(cntrRom))
            }

            // run until BRK, trapping NOPs to grab the accumulator
            // (assembleProgram, read, and step are the emulator's own methods)
            m6502.state.ip = 0
            var ipAddress = m6502.state.ip
            var indx = 0
            while ( memoryManager.read(ipAddress) != 0) {
                if ( memoryManager.read(ipAddress) == 0xEA) {
                    rng.randomBytes[indx] = m6502.state.acc
                    ++indx
                }
                m6502.step()
                ipAddress = m6502.state.ip
            }

            randogram.drawToCanvas(canvas, 2.0)
        }
    }

The problem we now have is that we need to write a testing framework for actually testing our random number generators. This will be done next fortnight.

Saturday, April 27, 2019

Building a 16-bit Linear Congruential Generator

The 16-bit seed version of the Linear Congruential Generator (LCG) is very similar to the 8-bit version in that every time it is called it returns a byte. The big difference is that the seed is 16 bits instead of 8, so we will have a longer time before numbers start repeating and a bit more randomness. To get the benefit of the longer cycle (and more randomness) we need to use the upper byte of the result.

For those of you who haven't read the earlier blog entries in this series or who want a quick refresher, LCG uses the formula $S_n = A*S_{n-1}+C$ for getting the next seed, so we are simply multiplying the seed by a constant A and then adding a constant C to the result. We will use 141 for A and 3 for C, though other carefully chosen values could be used.
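Before dropping to assembly, the update is easy to sanity-check as a Kotlin sketch (the seed value and function name here are just for illustration, not part of the emulator):

```kotlin
// Kotlin model of the 16-bit LCG: S = (141*S + 3) mod 65536,
// returning the upper byte of the new seed.
var seed16 = 1

fun lcg16Byte(): Int {
    seed16 = (141 * seed16 + 3) and 0xFFFF
    return seed16 shr 8
}
```

Because 141 leaves a remainder of 1 when divided by 4 and 3 is odd, this generator has the full 65536-step period before the seed repeats.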

We already have the multiplication routine, which we wrote last fortnight, so now all we have to do is call it and then add a constant to the result. This is very simple code, as you can see.

; set up our constants for A and C
.EQU LCG_A 141
.EQU LCG_C 3
; create placeholder for RNG seed
seedLo: .BYTE 0
seedHi: .BYTE 0
; ...
; multiply16x8 setup and code would be put here
; ...

lcg16bit:
; load in the current value of the seed into the multiply registers
LDA seedLo
LDY seedHi
; now multiply by the A constant
LDX #LCG_A
JSR multiply16x8
; add the C constant to the result
CLC
ADC #LCG_C
BCC lcg16bit_update_seed
INY ; carry goes into the high byte
lcg16bit_update_seed:
; save the results in the seed
STA seedLo
STY seedHi
; put the upper byte in the accumulator for the return value
TYA
RTS

To test this, simply call the random number generator 257 times and look at the sequence of numbers that are generated.

counter: .BYTE 0 ; loop counter (the multiply clobbers X)

JSR lcg16bit
; now let's do this 256 more times
LDA #0
STA counter
test_loop:
JSR lcg16bit
DEC counter
BNE test_loop
NOP ; using this as easy to find separator

While this does demonstrate that it is working and gives us a longer sequence than the first method, it is not a very good test. For our needs we don't require statistically sound random number generation; a simple test like generating randograms would be sufficient for getting a rough idea of how the generator is working. Randograms were covered in an earlier blog post but are essentially a 256x256 grid on which you plot points by taking pairs of random numbers. This is a good idea but not really implementable on a 64K machine, as the minimum size for holding the grid is 65536 bytes and we would still need some way to save the data. However, we have our own emulator so we can cheat a bit to get the result. More on this next fortnight.
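To make the idea concrete, here is a minimal Kotlin sketch of filling a randogram grid; `nextByte` is a hypothetical stand-in for whatever generator is being tested, not a function from the emulator:

```kotlin
// Build a 256x256 randogram by plotting 32768 pairs of bytes.
fun buildRandogram(nextByte: () -> Int): Array<BooleanArray> {
    val grid = Array(256) { BooleanArray(256) }
    repeat(32768) {
        val x = nextByte() and 255   // first byte of the pair
        val y = nextByte() and 255   // second byte of the pair
        grid[x][y] = true            // plot the point
    }
    return grid
}
```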

Saturday, April 13, 2019

Multi-byte Multiplication 8x16

When you get right down to it, there is an interesting property about the number of bits involved in a multiplication. The maximum size of the result is the number of bits in the multiplicand plus the number of bits in the multiplier. This means that our 8x8 multiplier should produce a 16-bit number. Likewise, a 16x16 multiply would properly need 32 bits to store the result. For my needs we don't care about overflow situations for the random number generator, but if you wanted proper accuracy, chaining more bytes to hold larger results is certainly something that could easily be done. What we need for a 16-bit seed is to multiply a 16-bit number by an 8-bit number.
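This property is easy to convince yourself of with the extreme values (a quick check, not part of the emulator):

```kotlin
// The largest 8x8 product still fits in 16 bits,
// and the largest 16x8 product still fits in 24 bits.
val max8x8 = 255 * 255        // 65025, below 65536
val max16x8 = 65535 * 255     // 16711425, below 16777216

check(max8x8 < (1 shl 16))
check(max16x8 < (1 shl 24))
```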

This is actually not that much different from the 8x8 multiplier that we created earlier. There are two big differences. First, we are going to need to shift a 16-bit number. This can be done simply enough using the carry flag, just as was done with the multi-byte additions. The shift left would then look like:

CLC
ROL multiplicandLo
ROL multiplicandHi
This would work just fine, but we can save a byte and a few cycles by getting rid of the CLC instruction and using the ASL instruction in place of the first ROL instruction. ASL always shifts in a 0 while still putting the shifted-out bit into the carry flag.

The second issue is how to deal with a 16-bit result. There are two solutions to this problem. We could store the results in some reserved memory (ideally in page 0), which would be a requirement if we were going to output a 24-bit number. With a little bit of register shifting, however, we can actually keep the results of the multiplication in the registers, which gives us a bit of a speed boost as well as reducing the bytes needed for the program and memory storage. The new multiplication routine now looks like this:

JMP test
; create placeholders for the multiplicand and multiplier
multiplicandLo: .BYTE 0
multiplicandHi: .BYTE 0
multiplierLo: .BYTE 0
; Multiply method 
; A = multiplicand lo, Y = multiplicand Hi, X = multiplier
; result returned in A (lo) and Y (hi)
multiply16x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
STY multiplicandHi
; start result at zero
LDA #0
TAY
multiply16x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply16x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply16x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
TAX ; save low result in X register 
TYA ; transfer high result to Acc for second half of add
ADC multiplicandHi
TAY ; return high result to Y
TXA ; return low result to Acc
multiply16x8_adjust:
; Set up next power of 2
ASL multiplicandLo
ROL multiplicandHi
JMP multiply16x8_loop
multiply16x8_done:
RTS


test:
; 300 x 5 test
LDA #$2C
LDY #$1
LDX #$5
JSR multiply16x8

; forty-two test
LDA #7
LDY #0
LDX #6
JSR multiply16x8

; LCG multiply step test
LDA #42
LDY #0
LDX #141
JSR multiply16x8


Here is a trace of the first test.

(0) 0:JMP $31
(3) 31:LDA #$2C ; A:$0 -> $2c
(5) 33:LDY #$1 ; Y:0 -> $1
(7) 35:LDX #$5 ; X:0 -> $5
(9) 37:JSR $6
(15) 6:STX $5 ; [5]:0 -> $5
(19) 9:STA $3 ; [3]:0 -> $2c
(23) c:STY $4 ; [4]:0 -> $1
(27) f:LDA #$0 ; A:$2c -> $0, Flags:$32 -> $22
(29) 11:TAY ; Y:1 -> $0
(31) 12:LDX $5 ; X:5 -> 5
(35) 15:BEQ $30
(37) 17:LSR $5 ; [5]:$5 -> $2, C=1
(43) 1a:BCC $27
(45) 1c:CLC ; C=0
(47) 1d:ADC $3 ; A:$0 -> $2c
(51) 20:TAX ; X:5 -> $2c
(53) 21:TYA  ; A:$2C -> $0
(55) 22:ADC $4 ; A:$0 -> $1
(59) 25:TAY ; Y:0 -> $1
(61) 26:TXA  ; A:$1 -> $2c
(63) 27:ASL $3 ; [3]:$2C -> $58
(69) 2a:ROL $4 ; [4]:$1 -> $2
(75) 2d:JMP $12
(78) 12:LDX $5 ; X:$2c -> $2
(82) 15:BEQ $30
(84) 17:LSR $5 ; [5]:$2 -> $1
(90) 1a:BCC $27
(93) 27:ASL $3 ; [3]:$58 -> $B0
(99) 2a:ROL $4 ; [4]:$2 -> 4
(105) 2d:JMP $12
(108) 12:LDX $5 ; X:$2 -> $1
(112) 15:BEQ $30
(114) 17:LSR $5 ; [5]:1 -> 0, C=1
(120) 1a:BCC $27
(122) 1c:CLC ; C=0
(124) 1d:ADC $3 ; A:$2c -> $dc
(128) 20:TAX ; X:1 -> $dc
(130) 21:TYA  ; A:$dc -> $1
(132) 22:ADC $4 ; A:$1 -> $5
(136) 25:TAY ; Y:1 -> $5
(138) 26:TXA  ; A:$5 -> $dc
(140) 27:ASL $3
(146) 2a:ROL $4
(152) 2d:JMP $12
(155) 12:LDX $5 ; X:$dc -> $0, Z=1
(159) 15:BEQ $30
(162) 30:RTS 

As you can see, this works very similarly to the original version with only a few changes. A better 8x8 multiply that returns a 16-bit result would be easy to do by simply using this version but always setting Y to 0 when calling. We now have a multiply routine that will allow us to implement both the 16-bit seed version of the LCG algorithm as well as the PCG algorithm. We will start implementing those next fortnight!
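To double-check the routine's logic against its test values, the same shift-and-add algorithm can be modelled in a few lines of Kotlin (a model of the algorithm, not the emulator's code):

```kotlin
// Shift-and-add 16x8 multiply, keeping only 16 bits of result
// just like the register-based assembly routine above.
fun multiply16x8(multiplicand: Int, multiplier: Int): Int {
    var m = multiplicand and 0xFFFF   // 16-bit multiplicand
    var n = multiplier and 0xFF       // 8-bit multiplier
    var result = 0
    while (n != 0) {
        if (n and 1 == 1)             // LSR set carry: add this power of 2
            result = (result + m) and 0xFFFF
        n = n shr 1                   // LSR multiplierLo
        m = (m shl 1) and 0xFFFF      // ASL/ROL the multiplicand pair
    }
    return result
}
```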

Saturday, March 30, 2019

Multi-byte Numbers

Just because the 6502 is an 8-bit processor does not mean that it is limited to dealing with numbers that are no larger than 8 bits. You can have as many bits as your memory will support, though dealing with larger numbers does add a lot of work that the program has to do. The purpose of the carry flag is to help deal with multi-byte numbers. The big issue is that you can only work with 8 bits of the number at a time. While you can use the X and Y registers as temporary holders for parts of the larger number, you most likely will want to store that number somewhere in memory.

There are many ways of storing a multi-byte number, but the two most common formats are big-endian and little-endian, named in reference to Gulliver's Travels, where there were groups of people who fought over whether it was proper to crack a soft-boiled egg from the big end (the big-endians) or the little end (the little-endians). CPU makers have been having a similar fight, the decision being whether the high-order byte should be stored in the lower address followed by the lower-order byte(s), which we call big-endian, or whether the lowest-order byte comes first followed by the higher bytes, which we call little-endian.

The 6502 is considered to be a little-endian architecture due to the way memory addresses are stored. This is not necessarily how you need to store multi-byte numbers in memory, but this tends to be the convention for 6502 programmers. However, since the registers are only 8 bits, it really does not matter what convention you use as long as you are consistent. Here is a sample of doing a 16-bit addition.

JMP test

; Big Endian variables
BE_NumberAHi: .BYTE $3
BE_NumberALo: .BYTE $E8
BE_NumberBHi: .BYTE $3
BE_NumberBLo: .BYTE $E8

; Little Endian Variables
LE_NumberALo: .BYTE $E8
LE_NumberAHi: .BYTE $3
LE_NumberBLo: .BYTE $E8
LE_NumberBHi: .BYTE $3

; Big endian A = A + B
CLC
LDA BE_NumberALo
ADC BE_NumberBLo ; add low byte
STA BE_NumberALo
LDA BE_NumberAHi ; add high byte
ADC BE_NumberBHi ; using carry from prev add.
STA BE_NumberAHi
; Little endian A = A + B
CLC
LDA LE_NumberALo
ADC LE_NumberBLo ; add low byte
STA LE_NumberALo
LDA LE_NumberAHi ; add high byte
ADC LE_NumberBHi ; using carry from prev add.
STA LE_NumberAHi

As you can see, there is no real difference in which order you store the data, but this is not the case with processors that have larger registers. Multi-byte math is done by taking advantage of the carry flag, as the ADC instruction will add an additional 1 to the result of the add if the carry flag is set. This is why you clear the carry flag before adding the low byte; you then add the higher bytes without clearing it. Note that you can have as many bytes as you want in your number as long as you continue to chain the bytes.
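The chained-carry pattern can be modelled in Kotlin, with each array element playing the role of one byte in memory (little-endian, low byte first); this is a model of the technique, not emulator code:

```kotlin
// Multi-byte addition: CLC, then ADC each byte from low to high,
// letting the carry ripple into the next byte.
fun addMultiByte(a: IntArray, b: IntArray): IntArray {
    val result = IntArray(a.size)
    var carry = 0                      // CLC
    for (i in a.indices) {             // low byte first
        val sum = a[i] + b[i] + carry  // ADC
        result[i] = sum and 255
        carry = sum shr 8              // carry flag for the next byte
    }
    return result
}
```

With the $03E8 (1000) values from the listing above, adding the two numbers gives $07D0 (2000), with the carry from the low bytes rippling into the high byte.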

The same procedure used for multi-byte addition works for rotation and subtraction, though with subtraction you need to set the carry flag (SEC) rather than clear it. We will be seeing a demo of multi-byte rotation next fortnight when I implement my 8x16-bit multiplier.

Saturday, March 16, 2019

Shifted Multiply

Adding numbers together in a loop works for small numbers, but the execution time quickly increases as the multiplier increases. Another way of multiplying numbers is by shifting them. Each shift to the left is the equivalent of multiplying by 2. This means that if we shift to the left 3 times we have effectively multiplied by 8. The problem is that this only works with powers of two. When you think about it, though, 5x3 is the same as 5x2+5x1. Likewise, 7x6 is just 7x4+7x2.

This means that we can drastically improve multiplication by simply looping through the bits, shifting the multiplier to the right and seeing if the shifted-out bit is set. If it is, we add the current value of the shifted multiplicand to the result. We then shift the multiplicand to the left to multiply it by the next power of two. This is repeated until the multiplier becomes 0. The code that does this is as follows:
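The loop just described can be sketched in Kotlin to see the decomposition at work (a model of the algorithm, not emulator code):

```kotlin
// Shift-and-add multiply: inspect each bit of the multiplier,
// adding the doubled multiplicand whenever the bit is set.
fun shiftMultiply(multiplicand: Int, multiplier: Int): Int {
    var m = multiplicand
    var n = multiplier
    var result = 0
    while (n != 0) {
        if (n and 1 == 1) result += m  // this power of two is present
        n = n shr 1                    // LSR the multiplier
        m = m shl 1                    // ASL the multiplicand
    }
    return result
}
```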

; shift-based 8-bit multiplication
JMP test
; create placeholders for the multiplicand and multiplier
multiplicandLo: .BYTE 0
multiplierLo: .BYTE 0
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply8x8:
; init by setting zero page vars
STX multiplierLo
STA multiplicandLo
; start result at zero
LDA #0
multiply8x8_loop:
; see if still bits in multiplier 
LDX multiplierLo
BEQ multiply8x8_done
; shift multiplier right. C set indicates need to add
LSR multiplierLo
BCC multiply8x8_adjust
; add current power of 2 multiplicand
CLC
ADC multiplicandLo
multiply8x8_adjust:
; Set up next power of 2
ASL multiplicandLo
JMP multiply8x8_loop
multiply8x8_done:
RTS

test:
; forty-two test
LDA #7
LDX #6
JSR multiply8x8
; 10*20 test
LDA #10
LDX #20
JSR multiply8x8
; LCG multiply step test
LDA #42
LDX #141
JSR multiply8x8

When run, we get the following trace for the first test. The second and third tests are pretty similar.

(0) 0:JMP $22
(3) 22:LDA #$7 ; A:0 -> 7
(5) 24:LDX #$6 ; X:0 -> 6
(7) 26:JSR $5
(13) 5:STX $4 ; [4]:0 -> 6
(17) 8:STA $3 ; [3]:0 -> 7
(21) b:LDA #$0 ; A:7 -> 0
(23) d:LDX $4 ; X:6 -> 6
(27) 10:BEQ $21
(29) 12:LSR $4 ; [4]:6 -> 3
(35) 15:BCC $1B
(38) 1b:ASL $3 ; [3]:7 -> $E
(44) 1e:JMP $D
(47) d:LDX $4 ; X:6 -> 3
(51) 10:BEQ $21
(53) 12:LSR $4 ; [4]:3 -> 1, C=1
(59) 15:BCC $1B
(61) 17:CLC ; C=0
(63) 18:ADC $3 ; A:0 -> $E
(67) 1b:ASL $3 ; [3]:$E -> 1C
(73) 1e:JMP $D
(76) d:LDX $4 ; X:3 -> 1
(80) 10:BEQ $21
(82) 12:LSR $4 ; [4]:1->0, C=1
(88) 15:BCC $1B
(90) 17:CLC ; C=0
(92) 18:ADC $3 ; A:$E -> 2A
(96) 1b:ASL $3 ; [3]:$1C -> $38
(102) 1e:JMP $D
(105) d:LDX $4 ; X:1 -> 0, Z=1
(109) 10:BEQ $21
(112) 21:RTS 

The call starts at cycle 3 and runs until cycle 118 for a 115-cycle run time. This is slightly worse than the brute-force way, but if we continued with the other two tests we would find that the 10x20 test takes 163 cycles, which is significantly faster than the 272 cycles for the brute-force multiply. And the 42x141 test only took 245 cycles compared to the 1845 cycles that the brute-force method required.

We now have everything necessary to create an LCG random number generator, which we will be doing next fortnight.

Saturday, March 2, 2019

Brute-forcing 8-bit multiplication

As the 6502 does not have instructions for multiplication, if we want to do any multiplication then we need to write our own routine to multiply numbers. There is a very easy way to do this, but it is not efficient. Still, let's take a look at the obvious brute-force method before working out a more efficient way of doing this.

In the discrete math courses I took, multiplication was defined as adding a number, called the multiplicand, to itself the number of times specified by another number, called the multiplier. We have an add instruction and we have looping constructs to repeat the addition a number of times. To make this a proper function we need a way of getting information into the function and returning the results. In this case we have two bytes, so we will pass the multiplicand in the accumulator and the multiplier in X, returning the result in the accumulator.

The following code is a very rough way of doing this and is not what I would consider production code, which would check for the 0 case and would have a nicer loop. As I am planning on writing a better version, this will do for demonstration purposes.

; brute force 8-bit multiplication
JMP test
; create a placeholder for the multiplicand
multiplicand: .BYTE 0
; Multiply method A = multiplicand, X = multiplier result returned in A
multiply:
STA multiplicand
multiplyLoop:
DEX
BEQ multiplyDone
CLC
ADC multiplicand
JMP multiplyLoop
multiplyDone:
RTS

test:
; forty-two test
LDA #7
LDX #6
JSR multiply
; 10*20 test
LDA #10
LDX #20
JSR multiply

The trace results show us pretty much what we would expect. The first test does take a while to run, as it starts at cycle 3 and returns at cycle 93 (the instruction after the RTS), giving us a 90-cycle run time.

(0) 0:JMP $12
(3) 12:LDA #$7 ; A:0 -> 7
(5) 14:LDX #$6 ; X:0 -> 6
(7) 16:JSR $4
(13) 4:STA $3 ; [3]:0 -> 7
(17) 7:DEX ; X:6 -> 5
(19) 8:BEQ $11
(21) a:CLC 
(23) b:ADC $3 ; A:7 -> $E
(27) e:JMP $7
(30) 7:DEX ; X:5 -> 4
(32) 8:BEQ $11
(34) a:CLC
(36) b:ADC $3 ; A:$E -> $15
(40) e:JMP $7
Additional loops removed as you get the idea!
(75) b:ADC $3 ; A:$23 -> $2A
(79) e:JMP $7
(82) 7:DEX ; X:1 -> 0, Z set
(84) 8:BEQ $11
(87) 11:RTS

A slightly larger 10 * 20 multiply, however, does not give us such good results.

(93) 19:LDA #$A ; A:$2A -> $A 
(95) 1b:LDX #$14 ; X:0 -> $14
(97) 1d:JSR $4
(103) 4:STA $3 ; [3]:7 -> $A
(107) 7:DEX ; X:$14 -> $13
(109) 8:BEQ $11
(111) a:CLC 
(113) b:ADC $3 ; A:$A -> $14
Additional loops removed as you get the idea!
(345) a:CLC
(347) b:ADC $3 ; A:$BE -> $C8
(351) e:JMP $7
(354) 7:DEX ; X:1 -> 0, Z set
(356) 8:BEQ $11
(359) 11:RTS
(365) BRK

The time for this was 272 cycles, which is tolerable but getting bad. For our random number generator we need to multiply by 141, which would result in 1845 cycles to perform the multiplication. Not a good result at all. Obviously we need to come up with a better way of handling multiplication. This will be covered next fortnight.

Saturday, February 16, 2019

Building a Simple Trace Utility

One of the early development tools for assembly language programmers was a trace tool. This was a simple tool that you ran, and it would print out what the program was doing at every step of the process. This type of tool could be considered the predecessor to the debugger, and my eventual plan is to turn this tool into a debugger. For developing the random number generator libraries I don't need that much of a utility. All that is really needed is the ability to load an assembly language file, assemble it, then run the file, printing the command being executed and the state of the registers.
Trace utilities can be thought of as an automated single-step through the program that runs until a break statement is encountered. This is handy for seeing what is happening in non-linear programs such as the following demo:

LDA #0
LDX #3
loop:
CLC
ADC #3
DEX
BNE loop

While the above example may be fairly simple to figure out manually, seeing what is happening line by line can still be useful for debugging, as it is very easy to overlook something obvious, especially if it is a mistake that you have made yourself. It is also very handy for demonstrating what a piece of code is doing, which is important for explaining the mathematical functions that I will be writing in the remainder of this chapter.

People who have been following this series from the beginning will remember that I said an emulator is essentially a disassembler that also has an interpreter attached to it. As we already have a disassembler, and we have an emulator that can run an individual instruction while storing the state of the registers in an accessible data class, the actual trace part of the trace utility is simply calling the disassembler to get the assembly instruction, running that step, then grabbing the register state and printing the results.

// run until break
// (disassembleStep, step, and read are the emulator's own methods)
m6502.state.ip = 0
var traceString = ""
var ipAddress = m6502.state.ip
while ( memoryManager.read(ipAddress) != 0) {
    traceString = "(" + m6502.state.tick +") " +
            ipAddress.toString(16) + ":" +
            disassembler.disassembleStep(ipAddress)
    m6502.step()
    ipAddress = m6502.state.ip
    traceString = traceString + "# A:" + m6502.state.acc.toString(16) +
            ", X:" + m6502.state.x.toString(16) +
            ", Y:" + m6502.state.y.toString(16) +
            ", Flags:" + m6502.state.flags.toString(16)
    println(traceString)
}

When this is run on our sample program, we get the following results.

(0) 0:LDA #$0# A:0, X:0, Y:0, Flags:22
(2) 2:LDX #$3# A:0, X:3, Y:0, Flags:20
(4) 4:CLC # A:0, X:3, Y:0, Flags:20
(6) 5:ADC #$3# A:3, X:3, Y:0, Flags:20
(8) 7:DEX # A:3, X:2, Y:0, Flags:20
(10) 8:BNE $4# A:3, X:2, Y:0, Flags:20
(13) 4:CLC # A:3, X:2, Y:0, Flags:20
(15) 5:ADC #$3# A:6, X:2, Y:0, Flags:20
(17) 7:DEX # A:6, X:1, Y:0, Flags:20
(19) 8:BNE $4# A:6, X:1, Y:0, Flags:20
(22) 4:CLC # A:6, X:1, Y:0, Flags:20
(24) 5:ADC #$3# A:9, X:1, Y:0, Flags:20
(26) 7:DEX # A:9, X:0, Y:0, Flags:22
(28) 8:BNE $4# A:9, X:0, Y:0, Flags:22

Not the prettiest of output, but that can be easily tweaked. So with a very tiny amount of work, we now have a useful tool for writing and testing chunks of assembly language. We are now ready to start building the math libraries that we will be needing for writing our random number generators. We will start covering multiplication next fortnight.

Saturday, February 2, 2019

Preparing for Random Assembly

In the last post we covered the basics of a linear congruential generator and the simplest form of the permuted congruential generator. Now I am going to write some generic 6502 implementations of these generators. This will require covering a few topics that have not been covered yet, such as writing a simple trace utility so the results can be seen. This fortnight I will break down the topics that will be covered over the next few months.

The first step is to create a simple trace utility that will let you run some assembly language code, showing the affected registers and memory as the program runs. This is the equivalent of using a debugger to single-step through the code. One alternative would be to use an existing emulator that contains a debugger, but there are three reasons why I am not going this route. First, a trace utility would be fun to write. Second, I ultimately want my emulator to also be an IDE, so development tools like the trace would be great to have. Finally, going with an existing emulator would require writing for a specific platform, which at a minimum means a bunch of boilerplate code obfuscating the code. Writing a trace utility should not take too long, probably only one or two posts.

For the pure 8-bit version of LCG we need an 8-bit multiply, which is something that the 6502 does not do in hardware. Software multiply is possible, so we will look at a couple of ways of doing this before implementing our 8-bit LCG.

Before getting to the 16-bit version of LCG we will have to take a look at how to perform multi-byte math and then expand my multiply routine to 16 bits, which is a bit of work.

Finally, we will implement the 16-bit version of LCG and the most basic version of PCG.

Saturday, January 19, 2019

Building PCG

An LCG basically uses the following calculation: X[n] = (A * X[n-1] + C) % M. X[n] is the next number in the sequence to generate, with A being a specially selected multiplier value, C a value to add to the generator, and M the modulus or range of the generator. We will use Rotenberg's values of 141 for A and 3 for C, and just use a byte, so 256 for the modulus. Let's generate a randogram and see what we get.
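In Kotlin form the generator is just a one-line state update (the seed value here is arbitrary, and this is a sketch rather than the series' 6502 code):

```kotlin
// 8-bit LCG with A = 141, C = 3, M = 256 (Rotenberg's values).
var lcgState = 1

fun lcg8(): Int {
    lcgState = (141 * lcgState + 3) and 255
    return lcgState
}
```

Since 141 leaves a remainder of 1 when divided by 4 and 3 is odd, the generator visits all 256 values before repeating.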

While this is not as good as we hoped, it is not as bad as it seems, as the order in which the dots appear is random. We are using an 8-bit generator, so what if we went with 16? Obviously the lower-order byte would remain the same, but what if we took higher bits for the output?

Interestingly enough, each bit you shift becomes more random, with different shifts having different coverage of the randogram. What if we added a permutation to select from some of the better shifts? If only we had a random way of selecting which shift to use. What about the top 2 bits (more for larger generators)? Since they have the most randomness, using them as the selector may gain us an overall improvement.

Melissa added additional permutations for rotation, XOR, and XORShift. More permutations are possible if you follow the rules:

  • Bits can be thrown away but you can't make new ones (precious).
  • Target and control groups of bits must be separate.
  • Control bits must not be modified.
  • Can do many rounds and different rounds can have different control bits.

By combining different permutations, it was possible to pass BigCrush using only a 36-bit PCG (Permuted Congruential Generator). So, in conclusion, PCG is a family of pseudo-random number generation algorithms. They are relatively simple to implement. They run fast, as the code is simple boolean operations, multiplications, and additions. They are space-efficient, requiring less than double the modulus size. And passing the TestU01 BigCrush tests shows they are statistically good.
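For reference, the full-size pcg32 generator (the XSH-RR member of the family) is small enough to sketch in Kotlin: a 64-bit LCG step whose output is permuted by an xorshift followed by a data-dependent rotate. The constants are the PCG reference implementation's defaults, not anything from the 6502 code in this series.

```kotlin
// pcg32 (XSH-RR): LCG state transition plus an output permutation.
var pcgState: ULong = 0x853c49e6748fea9buL

fun pcg32(): UInt {
    val old = pcgState
    pcgState = old * 6364136223846793005uL + 1442695040888963407uL
    val xorshifted = (((old shr 18) xor old) shr 27).toUInt()
    val rot = (old shr 59).toInt()
    // rotate right by the top 5 bits of the old state
    return (xorshifted shr rot) or (xorshifted shl ((32 - rot) and 31))
}
```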

Saturday, January 5, 2019

Testing Random Number Generators

There are a number of factors that people look for in a random number generator.

  • Period - How long a generator can run before it repeats the entire sequence.
  • Uniformity - All the numbers within the range will appear.
  • Predictability - How predictable is the next number in the sequence.
  • Repeatability - Can the same number repeat itself?
  • Security - Is the sequence invertible?
  • Seekability - Can we quickly skip over numbers?
  • Streams - How usable is the generator when threading?
  • k-dimensional equidistribution - How well distributed the numbers are.

This diverse range of usage inevitably meant that there were a lot of different tests. L'Ecuyer and Simard made a huge contribution to testing in 2007 with the release of TestU01. They took a look at the myriad of tests that were available and created a test suite that includes a large number of previously independently published tests. They combined a number of different tests into batteries of tests (SmallCrush, BigCrush). Lots of the random number generators that were in common use at the time failed the test suite. The PCG paper covers testing of a large number of random number generators for those of you who are curious.

TestU01 is a C library with tests and generators for the most common PRNGs. This means that you need to write a simple C program to run the tests that you desire. Here is the version of Java's random number generator that I wrote to test it. Note that the constants used are based on the API documentation, making this the minimal version of Java's generator.

// Adapted from TestU01 manual, Figure 2.2, Figure 2.4

#include "TestU01.h"

// Java's LCG (java.util.Random) keeps 48 bits of state

static unsigned long long x = 2463534242ULL;
static unsigned long long a = 0x5DEECE66DULL;
static unsigned long long c = 11ULL;

unsigned int jvmlcg (void)
{
   x = (x * a + c) & 0xFFFFFFFFFFFFULL; // keep 48 bits of state
   return (unsigned int)(x >> 16);
}

int main()
{
   // Create TestU01 PRNG object for our generator
   unif01_Gen* gen = unif01_CreateExternGenBits("JVM LCG", jvmlcg);

   // Run the tests.
   bbattery_BigCrush(gen);

   // Clean up.
   unif01_DeleteExternGenBits(gen);

   return 0;
}

The Java test did not do very well on the BigCrush test. I ran the tests in an emulator running Linux so that I didn't need to set up MinGW on my system, as the libraries are for Unix. If you see a flaw in my program, feel free to email me. You can run the test for yourself, but to save you hours of waiting, here are the final test results:

========= Summary results of BigCrush =========

Version:          TestU01 1.2.3
Generator:        JVM LCG
Number of statistics:  160
Total CPU time:   08:55:08.77
The following tests gave p-values outside [0.001, 0.9990]:
(eps  means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):

Test                          p-value
2  SerialOver, r = 22               eps  
4  CollisionOver, t = 2           1 - eps1
6  CollisionOver, t = 3           1 - eps1
8  CollisionOver, t = 7             eps  
10  CollisionOver, t = 14            eps  
12  CollisionOver, t = 21            eps  
13  BirthdaySpacings, t = 2          eps  
14  BirthdaySpacings, t = 3          eps  
15  BirthdaySpacings, t = 4          eps  
16  BirthdaySpacings, t = 7          eps  
17  BirthdaySpacings, t = 7          eps  
18  BirthdaySpacings, t = 8          eps  
19  BirthdaySpacings, t = 8          eps  
20  BirthdaySpacings, t = 16         eps  
21  BirthdaySpacings, t = 16         eps  
22  ClosePairs mNP, t = 3          1.6e-26
22  ClosePairs mNP1, t = 3         1.3e-48
22  ClosePairs mNP2S, t = 3          eps  
23  ClosePairs mNP, t = 5           3.4e-4
23  ClosePairs mNP1, t = 5          3.4e-4
23  ClosePairs mNP2S, t = 5          eps  
24  ClosePairs mNP2S, t = 9        5.7e-46
25  ClosePairs mNP, t = 16         8.0e-30
25  ClosePairs mNP1, t = 16        3.7e-29
25  ClosePairs mNP2S, t = 16      1.1e-283
27  SimpPoker, r = 27                eps  
29  SimpPoker, r = 25                eps  
32  CouponCollector, r = 20          eps  
33  CouponCollector, r = 27          eps  
35  Gap, r = 25                      eps  
37  Gap, r = 20                      eps  
43  Permutation, t = 10             0.9997 
58  AppearanceSpacings, r = 27       eps  
60  WeightDistrib, r = 20            eps  
61  WeightDistrib, r = 28            eps  
64  WeightDistrib, r = 26            eps  
67  MatrixRank, L=30, r=26         7.9e-10
75  RandomWalk1 H (L=50, r=25)       eps  
75  RandomWalk1 M (L=50, r=25)       eps  
75  RandomWalk1 J (L=50, r=25)       eps  
75  RandomWalk1 R (L=50, r=25)       eps  
75  RandomWalk1 C (L=50, r=25)       eps  
85  Fourier3, r = 27                 eps  
87  LongestHeadRun, r = 27           eps  
87  LongestHeadRun, r = 27         1 - eps1
89  PeriodsInStrings, r = 20       1 - eps1
91  HammingWeight2, r = 27         5.0e-13
96  HammingIndep, L=30, r=27         eps  
98  HammingIndep, L=300, r=26        eps  
100  HammingIndep, L=1200, r=25       eps  
102  Run of bits, r = 27              eps  
106  AutoCor, d=3, r=27              2.6e-7
All other tests were passed

One interesting observation that Melissa made in her study is that XORShift did poorly in the crush tests but its companion XORShift* did exceptionally well. This is a surprise when you consider that XORShift and XORShift* use the same shift-and-apply-xor operations. The only difference is that XORShift* adds a multiplication at the end. Anybody familiar with random number generators will know that Linear Congruential Generators (LCGs) work by multiplying then adding. This means that XORShift* gains its power by essentially running its output through an LCG.
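As a concrete illustration, here is Vigna's xorshift64* sketched in Kotlin, with the state transition and the output multiply separated so the "run it through a multiplier" step is visible (shift amounts and the multiplier constant are from Vigna's paper, not from this series):

```kotlin
// xorshift64: three xor-with-shift steps on a 64-bit state.
var xsState: ULong = 1uL   // any nonzero seed

fun xorshift64(): ULong {
    xsState = xsState xor (xsState shr 12)
    xsState = xsState xor (xsState shl 25)
    xsState = xsState xor (xsState shr 27)
    return xsState
}

// xorshift64*: identical state transition, plus one multiply on output.
fun xorshift64star(): ULong = xorshift64() * 2685821657736338717uL
```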

If an improvement step works for XORShift, would such an improvement step work for a LCG?