Does anybody here know a portable, easy-to-use, high-quality, deterministic random number generator? It is a requirement that you get exactly the same sequence of pseudo-random numbers on any architecture (depending on a user-settable seed).
I use a hand-crafted (two-line) linear congruential generator, but always feel a bit uneasy about that.
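For concreteness, a generator of roughly the shape described, a two-line LCG, might look like this (the constants below are Knuth's MMIX multiplier/increment, my arbitrary choice, not necessarily what the parent uses):

```python
# Minimal 64-bit linear congruential generator. All arithmetic is mod 2^64,
# so the sequence is identical on any platform for the same seed.
MASK64 = (1 << 64) - 1

class LCG:
    def __init__(self, seed):
        self.state = seed & MASK64

    def next(self):
        # state = state * a + c (mod 2^64); constants from Knuth's MMIX
        self.state = (self.state * 6364136223846793005
                      + 1442695040888963407) & MASK64
        # return only the upper 32 bits: the low bits of an LCG are weak
        return self.state >> 32
```

Returning only the high half is one of the standard mitigations for the well-known weakness of an LCG's low bits.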
If you use the same symmetric key and nonce, the output bit stream should be the same every time.
This is how OpenBSD's arc4random(3) started out: using the output of RC4 and stirring the entropy pool regularly to ensure forward security. (They've since switched to ChaCha20.)
Monocypher is a decent fit here. Just run its ChaCha20 with (a hash of) your seed as the key. Larger libraries (libsodium, NaCl, etc) are decent too, but bigger.
For cryptographic applications, what you want is, as previously noted, just some arbitrary secure stream function (a block cipher in OFB or CTR mode being a particularly convenient way to obtain such a cipher). Bear in mind that for cryptographic applications it is often important whether the state can be determined by observing the output and whether the state can be rewound; that necessitates choosing a stream cipher that satisfies these requirements, or using a more complex construction. Most notably, block ciphers in CTR/OFB mode do not satisfy this, and ChaCha/Salsa do not either, as they are essentially block ciphers in CTR mode. An interesting modern construction would involve Keccak-style sponge functions, which are a kind of universal cryptographic primitive usable as a hash, MAC, and PRNG/stream cipher all at once, depending on the mode of operation.
For Monte Carlo simulations you want something that is reasonably fast and has a large state space. Mersenne Twister is a common choice, although XorShift is significantly faster and usually has a state space that is large enough.
For randomized distributed simulations (read: lock-step realtime multiplayer games) you don't care about the quality of the output that much, and an LCG tends to be sufficient, although here too XorShift would probably be better, as it has comparable performance and significantly better output quality.
All the mentioned algorithms are by definition deterministic and, discounting implementation errors, platform-independent.
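For illustration, the xorshift64 generator mentioned above is only a few lines (shift constants 13/7/17 and the example seed are from Marsaglia's paper):

```python
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of Marsaglia's xorshift64. The state must be nonzero;
    the new state is also the output."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

# Marsaglia's example seed; any nonzero 64-bit value works.
seed = 88172645463325252
```

Each step is a bijection on nonzero 64-bit states, so the period is 2^64 - 1 and the sequence is fully reproducible from the seed.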
Plenty of CSRNGs are built from AES-CTR and ChaCha20. There's probably no good reason to prefer XorShift over a cryptographic random bit generator that you can seed directly.
For Monte Carlo simulations, CSRNGs are too slow. Xorshift is incredibly fast. You wouldn't want to use a CSRNG for path tracing rendering, for example; you're calling the RNG many times per ray and shooting dozens or hundreds of rays per pixel.
We've dealt with this in Rust—people love to write Monte Carlo simulations as language benchmarks, and Rust used to get dinged for preferring CSRNGs.
Xorshift is like 10 cycles. AES-NI can take hundreds, depending on the number of rounds.
In the case of brute force Monte Carlo path tracing, the speed of the RNG is important, because you have dozens or hundreds of rays per pixel, each with multiple bounces, and every bounce samples from your RNG. When your render times are 6 hours per frame to begin with, you really don't want to mess around.
Ok, let's say we're rendering at 8K (32 megapixels), 200 rays per pixel, 20 bounces per ray. And rendering that frame takes 6 hours. How many (mega)bytes of random data do we need per second?
Six whopping megabytes per second, for the entire job, across all cores! Well, that assumed you only need one byte per bounce. Maybe you need a few bytes?
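A quick sanity check of those numbers (all workload figures are the hypothetical ones from above, not measurements):

```python
# Back-of-the-envelope: random bytes per second for the render described above.
pixels = 7680 * 4320          # "8K" frame, ~33.2 megapixels
rays_per_pixel = 200
bounces_per_ray = 20
bytes_per_bounce = 1          # assume one random byte per bounce
seconds = 6 * 3600            # 6-hour render

total_bytes = pixels * rays_per_pixel * bounces_per_ray * bytes_per_bounce
rate_mb_s = total_bytes / seconds / 1e6
print(round(rate_mb_s, 1))    # prints 6.1
```

Even at several random bytes per bounce, that is tens of megabytes per second across the whole job, which supports the point being made.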
And here I can pull an order of magnitude more and then some (180 megabytes/s) via /dev/urandom on OpenBSD (i.e. Chacha20, no hardware accel), using a single core of a few-years-old fanless mini PC (some Intel Celeron with "U" postfix).
I generally don't buy the argument that your PRNG needs to be super duper fast, because usually the simulation is doing much more than just generating random data, and all that work is so much more expensive than even a relatively expensive CSPRNG. At least my ray tracers are. Generating rays is actually trivial compared to traversing, intersecting and shading the scene.. goodness gracious if I need to do actual texture & normal map lookups at a high resolution, so many cache misses :(
That said, a simple and fast PRNG with little internal state can make life easier if you want reproducibility.. and it might be easier to build a reproducible simulation if your RNG gives exactly one word the size you need, instead of requiring you to generate a big block (and then extract things out of it somehow). This is especially the case with threaded sims. So yeah.
But in this case, I wouldn't be looking at something like XorShift, I'd be looking at a random map that allows me to use (e.g.) the pixel's coordinates and a frame & bounce counter as the input and get matching (constant) output. Now it no longer matters which thread renders which pixel and when, the result will be all the same. It's not hard to build such functions, but they will be slower than XorShift if you want comparable quality. Speed should be no problem though.
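A sketch of such a random map, assuming a splitmix64-style finalizer as the mixing function and an arbitrary packing of the inputs (both are my assumptions, not a specific known-good design):

```python
MASK64 = (1 << 64) - 1

def mix64(z):
    """splitmix64-style finalizer: a fast, invertible, statistically
    decent bit mixer (constants from the public splitmix64 code)."""
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9 & MASK64
    z = (z ^ (z >> 27)) * 0x94D049BB133111EB & MASK64
    return z ^ (z >> 31)

def sample(x, y, frame, bounce, seed=0):
    """Deterministic 'random' 64-bit value for a given pixel/frame/bounce,
    independent of which thread computes it and in what order."""
    key = ((x & 0xFFFF)
           | ((y & 0xFFFF) << 16)
           | ((frame & 0xFFFF) << 32)
           | ((bounce & 0xFFFF) << 48))
    return mix64(key ^ seed)
```

Because the value depends only on the inputs, the render is bit-reproducible regardless of thread scheduling, which is exactly the property described above.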
It's amazing how much effort people are willing to invest to contradict others' experience on the Internet.
The GP is giving a real-life example (people having complained about something being too slow compared to xorshift) and yet you try to do the math to prove it couldn't happen…
The math was just the prelude to my real world experience. Please try to read and comprehend the rest of my post. If you can't comprehend it, then feel free to ask for clarification or just leave it be without adding any snide remarks.
People having complained is not a real-life example of an actual performance issue. Until otherwise proven, it's a real-life example of someone complaining about some microbenchmark that, based on my real-life experience, would be completely irrelevant once they actually developed their real-world problem-solving program, which totally shifts the bottleneck to something that actually matters.
Alternatively, it's a real-life example of poor implementation or usage, and not necessarily poor performance of the underlying primitive. This also happens a lot. "Foo is slow!" is true for a lot of things if you use them wrong. Maybe it's simply the user's ignorance, or maybe the API isn't well designed to support their case. Maybe it's poor documentation.
Also, this is a technical forum. There's absolutely nothing wrong with people doing back of the envelope calculations to analyze the scale of a problem. If more people did that while pointing at real examples with concrete numbers, maybe we'd actually know things instead of just relying on hearsay and religion. Whatever, it's a cargo cult industry.
Here's the good part: if someone who knows better comes along, they find those numbers and can tell us why we're wrong. Otherwise, let's just take everything on faith?
It's amazing how often people (like you, who haven't contributed anything except noise to the discussion) just don't want to understand a problem and if someone comes along trying to understand it and do the math, they're called a lunatic.
And this is why some of the code I get to work on is so terrible, people are wasting time and complicating code to micro-optimize things that are completely irrelevant (a conclusion one could come to either by a ballpark-estimate or a quick profile), while massive bottlenecks sit unattended.
AES-128-CTR on my laptop is 4.8GB/sec per core. I'm sure something producing random enough output for simulation can be built that is as fast as memcpy, but would anything real be affected by the difference?
Yes, it is possible, but it's difficult to do right. I ran into at least 3 or 4 "roadblocks" while designing this algorithm; fortunately I managed to get through all of them.
I've got an unrolled dual-stream going at 37.1 GBps on Threadripper 1950x (which has 2x AES pipelines). ~29.2 GBps for more "typical" usage (but still impractical).
The issue with this is that random bits at this speed are almost useless. A division or modulus operation takes ~80 cycles (~20 nanoseconds), so as soon as you do "rand() % 6" (assuming you wanted a 6-sided dice roll), you've completely wrecked the bit generator. A branch misprediction is like 15 cycles, so that singular if-statement in your code that uses the RNG number would be 5x slower than the generator itself.
It's fast, but I don't know how practical it is. There's no practical way to use all of these random bits that I'm aware of.
I did get a bit-compatible POWER9 version running, by the way. So it's portable across systems (at least, systems with an AES encoder: ARM, POWER9, and 2010-era x86). The Raspberry Pi doesn't have a CPU AES engine (it's outside of the core), so the Raspberry Pi can't use this methodology.
--------------
If I were to actually push forward with this project, I'd need to study whole-program optimization, so that the (rarely used) XMM registers could be put to practical use across a lot of code. The entirety of this RNG fits inside a single XMM register, so it would benefit greatly from whole-program optimization.
The XMM data should also be fed into SIMD somehow, some kind of bridge into ispc or OpenMP SIMD code. A bunch of raw bits coming in through assembly instructions is somewhat impractical, no matter how you look at it.
Dan Lemire informed me of an amazing algorithm (that does work with AES XMM registers) that creates uniform random numbers out of a few multiply instructions. So I can get XMM registers full of arbitrary integers with very low bias (e.g. numbers between 0 and 100) out of the bitstream.
I also managed to convert the numbers into single-precision floats and double-precision floats between 0 and 1.0 uniformly.
Still, this only really seems practical if you have SIMD-code that actually consumes all of these bits. Converting from XMM-registers into RAX registers has quite a bit of slowdown.
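The scalar form of that multiply-based technique for bounded integers looks like this (the SIMD version does the same thing on packed lanes):

```python
def bounded(rand64, n):
    """Map a uniform 64-bit integer to [0, n) using one multiply and a
    shift, instead of an ~80-cycle modulus. The bias is at most n / 2^64,
    negligible for small n; a rejection step can remove it entirely."""
    assert 0 <= rand64 < 2**64 and 0 < n < 2**64
    return (rand64 * n) >> 64
```

So "rand() % 6" becomes `bounded(next_random(), 6)`, keeping the fast bit generator from being swamped by a division.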
I don't think the typical simulation code is actually limited by the speed of RNGs at all.
Counter (0, 1, 2, 3, 4...) -> Multiply Const1 -> Bit-reverse -> XOR Const2 -> Multiply Const3 has been my PRNG for GPUs.
Const1 and Const3 need to be odd (bottom bit == 1). Const2 can be fully random. All three values can come from /dev/urandom with pretty good results, per some preliminary testing. I haven't figured out a good methodology for selecting Const1 or Const3 aside from bottom bit == 1 and otherwise random (if I had more time, I'd probably work on it...).
---------
The "counter" is 64-bits and allocated per-thread. Assume 2^14 bits for 16384 GPU-threads, each GPU-thread then has 50-bits of values.
64-bit integer multiply is slow however. 32-bit multiply would be faster for most GPUs. But also in my experience, GPUs are almost always memory-constrained, not compute-constrained. So you probably can do fine with 64-bit integer multiply.
If you really need to stick with 32-bit integers, then make Const1 and Const3 per-thread specific. That gives a unique sequence of 32-bit numbers per thread, but with only 32 bits of state a cycle is inevitable every 4 billion numbers.
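Putting the pipeline together in scalar form (the constants here are arbitrary illustrative values that only satisfy the odd-bottom-bit rule, not tested ones):

```python
MASK64 = (1 << 64) - 1
# Illustrative constants: C1 and C3 must be odd, C2 is unconstrained.
C1 = 0x9E3779B97F4A7C15   # odd
C2 = 0xD1B54A32D192ED03
C3 = 0x2545F4914F6CDD1D   # odd

def bit_reverse64(x):
    """Reverse the 64 bits of x (a single instruction on many GPUs)."""
    return int(format(x & MASK64, '064b')[::-1], 2)

def gpu_prng(counter):
    """Counter -> *C1 -> bit-reverse -> ^C2 -> *C3, as described above."""
    x = (counter * C1) & MASK64   # multiply by odd constant (a bijection)
    x = bit_reverse64(x)          # move the well-mixed high bits low
    x ^= C2                       # XOR with a free constant
    return (x * C3) & MASK64      # second odd multiply
```

Every stage is a bijection on 64-bit values (odd multiply, bit-reverse, XOR), so distinct counters always produce distinct outputs, which is handy for per-thread counter partitioning.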
It's not only monte-carlo simulations. You may want to add a bit of random noise to a high-resolution video sequence, for example. It is impractical to do so with a slow PRNG.
PCG generators are very good, very fast, and decently random, don't suffer (AFAIK) from bad seeds like Mersenne Twister, mix well, have small state, and have selectable periods of about any size. They are vastly better than LCGs.
I use a 64 bit one and a 128 bit one for all sorts of work.
These are not crypto, so don't use them for that. For non-crypto I have seen no better PRNG than these (and I've followed and written on this space for a long time).
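For reference, the 32-bit PCG-XSH-RR variant is tiny; this follows the minimal public-domain C implementation, translated to Python for illustration:

```python
MASK64 = (1 << 64) - 1

class PCG32:
    """PCG-XSH-RR: 64-bit LCG state, 32-bit permuted output."""
    MULT = 6364136223846793005

    def __init__(self, seed, seq=0):
        # 'seq' selects one of 2^63 distinct streams; increment must be odd
        self.inc = ((seq << 1) | 1) & MASK64
        self.state = 0
        self.next()
        self.state = (self.state + seed) & MASK64
        self.next()

    def next(self):
        old = self.state
        self.state = (old * self.MULT + self.inc) & MASK64
        # output permutation: xorshift-high, then random rotation
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        return ((xorshifted >> rot)
                | (xorshifted << (32 - rot) & 0xFFFFFFFF)) & 0xFFFFFFFF
```

The seeding sequence mirrors the reference pcg32_srandom, so for serious use you would still reach for the upstream C library rather than this sketch.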
When you say "not crypto, don't use them for that", you mean "don't use them for any situation in which the security of the randomness matters", right?
I don't know what you want this for exactly, so definitely don't use my suggestion for crypto purposes. However...
If what you want is just to add some uniform, seemingly random behavior to something (e.g. a game, generate some noise, whatever), I've used the Wolfenstein3D's FizzleFade effect algorithm [0] in the past.
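For reference, the core of that effect is a 17-bit Galois LFSR; here is a sketch following Fabien Sanglard's FizzleFade write-up (the tap mask 0x12000 is taken from that write-up; in the real effect, coordinates outside the 320x200 screen are simply skipped):

```python
def fizzlefade_coords():
    """17-bit Galois LFSR as in Wolfenstein 3D's FizzleFade. Visits every
    nonzero 17-bit state exactly once, yielding pseudo-random (x, y)
    coordinates with no repeats."""
    rndval = 1
    while True:
        y = rndval & 0x000FF           # low 8 bits -> y (0..255)
        x = (rndval & 0x1FF00) >> 8    # next 9 bits -> x (0..511)
        lsb = rndval & 1
        rndval >>= 1
        if lsb:
            rndval ^= 0x00012000       # taps of the 17-bit feedback polynomial
        yield x, y
        if rndval == 1:                # back at the start: full cycle done
            return
```

The nice property is that it touches each coordinate exactly once with no memory beyond the 17-bit state, so it doubles as a cheap non-repeating "shuffle".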
What's high quality depends on your requirements and usage. But, for example, you can use a Feistel construction to generate a random, non-repeating sequence that is as unpredictable as the mixing function. I've used this technique instead of a Fisher–Yates shuffle, trading higher CPU for lower space, for generating 16-bit DNS QIDs.
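A sketch of that idea: a small balanced Feistel network over 16-bit values. The round function and keys below are arbitrary placeholders (my invention, not the DNS-QID code); the Feistel structure guarantees a bijection regardless of what the round function does, so feeding it a counter yields a non-repeating sequence:

```python
def _f(r, k):
    """Arbitrary 8-bit keyed round function; any function works here,
    invertibility comes from the Feistel structure itself."""
    return ((r * 167 + k) ^ (r >> 3)) & 0xFF

def feistel16(x, keys=(0x3A, 0xC1, 0x55, 0x9E)):
    """4-round balanced Feistel permutation of 16-bit values."""
    l, r = x >> 8, x & 0xFF
    for k in keys:
        l, r = r, l ^ _f(r, k)
    return (l << 8) | r
```

Enumerating `feistel16(0), feistel16(1), ...` then walks all 65536 values in a scrambled order using O(1) space, which is the space-for-CPU trade versus a Fisher-Yates shuffle described above.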
Yes, I know many nice algorithms with "high quality" properties. My question is about actual implementations: is there a canonical GitHub repo that everybody uses for this need, or do I have to write or adapt the code myself?
You can use a sequence of AES encryptions with a pre-shared seed, the quality of the random numbers will be among the highest you can get. You can find AES implementations all over the place, for basically any platform.
You would do something like rng1 = AES(seed); rng2 = AES(rng1); rng3 = AES(rng2);
To add to my answer, you can also do something like this:
randomNumber = AES(secretKey, counter), where counter is just a sequence number that you can share in the open. Anyone without the key would not be able to generate the random number with just the counter only.
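In runnable form, the construction looks like this. Note the Python standard library has no AES, so HMAC-SHA256 stands in for the block cipher here; the structure, a keyed function of a public counter, is the point:

```python
import hmac
import hashlib

def rng_value(secret_key, counter):
    """Deterministic 'random' 64-bit value from (key, counter), mirroring
    randomNumber = AES(secretKey, counter). HMAC-SHA256 is a stand-in for
    AES; without the key, the counter alone reveals nothing about the
    output."""
    msg = counter.to_bytes(16, "big")
    tag = hmac.new(secret_key, msg, hashlib.sha256).digest()
    return int.from_bytes(tag[:8], "big")
```

Because the counter can be shared in the open, two parties holding the same key can regenerate the identical sequence on any architecture, which also answers the original portability requirement.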