I did exercise 1-4 before exercise 1-3 because you sort of need 1-4 to do 1-3 properly.
A quick explanation from the header docs:
Two solutions to Column 1 Exercise 4 of Bentley’s programming pearls: “How do you generate k unique random integers less than n in random order?”
The “shuffle” method is what I considered to be the naive solution, but it is surprisingly what Bentley’s book describes as the answer. In this method, you allocate an array A of n integers such that A[i] = i for 0 < i < n. You then shuffle the whole array in O(n) time, and select the top k. This has a space requirement of O(n + k) and a time requirement of O(n + k).
The solution I came up with, which I call the "partition" method, involves dividing the interval 0..n into k subintervals, and randomly selecting 1 integer from each interval. This guarantees uniqueness without having to actually store all n integers. You can then shuffle the k samples. You require O(k) time to generate the k samples, O(k) time to shuffle them, and O(k) space to store them, which implies that the whole procedure requires O(k) time and O(k) space. The consequences of this leaner method are that (a) Your distribution is "more uniform" than in the first, since you're forced to select samples from subranges, and (b) there are always some numbers left over in the last interval that you can't sample from, and the size of this junk interval grows larger as k gets smaller.
I implemented both methods in Python (generate_unique_ints.py, or with syntax highlighting: generate_unique_ints.py), only to find that mine was slower in many cases when I expected it to always be faster. I then reimplemented the whole thing in C (generate_unique_ints.cc, or with syntax highlighting: generate_unique_ints.cc). Here are some experimental results that sum up my findings. All measurements were done on the ol’ MacBook using the ‘time’ command. All time measurements are user time, in seconds.
Python, n = 1 000 000
| k | Time with shuffle (sec) | Time with partition (sec) |
|---|---|---|
| k = 100 | 6.4 | 0.0 |
| k = 100 000 | 6.4 | 1.3 |
| k = 500 000 | 6.4 | 6.4 |
| k = 900 000 | 6.4 | 11.5 |
C version, n = 10 000 000
| k | Time with shuffle (sec) | Time with partition (sec) |
|---|---|---|
| k = 1 000 | 1.2 | 0.0 |
| k = 1 000 000 | 1.2 | 0.1 |
| k = 5 000 000 | 1.2 | 0.6 |
| k = 9 000 000 | 1.2 | 1.3 |
In both cases, the performance of the shuffle method stays constant, which is what one would expect since you’re always generating n integers where n > k. What I didn’t expect is for the partition version to run slower than the shuffle version for any k: However, I observed in Python that at around k = n/2, it becomes slower. I suspect that the reason that this is happening is that I’m building the list of k samples by calling append() k times, instead of using a list comprehension. It wasn’t immediately obvious how to use a comprehension to write that loop, so i didn’t do it. I’m curious as to how this would affect performance, if at all, since this is just a wild guess.
Comments (0)