|
| 1 | +# Selection Sampling |
| 2 | + |
| 3 | +Goal: Select *k* items at random from a collection of *n* items. |
| 4 | + |
| 5 | +Let's say you have a deck of 52 playing cards and you need to draw 10 cards at random. This algorithm lets you do that. |
| 6 | + |
| 7 | +Here's a very fast version: |
| 8 | + |
| 9 | +```swift |
| 10 | +func select<T>(from a: [T], count k: Int) -> [T] { |
| 11 | +  var a = a |
| 12 | +  for i in 0..<k { |
| 13 | +    let r = Int.random(in: i..<a.count)  // random index in the unselected region |
| 14 | +    if i != r { |
| 15 | +      a.swapAt(i, r) |
| 16 | +    } |
| 17 | +  } |
| 18 | +  return Array(a[0..<k]) |
| 19 | +} |
| 20 | +``` |
| 21 | + |
| 22 | +As often happens with these [kinds of algorithms](../Shuffle/), it divides the array into two regions. The first region is the selected items; the second region is all the remaining items. |
| 23 | + |
| 24 | +Here's an example. Let's say the array is: |
| 25 | + |
| 26 | + [ "a", "b", "c", "d", "e", "f", "g" ] |
| 27 | + |
| 28 | +We want to select 3 items, so `k = 3`. In the loop, `i` is initially 0, so it points at `"a"`. |
| 29 | + |
| 30 | + [ "a", "b", "c", "d", "e", "f", "g" ] |
| 31 | +    i |
| 32 | + |
| 33 | +We calculate a random number between `i` and `a.count - 1`, the last valid index of the array. Let's say this is 4. Now we swap `"a"` with `"e"`, the element at index 4, and move `i` forward: |
| 34 | + |
| 35 | + [ "e" | "b", "c", "d", "a", "f", "g" ] |
| 36 | +          i |
| 37 | + |
| 38 | +The `|` bar shows the split between the two regions. `"e"` is the first element we've selected. Everything to the right of the bar we still need to look at. |
| 39 | + |
| 40 | +Again, we ask for a random number between `i` and `a.count - 1`, but because `i` has shifted, the random number can never be less than 1. So we'll never again swap `"e"` with anything. |
| 41 | + |
| 42 | +Let's say the random number is 6 and we swap `"b"` with `"g"`: |
| 43 | + |
| 44 | + [ "e", "g" | "c", "d", "a", "f", "b" ] |
| 45 | +               i |
| 46 | + |
| 47 | +One more random number to pick, let's say it is 4 again. We swap `"c"` with `"a"` to get the final selection on the left: |
| 48 | + |
| 49 | + [ "e", "g", "a" | "d", "c", "f", "b" ] |
| 50 | + |
| 51 | +And that's it. Easy peasy. The loop runs in **O(k)** time because as soon as we've selected *k* elements, we're done. (Swift copies the input array on the first swap, which adds a one-time **O(n)** cost.) |
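
To try the swap-based approach on the deck-of-cards example, here is a minimal sketch. It assumes Swift 4.2 or later (for `Int.random(in:)` and `swapAt`). The name `selectBySwapping`, the `precondition`, and the use of the integers 1 through 52 as stand-ins for cards are illustrative choices, not part of the original code.

```swift
// Hypothetical name; same idea as the swap-based select(from:count:) above.
func selectBySwapping<T>(from source: [T], count k: Int) -> [T] {
  precondition(k >= 0 && k <= source.count, "k must be between 0 and n")
  var a = source
  for i in 0..<k {
    // Pick a random index in the not-yet-selected region i...(n - 1).
    let r = Int.random(in: i..<a.count)
    a.swapAt(i, r)  // has no effect when i == r
  }
  return Array(a[0..<k])
}

let deck = Array(1...52)  // stand-in for 52 playing cards
let hand = selectBySwapping(from: deck, count: 10)
print(hand.count)       // 10
print(Set(hand).count)  // 10, because no card is drawn twice
```

Because every swap moves an element out of the unselected region, the same card can never be picked twice.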
| 52 | + |
| 53 | +However, there is one downside: this algorithm does not keep the elements in the original order. In the input array `"a"` came before `"e"` but now it's the other way around. If that is an issue for your app, you can't use this particular method. |
| 54 | + |
| 55 | +Here is an alternative approach that does keep the original order intact, but is a little more involved: |
| 56 | + |
| 57 | +```swift |
| 58 | +func select<T>(from a: [T], count requested: Int) -> [T] { |
| 59 | +  var examined = 0 |
| 60 | +  var selected = 0 |
| 61 | +  var b = [T]() |
| 62 | + |
| 63 | +  while selected < requested { // 1 |
| 64 | +    examined += 1 |
| 65 | + |
| 66 | +    let r = Double(arc4random()) / 0x100000000 // 2 |
| 67 | + |
| 68 | +    let leftToExamine = a.count - examined + 1 // 3 |
| 69 | +    let leftToAdd = requested - selected |
| 70 | + |
| 71 | +    if Double(leftToExamine) * r < Double(leftToAdd) { // 4 |
| 72 | +      selected += 1 |
| 73 | +      b.append(a[examined - 1]) |
| 74 | +    } |
| 75 | +  } |
| 76 | +  return b |
| 77 | +} |
| 78 | +``` |
| 79 | + |
| 80 | +This algorithm uses probability to decide whether to include each element in the selection or not. |
| 81 | + |
| 82 | +1. The loop steps through the array from beginning to end. It keeps going until we've selected *k* items from our set of *n*. Here, *k* is called `requested` and *n* is `a.count`. |
| 83 | + |
| 84 | +2. Calculate a random number between 0 and 1. We want `0.0 <= r < 1.0`. The upper bound is exclusive; we never want it to be exactly 1. That's why we divide the result from `arc4random()` by `0x100000000` instead of the more usual `0xffffffff`. |
| 85 | + |
| 86 | +3. `leftToExamine` is how many items we still haven't looked at. `leftToAdd` is how many items we still need to select before we're done. |
| 87 | + |
| 88 | +4. This is where the magic happens. Basically, we're flipping a biased coin: the current element is selected with probability `leftToAdd / leftToExamine`. If it was heads, we add the current array element to the selection; if it was tails, we skip it. |
| 89 | + |
| 90 | +Interestingly enough, even though we use probability, this approach always guarantees that we end up with exactly *k* items in the output array. |
| 91 | + |
| 92 | +Let's walk through the same example again. The input array is: |
| 93 | + |
| 94 | + [ "a", "b", "c", "d", "e", "f", "g" ] |
| 95 | + |
| 96 | +The loop looks at each element in turn, so we start at `"a"`. We get a random number between 0 and 1, let's say it is 0.841. The formula at `// 4` multiplies the number of items left to examine with this random number. There are still 7 elements left to examine, so the result is: |
| 97 | + |
| 98 | + 7 * 0.841 = 5.887 |
| 99 | + |
| 100 | +We compare this to 3 because we wanted to select 3 items. Since 5.887 is greater than 3, we skip `"a"` and move on to `"b"`. |
| 101 | + |
| 102 | +Again, we get a random number, let's say 0.212. Now there are only 6 elements left to examine, so the formula gives: |
| 103 | + |
| 104 | + 6 * 0.212 = 1.272 |
| 105 | + |
| 106 | +This *is* less than 3 and we add `"b"` to the selection. This is the first item we've selected, so two left to go. |
| 107 | + |
| 108 | +On to the next element, `"c"`. The random number is 0.264, giving the result: |
| 109 | + |
| 110 | + 5 * 0.264 = 1.32 |
| 111 | + |
| 112 | +There are only 2 elements left to select, so this number must be less than 2. It is, and we also add `"c"` to the selection. The total selection is `[ "b", "c" ]`. |
| 113 | + |
| 114 | +Only one item left to select but there are still 4 candidates to look at. Suppose the next random number is 0.718. The formula now gives: |
| 115 | + |
| 116 | + 4 * 0.718 = 2.872 |
| 117 | + |
| 118 | +For this element to be selected the number has to be less than 1, since there is only 1 element left to be picked. It isn't, so we skip `"d"`. Only three possibilities left -- will we make it? |
| 119 | + |
| 120 | +The random number is 0.346. The formula gives: |
| 121 | + |
| 122 | + 3 * 0.346 = 1.038 |
| 123 | + |
| 124 | +Just a tiny bit too high. We skip `"e"`. Only two candidates left... |
| 125 | + |
| 126 | +Note that now we're literally dealing with a coin toss: if the random number is less than 0.5 we select `"f"` and we're done. Otherwise, we go on to the final element. Let's say we get 0.583: |
| 127 | + |
| 128 | + 2 * 0.583 = 1.166 |
| 129 | + |
| 130 | +We skip `"f"` and look at the very last element. Whatever random number we get here, it should always select `"g"` or we won't have selected enough elements and the algorithm doesn't work! |
| 131 | + |
| 132 | +Let's say our final random number is 0.999 (remember, it can never be 1.0 or higher). Actually, no matter what we choose here, the formula will always give a value less than 1: |
| 133 | + |
| 134 | + 1 * 0.999 = 0.999 |
| 135 | + |
| 136 | +And so the last element will always be chosen if we didn't have a big enough selection yet. The final selection is `[ "b", "c", "g" ]`. Notice that the elements are still in their original order, because we examined the array from left to right. |
| 137 | + |
| 138 | +Maybe you're not convinced yet... What if we always got 0.999 as the random value, would that still select 3 items? Well, let's do the math: |
| 139 | + |
| 140 | + 7 * 0.999 = 6.993 is this less than 3? no |
| 141 | + 6 * 0.999 = 5.994 is this less than 3? no |
| 142 | + 5 * 0.999 = 4.995 is this less than 3? no |
| 143 | + 4 * 0.999 = 3.996 is this less than 3? no |
| 144 | + 3 * 0.999 = 2.997 is this less than 3? YES |
| 145 | + 2 * 0.999 = 1.998 is this less than 2? YES |
| 146 | + 1 * 0.999 = 0.999 is this less than 1? YES |
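
Both the walkthrough and the all-0.999 case can be replayed in code. The sketch below is a hypothetical variant of the order-preserving function that reads its "random" numbers from a fixed list instead of calling `arc4random()`. The name `selectInOrder`, the `using:` parameter, and the `precondition` checks are assumptions made for this illustration.

```swift
// Hypothetical variant: takes the "random" values from a list so that a
// run can be reproduced exactly. Every value must satisfy 0.0 <= r < 1.0.
func selectInOrder<T>(from a: [T], count requested: Int, using values: [Double]) -> [T] {
  precondition(requested >= 0 && requested <= a.count, "cannot select more items than exist")
  precondition(values.count >= a.count, "need one value per element")
  precondition(values.allSatisfy { $0 >= 0 && $0 < 1 }, "values must be in 0..<1")

  var result = [T]()
  var examined = 0
  while result.count < requested {
    let leftToExamine = a.count - examined  // includes the current element
    let leftToAdd = requested - result.count
    if Double(leftToExamine) * values[examined] < Double(leftToAdd) {
      result.append(a[examined])
    }
    examined += 1
  }
  return result
}

let letters = ["a", "b", "c", "d", "e", "f", "g"]

// The random numbers used in the walkthrough.
let rolls = [0.841, 0.212, 0.264, 0.718, 0.346, 0.583, 0.999]
print(selectInOrder(from: letters, count: 3, using: rolls))
// ["b", "c", "g"]

// The case where every random number is 0.999.
let allHigh = Array(repeating: 0.999, count: letters.count)
print(selectInOrder(from: letters, count: 3, using: allHigh))
// ["e", "f", "g"]
```

With high values throughout, nothing is picked until the number of elements left equals the number still needed, at which point every remaining element is taken.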
| 147 | + |
| 148 | +It always works! But does this mean that elements closer to the end of the array have a higher probability of being chosen than those in the beginning? Nope, all elements are equally likely to be selected. (Don't take my word for it: see the playground for a quick test that shows this in practice.) |
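
One rough way to check the uniformity claim yourself is to run the selection many times and count how often each element shows up. This is only a sketch, not the playground's test: it assumes Swift 4.2 or later and swaps `arc4random()` for `Double.random(in: 0..<1)`, which also never returns 1.0. The names `selectInOrder`, `items`, `trials` and `counts` are made up for the example.

```swift
// Order-preserving selection, using Double.random(in:) as the random source.
func selectInOrder<T>(from a: [T], count requested: Int) -> [T] {
  precondition(requested >= 0 && requested <= a.count, "cannot select more items than exist")
  var result = [T]()
  var examined = 0
  while result.count < requested {
    let r = Double.random(in: 0..<1)        // 0.0 <= r < 1.0
    let leftToExamine = a.count - examined  // includes the current element
    let leftToAdd = requested - result.count
    if Double(leftToExamine) * r < Double(leftToAdd) {
      result.append(a[examined])
    }
    examined += 1
  }
  return result
}

let items = ["a", "b", "c", "d", "e", "f", "g"]
let trials = 70_000
var counts = [String: Int]()

// Select 3 of 7 items over and over, tallying every selected item.
for _ in 0..<trials {
  for item in selectInOrder(from: items, count: 3) {
    counts[item, default: 0] += 1
  }
}

for item in items {
  print(item, counts[item, default: 0])
}
```

If every element is equally likely, each count should land close to 70,000 * 3 / 7 = 30,000, with no visible drift toward the start or the end of the array.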
| 149 | + |
| 150 | +Here's an example of how to test this algorithm: |
| 151 | + |
| 152 | +```swift |
| 153 | +let input = [ |
| 154 | + "there", "once", "was", "a", "man", "from", "nantucket", |
| 155 | + "who", "kept", "all", "of", "his", "cash", "in", "a", "bucket", |
| 156 | + "his", "daughter", "named", "nan", |
| 157 | + "ran", "off", "with", "a", "man", |
| 158 | + "and", "as", "for", "the", "bucket", "nan", "took", "it", |
| 159 | +] |
| 160 | + |
| 161 | +let output = select(from: input, count: 10) |
| 162 | +print(output) |
| 163 | +print(output.count) |
| 164 | +``` |
| 165 | + |
| 166 | +The performance of this second algorithm is **O(n)** as it may require a pass through the entire input array. |
| 167 | + |
| 168 | +> **Note:** If `k > n/2`, then it's more efficient to do it the other way around and choose the `n - k` items to remove. |
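
The note doesn't say which of the two functions it has in mind. As one possible reading, here is a sketch that applies the idea to the swap-based version: when `k` is more than half of `n`, it moves `n - k` randomly chosen items to the front and returns everything else. The name `selectBySwappingEitherEnd` is hypothetical, and the code assumes Swift 4.2 or later.

```swift
// Hypothetical helper: does min(k, n - k) swaps instead of always k.
func selectBySwappingEitherEnd<T>(from source: [T], count k: Int) -> [T] {
  precondition(k >= 0 && k <= source.count, "k must be between 0 and n")
  var a = source
  let n = a.count

  if k <= n / 2 {
    // Few items wanted: pick the k items to keep.
    for i in 0..<k {
      a.swapAt(i, Int.random(in: i..<n))
    }
    return Array(a[0..<k])
  } else {
    // Many items wanted: pick the n - k items to remove and keep the rest.
    let toRemove = n - k
    for i in 0..<toRemove {
      a.swapAt(i, Int.random(in: i..<n))
    }
    return Array(a[toRemove..<n])
  }
}

let cards = Array(1...52)
let bigHand = selectBySwappingEitherEnd(from: cards, count: 45)
print(bigHand.count)       // 45
print(Set(bigHand).count)  // 45
```

This works because the complement of a random selection of `n - k` items is itself a random selection of `k` items. Like the first algorithm, it does not keep the original order.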
| 169 | + |
| 170 | +Based on code from Algorithm Alley, Dr. Dobb's Magazine, October 1993. |
| 171 | + |
| 172 | +*Written for Swift Algorithm Club by Matthijs Hollemans* |