Commit 5dc6a94

Add selection sampling algorithm
1 parent a9a6eb8 commit 5dc6a94

File tree

6 files changed

+321
-1
lines changed

README.markdown

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,7 @@ If you're new to algorithms and data structures, here are a few good ones to sta
5050
- [Count Occurrences](Count Occurrences/). Count how often a value appears in an array.
5151
- [Select Minimum / Maximum](Select Minimum Maximum). Find the minimum/maximum value in an array.
5252
- [k-th Largest Element](Kth Largest Element/). Find the kth largest element in an array.
53-
- Selection Sampling
53+
- [Selection Sampling](Selection Sampling/). Randomly choose a number of items from a collection.
5454
- Union-Find
5555

5656
### String Search

Selection Sampling/README.markdown

Lines changed: 172 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,172 @@
1+
# Selection Sampling
2+
3+
Goal: Select *k* items at random from a collection of *n* items.
4+
5+
Let's say you have a deck of 52 playing cards and you need to draw 10 cards at random. This algorithm lets you do that.
6+
7+
Here's a very fast version:
8+
9+
```swift
func select<T>(from a: [T], count k: Int) -> [T] {
  var a = a
  for i in 0..<k {
    let r = random(min: i, max: a.count - 1)
    if i != r {
      swap(&a[i], &a[r])
    }
  }
  return Array(a[0..<k])
}
```
21+
22+
As often happens with these [kinds of algorithms](../Shuffle/), it divides the array into two regions. The first region is the selected items; the second region is all the remaining items.
23+
24+
Here's an example. Let's say the array is:
25+
26+
[ "a", "b", "c", "d", "e", "f", "g" ]
27+
28+
We want to select 3 items, so `k = 3`. In the loop, `i` is initially 0, so it points at `"a"`.
29+
30+
[ "a", "b", "c", "d", "e", "f", "g" ]
31+
i
32+
33+
We calculate a random number between `i` and `a.count - 1`, the last valid index of the array. Let's say this is 4. Now we swap `"a"` with `"e"`, the element at index 4, and move `i` forward:
34+
35+
[ "e" | "b", "c", "d", "a", "f", "g" ]
36+
i
37+
38+
The `|` bar shows the split between the two regions. `"e"` is the first element we've selected. Everything to the right of the bar we still need to look at.
39+
40+
Again, we ask for a random number between `i` and `a.count - 1`, but because `i` has shifted, the random number can never be less than 1. So we'll never again swap `"e"` with anything.
41+
42+
Let's say the random number is 6 and we swap `"b"` with `"g"`:
43+
44+
[ "e", "g" | "c", "d", "a", "f", "b" ]
45+
i
46+
47+
One more random number to pick, let's say it is 4 again. We swap `"c"` with `"a"` to get the final selection on the left:
48+
49+
[ "e", "g", "a" | "d", "c", "f", "b" ]
50+
51+
And that's it. Easy peasy. The loop runs in **O(k)** time because as soon as we've selected *k* elements, we're done. (Because `var a = a` works on a copy, the first swap also pays an **O(n)** cost to copy the array.)
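For readers on a recent Swift toolchain, here is a minimal sketch of the same idea. It assumes the standard library's `Int.random(in:)` and `swapAt(_:_:)` in place of the `random(min:max:)` helper and `swap` used above. The name `selectUnordered` and the precondition are illustrative additions, not part of the original code.

```swift
// Sketch only: `selectUnordered` is a hypothetical name, and Int.random(in:)
// stands in for the random(min:max:) helper used in the text.
func selectUnordered<T>(from a: [T], count k: Int) -> [T] {
    // Assumption: callers never ask for more items than the array holds.
    precondition(k >= 0 && k <= a.count, "k must be between 0 and a.count")

    var a = a  // work on a copy so the caller's array is left untouched
    for i in 0..<k {
        // Pick an index from the not-yet-selected region i...(a.count - 1).
        let r = Int.random(in: i..<a.count)
        a.swapAt(i, r)  // has no effect when i == r
    }
    return Array(a[0..<k])
}

let deck = Array(1...52)
let hand = selectUnordered(from: deck, count: 10)
print(hand)
```

Because each iteration draws only from the region to the right of `i`, an item that has been moved into the selected region is never picked a second time.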
52+
53+
However, there is one downside: this algorithm does not keep the elements in the original order. In the input array `"a"` came before `"e"` but now it's the other way around. If that is an issue for your app, you can't use this particular method.
54+
55+
Here is an alternative approach that does keep the original order intact, but is a little more involved:
56+
57+
```swift
func select<T>(from a: [T], count requested: Int) -> [T] {
  var examined = 0
  var selected = 0
  var b = [T]()

  while selected < requested {                          // 1
    examined += 1

    let r = Double(arc4random()) / 0x100000000          // 2

    let leftToExamine = a.count - examined + 1          // 3
    let leftToAdd = requested - selected

    if Double(leftToExamine) * r < Double(leftToAdd) {  // 4
      selected += 1
      b.append(a[examined - 1])
    }
  }
  return b
}
```
79+
80+
This algorithm uses probability to decide whether to include a number in the selection or not.
81+
82+
1. The loop steps through the array from beginning to end. It keeps going until we've selected *k* items from our set of *n*. Here, *k* is called `requested` and *n* is `a.count`.
83+
84+
2. Calculate a random number between 0 and 1. We want `0.0 <= r < 1.0`. The upper bound is exclusive; we never want `r` to be exactly 1. That's why we divide the result from `arc4random()` by `0x100000000` instead of the more usual `0xffffffff`.
85+
86+
3. `leftToExamine` is how many items we still haven't looked at. `leftToAdd` is how many items we still need to select before we're done.
87+
88+
4. This is where the magic happens. Basically, we're flipping a weighted coin: the current element is selected with probability `leftToAdd / leftToExamine`. If it comes up heads, we add the current array element to the selection; if it comes up tails, we skip it.
89+
90+
Interestingly enough, even though we use probability, this approach always guarantees that we end up with exactly *k* items in the output array.
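The four steps above can be sketched with newer standard-library APIs as follows. This is an illustrative variant, not the original code: `selectInOrder` is a hypothetical name, `Double.random(in: 0..<1)` is assumed as a stand-in for the `arc4random()` division, and the precondition is an added guard, since asking for more items than the array holds would otherwise read past the end of the array.

```swift
// Sketch only: `selectInOrder` is a hypothetical name; Double.random(in:) replaces
// the arc4random() division described in step 2.
func selectInOrder<T>(from a: [T], count requested: Int) -> [T] {
    // Assumption: requested never exceeds the number of available items.
    precondition(requested >= 0 && requested <= a.count,
                 "requested must be between 0 and a.count")

    var selected = [T]()
    selected.reserveCapacity(requested)

    for (examined, item) in a.enumerated() {
        let leftToAdd = requested - selected.count
        if leftToAdd == 0 { break }              // step 1: stop once we have enough

        let r = Double.random(in: 0..<1)         // step 2: 0.0 <= r < 1.0
        let leftToExamine = a.count - examined   // step 3: includes the current item

        // Step 4: select with probability leftToAdd / leftToExamine.
        if Double(leftToExamine) * r < Double(leftToAdd) {
            selected.append(item)
        }
    }
    return selected
}

print(selectInOrder(from: ["a", "b", "c", "d", "e", "f", "g"], count: 3))
```

Once `leftToExamine` equals `leftToAdd`, the test in step 4 succeeds for every possible `r`, which is why the output always contains exactly `requested` items.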
91+
92+
Let's walk through the same example again. The input array is:
93+
94+
[ "a", "b", "c", "d", "e", "f", "g" ]
95+
96+
The loop looks at each element in turn, so we start at `"a"`. We get a random number between 0 and 1, let's say it is 0.841. The formula at `// 4` multiplies the number of items left to examine by this random number. There are still 7 elements left to examine, so the result is:
97+
98+
7 * 0.841 = 5.887
99+
100+
We compare this to 3 because we wanted to select 3 items. Since 5.887 is greater than 3, we skip `"a"` and move on to `"b"`.
101+
102+
Again, we get a random number, let's say 0.212. Now there are only 6 elements left to examine, so the formula gives:
103+
104+
6 * 0.212 = 1.272
105+
106+
This *is* less than 3 and we add `"b"` to the selection. This is the first item we've selected, so two left to go.
107+
108+
On to the next element, `"c"`. The random number is 0.264, giving the result:
109+
110+
5 * 0.264 = 1.32
111+
112+
There are only 2 elements left to select, so this number must be less than 2. It is, and we also add `"c"` to the selection. The total selection is `[ "b", "c" ]`.
113+
114+
Only one item left to select but there are still 4 candidates to look at. Suppose the next random number is 0.718. The formula now gives:
115+
116+
4 * 0.718 = 2.872
117+
118+
For this element to be selected the number has to be less than 1, since there is only 1 element left to be picked. It isn't, so we skip `"d"`. Only three possibilities left -- will we make it?
119+
120+
The random number is 0.346. The formula gives:
121+
122+
3 * 0.346 = 1.038
123+
124+
Just a tiny bit too high. We skip `"e"`. Only two candidates left...
125+
126+
Note that now we're literally dealing with a coin toss: if the random number is less than 0.5 we select `"f"` and we're done. If it's 0.5 or greater, we go on to the final element. Let's say we get 0.583:
127+
128+
2 * 0.583 = 1.166
129+
130+
We skip `"f"` and look at the very last element. Whatever random number we get here, it should always select `"g"` or we won't have selected enough elements and the algorithm doesn't work!
131+
132+
Let's say our final random number is 0.999 (remember, it can never be 1.0 or higher). Actually, no matter what we choose here, the formula will always give a value less than 1:
133+
134+
1 * 0.999 = 0.999
135+
136+
And so the last element will always be chosen if we didn't have a big enough selection yet. The final selection is `[ "b", "c", "g" ]`. Notice that the elements are still in their original order, because we examined the array from left to right.
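To check a walkthrough like this one by machine, one option is to feed the same random numbers into the comparison from `// 4`. The sketch below does that. `replaySelection` and its `randoms` parameter are hypothetical names introduced here for illustration; the real function draws its own random numbers.

```swift
// Sketch only: `replaySelection` is a hypothetical helper that replaces the random
// source with a scripted list so that a hand-worked trace can be reproduced.
func replaySelection<T>(from a: [T], count requested: Int, randoms: [Double]) -> [T] {
    var selected = [T]()
    var draws = randoms.makeIterator()

    for (examined, item) in a.enumerated() {
        let leftToAdd = requested - selected.count
        if leftToAdd == 0 { break }
        guard let r = draws.next() else { break }  // ran out of scripted numbers

        let leftToExamine = a.count - examined      // includes the current item
        let take = Double(leftToExamine) * r < Double(leftToAdd)
        let verdict = take ? "select" : "skip"
        print("\(item): \(leftToExamine) * \(r) must be < \(leftToAdd) -> \(verdict)")

        if take { selected.append(item) }
    }
    return selected
}

let letters = ["a", "b", "c", "d", "e", "f", "g"]
let walkthrough = replaySelection(
    from: letters, count: 3,
    randoms: [0.841, 0.212, 0.264, 0.718, 0.346, 0.583, 0.999])
print(walkthrough)  // ["b", "c", "g"]
```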
137+
138+
Maybe you're not convinced yet... What if we always got 0.999 as the random value, would that still select 3 items? Well, let's do the math:
139+
140+
7 * 0.999 = 6.993 is this less than 3? no
141+
6 * 0.999 = 5.994 is this less than 3? no
142+
5 * 0.999 = 4.995 is this less than 3? no
143+
4 * 0.999 = 3.996 is this less than 3? no
144+
3 * 0.999 = 2.997 is this less than 3? YES
145+
2 * 0.999 = 1.998 is this less than 2? YES
146+
1 * 0.999 = 0.999 is this less than 1? YES
147+
148+
It always works! But does this mean that elements closer to the end of the array have a higher probability of being chosen than those in the beginning? Nope, all elements are equally likely to be selected. (Don't take my word for it: see the playground for a quick test that shows this in practice.)
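For readers who prefer an argument to an experiment, here is a short induction sketch of why the selection is uniform. It idealises `r` as exactly uniform on [0, 1). The symbols *m* (items left to examine, including the current one) and *j* (items left to add) are notation introduced here, not taken from the original article. The second line assumes the claim already holds when *m* - 1 items remain.

```math
\begin{aligned}
P(\text{current item is selected}) &= P(m \cdot r < j) = \frac{j}{m}, \qquad 0 \le j \le m \\
P(\text{a later item is selected}) &= \frac{j}{m}\cdot\frac{j-1}{m-1} + \frac{m-j}{m}\cdot\frac{j}{m-1} \\
&= \frac{j\,(m-1)}{m\,(m-1)} = \frac{j}{m}
\end{aligned}
```

Starting with *m* = *n* and *j* = *k*, every element ends up in the output with probability *k*/*n*, regardless of its position.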
149+
150+
Here's an example of how to test this algorithm:
151+
152+
```swift
let input = [
  "there", "once", "was", "a", "man", "from", "nantucket",
  "who", "kept", "all", "of", "his", "cash", "in", "a", "bucket",
  "his", "daughter", "named", "nan",
  "ran", "off", "with", "a", "man",
  "and", "as", "for", "the", "bucket", "nan", "took", "it",
]

let output = select(from: input, count: 10)
print(output)
print(output.count)
```
165+
166+
The performance of this second algorithm is **O(n)** as it may require a pass through the entire input array.
167+
168+
> **Note:** If `k > n/2`, then it's more efficient to do it the other way around and choose `k` items to remove.
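One possible reading of this note, sketched under assumptions: pick the `n - k` positions to drop with a partial shuffle of the indices, then keep everything else in its original order. `selectByExclusion` is a hypothetical name. This version still does O(n) work overall; what it saves is random draws, since it makes only `n - k` of them.

```swift
// Sketch only: `selectByExclusion` is a hypothetical name and just one way to
// interpret "choose items to remove".
func selectByExclusion<T>(from a: [T], count k: Int) -> [T] {
    // Assumption: callers never ask for more items than the array holds.
    precondition(k >= 0 && k <= a.count, "k must be between 0 and a.count")
    let dropCount = a.count - k

    // Partially shuffle the indices so the first dropCount entries form a random sample.
    var indices = Array(a.indices)
    for i in 0..<dropCount {
        let r = Int.random(in: i..<indices.count)
        indices.swapAt(i, r)
    }
    let dropped = Set(indices[0..<dropCount])

    // Keep every element whose index was not dropped; the original order is preserved.
    return a.indices.filter { !dropped.contains($0) }.map { a[$0] }
}

print(selectByExclusion(from: Array(0..<20), count: 15))
```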
169+
170+
Based on code from Algorithm Alley, Dr. Dobb's Magazine, October 1993.
171+
172+
*Written for Swift Algorithm Club by Matthijs Hollemans*
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
//: Playground - noun: a place where people can play
2+
3+
import Foundation
4+
5+
/* Returns a random integer in the range min...max, inclusive. */
6+
public func random(min min: Int, max: Int) -> Int {
7+
assert(min <= max)  // min == max is valid: it happens when i reaches the last index
8+
return min + Int(arc4random_uniform(UInt32(max - min + 1)))
9+
}
10+
11+
/*
12+
func select<T>(from a: [T], count k: Int) -> [T] {
13+
var a = a
14+
for i in 0..<k {
15+
let r = random(min: i, max: a.count - 1)
16+
if i != r {
17+
swap(&a[i], &a[r])
18+
}
19+
}
20+
return Array(a[0..<k])
21+
}
22+
*/
23+
24+
func select<T>(from a: [T], count requested: Int) -> [T] {
25+
var examined = 0
26+
var selected = 0
27+
var b = [T]()
28+
29+
while selected < requested {
30+
examined += 1
31+
32+
// Calculate random variable 0.0 <= r < 1.0 (exclusive!).
33+
let r = Double(arc4random()) / 0x100000000
34+
35+
let leftToExamine = a.count - examined + 1
36+
let leftToAdd = requested - selected
37+
38+
// Decide whether to use the next record from the input.
39+
if Double(leftToExamine) * r < Double(leftToAdd) {
40+
selected += 1
41+
b.append(a[examined - 1])
42+
}
43+
}
44+
return b
45+
}
46+
47+
48+
49+
let poem = [
50+
"there", "once", "was", "a", "man", "from", "nantucket",
51+
"who", "kept", "all", "of", "his", "cash", "in", "a", "bucket",
52+
"his", "daughter", "named", "nan",
53+
"ran", "off", "with", "a", "man",
54+
"and", "as", "for", "the", "bucket", "nan", "took", "it",
55+
]
56+
57+
let output = select(from: poem, count: 10)
58+
print(output)
59+
output.count
60+
61+
62+
63+
// Use this to verify that all input elements have the same probability
64+
// of being chosen. The "counts" dictionary should have a roughly equal
65+
// count for each input element.
66+
67+
/*
68+
let input = [ "a", "b", "c", "d", "e", "f", "g" ]
69+
var counts = [String: Int]()
70+
for x in input {
71+
counts[x] = 0
72+
}
73+
74+
for _ in 0...1000 {
75+
let output = select(from: input, count: 3)
76+
for x in output {
77+
counts[x] = counts[x]! + 1
78+
}
79+
}
80+
81+
print(counts)
82+
*/
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
2+
<playground version='5.0' target-platform='osx'>
3+
<timeline fileName='timeline.xctimeline'/>
4+
</playground>
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<Timeline
3+
version = "3.0">
4+
<TimelineItems>
5+
</TimelineItems>
6+
</Timeline>
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
//: Playground - noun: a place where people can play
2+
3+
import Foundation
4+
5+
/* Returns a random integer in the range min...max, inclusive. */
6+
public func random(min min: Int, max: Int) -> Int {
7+
assert(min <= max)  // min == max is valid: it happens when i reaches the last index
8+
return min + Int(arc4random_uniform(UInt32(max - min + 1)))
9+
}
10+
11+
/*
12+
Selects k items at random from an array of size n. Does not keep the elements
13+
in the original order. Performance: O(k).
14+
*/
15+
func select<T>(from a: [T], count k: Int) -> [T] {
16+
var a = a
17+
for i in 0..<k {
18+
let r = random(min: i, max: a.count - 1)
19+
if i != r {
20+
swap(&a[i], &a[r])
21+
}
22+
}
23+
return Array(a[0..<k])
24+
}
25+
26+
/*
27+
Selects `count` items at random from an array. Respects the original order of
28+
the elements. Performance: O(n).
29+
30+
Note: if `count > size/2`, then it's more efficient to do it the other way
31+
around and choose `count` items to remove.
32+
33+
Based on code from Algorithm Alley, Dr. Dobb's Magazine, October 1993.
34+
*/
35+
func select<T>(from a: [T], count requested: Int) -> [T] {
36+
var examined = 0
37+
var selected = 0
38+
var b = [T]()
39+
40+
while selected < requested {
41+
examined += 1
42+
43+
// Calculate random variable 0.0 <= r < 1.0 (exclusive!).
44+
let r = Double(arc4random()) / 0x100000000
45+
46+
let leftToExamine = a.count - examined + 1
47+
let leftToAdd = requested - selected
48+
49+
// Decide whether to use the next record from the input.
50+
if Double(leftToExamine) * r < Double(leftToAdd) {
51+
selected += 1
52+
b.append(a[examined - 1])
53+
}
54+
}
55+
return b
56+
}
