Commit 58392bc

Add run-length encoding
1 parent 2b9689d commit 58392bc

14 files changed: +1690 -1 lines changed

README.markdown

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ Bad sorting algorithms (don't use these!):

 ### Compression

-- Run-Length Encoding (RLE)
+- [Run-Length Encoding (RLE)](Run-Length Encoding)
 - Huffman Encoding

 ### Miscellaneous

Run-Length Encoding/README.markdown

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Run-Length Encoding (RLE)

RLE is probably the simplest way to do compression. Let's say you have data that looks like this:

    aaaaabbbcdeeeeeeef...

then RLE encodes it as follows:

    5a3b1c1d7e1f...

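The count-then-value idea above can be sketched in a few lines. This is only an illustration written for current Swift: `runLengthEncodeText` is a hypothetical helper that is not part of this commit, and it works on characters rather than raw bytes.

```swift
// Hypothetical helper for illustration; not part of the commit.
// Turns "aaaaabbbc" into "5a3b1c" by writing each run as count + character.
func runLengthEncodeText(_ text: String) -> String {
    var result = ""
    var current: Character? = nil   // character of the run being counted
    var count = 0                   // length of that run so far

    for ch in text {
        if ch == current {
            count += 1
        } else {
            // The run ended: write it out (unless this is the very first character).
            if let c = current {
                result += "\(count)\(c)"
            }
            current = ch
            count = 1
        }
    }
    // Write out the final run.
    if let c = current {
        result += "\(count)\(c)"
    }
    return result
}

print(runLengthEncodeText("aaaaabbbcdeeeeeeef"))  // prints "5a3b1c1d7e1f"
```

Note that this textual form becomes ambiguous as soon as the input itself contains digits, which is one reason a real format needs stricter rules for telling counts and values apart.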
Instead of repeating bytes, you first write how many times that byte repeats and then the byte's actual value. If the data has a lot of "byte runs", that is, long stretches of the same byte, then RLE can save quite a bit of space. It works quite well on images that contain large areas of a single color.

There are many different ways you can implement RLE. Here's an extension of `NSData` that does a version of RLE inspired by the old [PCX image file format](https://en.wikipedia.org/wiki/PCX).

The rules are these:

- Each byte run, i.e. when a certain byte value occurs more than once in a row, is compressed using two bytes: the first byte records the number of repetitions, the second records the actual value. The first byte is stored as: `191 + count`. Because a byte cannot hold a value above 255, encoded byte runs can never be more than 64 bytes long (191 + 64 = 255).

- A single byte in the range 0 - 191 is not compressed and is copied without change.

- A single byte in the range 192 - 255 is represented by two bytes: first the byte 192 (meaning a run of 1 byte), followed by the actual value.

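To make the three rules concrete, here is a minimal sketch that encodes one run at a time. It is written for current Swift and uses plain `[UInt8]` arrays; `encodeRun` is a hypothetical helper, not something the commit defines.

```swift
// Hypothetical helper, not part of the commit: encodes one run of `count`
// identical bytes according to the three rules above.
func encodeRun(byte: UInt8, count: Int) -> [UInt8] {
    precondition((1...64).contains(count), "a run must be 1 to 64 bytes long")
    if count > 1 || byte >= 192 {
        // A real run, or a single byte >= 192:
        // write the control byte (191 + count), then the value.
        return [UInt8(191 + count), byte]
    } else {
        // A single byte below 192 is copied unchanged.
        return [byte]
    }
}

print(encodeRun(byte: 0x61, count: 5))   // [196, 97]
print(encodeRun(byte: 0x63, count: 1))   // [99]
print(encodeRun(byte: 200, count: 1))    // [192, 200]
print(encodeRun(byte: 0, count: 64))     // [255, 0]
```

The last call shows why 64 is the limit: the control byte for a 64-byte run is already 255.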
Here is the compression code. It returns a new `NSData` object containing the run-length encoded bytes:

```swift
extension NSData {
  public func compressRLE() -> NSData {
    let data = NSMutableData()
    if length > 0 {
      var ptr = UnsafePointer<UInt8>(bytes)
      let end = ptr + length

      while ptr < end {                                       // 1
        var count = 0
        var byte = ptr.memory

        // Check ptr < end before reading ptr.memory, so we never read past the buffer.
        while ptr < end && ptr.memory == byte && count < 64 { // 2
          ptr = ptr.advancedBy(1)
          count += 1
        }

        if count > 1 || byte >= 192 {                         // 3
          var size = 191 + UInt8(count)
          data.appendBytes(&size, length: 1)
          data.appendBytes(&byte, length: 1)
        } else {                                              // 4
          data.appendBytes(&byte, length: 1)
        }
      }
    }
    return data
  }
}
```

How it works:

1. We use an `UnsafePointer` to step through the bytes of the original `NSData` object.

2. At this point we've read the current byte value into the `byte` variable. If the next byte is the same, then we keep reading until we find a byte value that is different, or we reach the end of the data. We also stop if the run is 64 bytes because that's the maximum we can encode.

3. Here, we have to decide how to encode the bytes we just read. The first possibility is that we've read a run of 2 or more bytes (up to 64). In that case we write out two bytes: the length of the run followed by the byte value. But it's also possible we've read a single byte with a value >= 192. That will also be encoded with two bytes.

4. The third possibility is that we've read a single byte < 192. That simply gets copied to the output verbatim.

You can test it like this in a playground:

```swift
let originalString = "aaaaabbbcdeeeeeeef"
let utf8 = originalString.dataUsingEncoding(NSUTF8StringEncoding)!
let compressed = utf8.compressRLE()
```

The compressed `NSData` object should be `<c461c262 6364c665 66>`. Let's decode that by hand to see what has happened:

    c4    This is 196 in decimal. It means the next byte appears 5 times.
    61    The data byte "a".
    c2    The next byte appears 3 times.
    62    The data byte "b".
    63    The data byte "c". Because this is < 192, it's a single data byte.
    64    The data byte "d". Also appears just once.
    c6    The next byte will appear 7 times.
    65    The data byte "e".
    66    The data byte "f". Appears just once.

So that's 9 bytes encoded versus 18 original. That's a savings of 50%. Of course, this was only a simple test case... If your original data has no byte runs at all, then RLE saves nothing, and if you get unlucky and every one of those bytes is also 192 or higher, this method will actually make the encoded data twice as large! So it really depends on the input data.

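To see how much the result depends on the input, here is a rough sketch that applies the same rules to a few hand-picked inputs. It is written for current Swift and works on `[UInt8]` arrays; the free function `compressRLE(_:)` is a hypothetical stand-in for the `NSData` extension above, not the commit's code.

```swift
// Hypothetical re-implementation for illustration; the commit's version is an
// NSData extension. The encoding rules are the same.
func compressRLE(_ input: [UInt8]) -> [UInt8] {
    var output: [UInt8] = []
    var i = 0
    while i < input.count {
        let byte = input[i]
        var count = 0
        // Count how often `byte` repeats, stopping at the 64-byte limit.
        while i < input.count && input[i] == byte && count < 64 {
            i += 1
            count += 1
        }
        if count > 1 || byte >= 192 {
            output.append(UInt8(191 + count))   // control byte
            output.append(byte)                 // value
        } else {
            output.append(byte)                 // single byte < 192, copied as-is
        }
    }
    return output
}

let sample = Array("aaaaabbbcdeeeeeeef".utf8)
print(compressRLE(sample).count)        // 9, down from 18 bytes

let longRun = [UInt8](repeating: 7, count: 64)
print(compressRLE(longRun).count)       // 2, down from 64 bytes

let noRunsLow = (0..<64).map { UInt8($0) }
print(compressRLE(noRunsLow).count)     // 64, no savings but no growth either

let noRunsHigh = (192...255).map { UInt8($0) }
print(compressRLE(noRunsHigh).count)    // 128, twice the original 64 bytes
```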
Here is the decompression code:

```swift
  public func decompressRLE() -> NSData {
    let data = NSMutableData()
    if length > 0 {
      var ptr = UnsafePointer<UInt8>(bytes)
      let end = ptr + length

      while ptr < end {
        var byte = ptr.memory                    // 1
        ptr = ptr.advancedBy(1)

        if byte < 192 {                          // 2
          data.appendBytes(&byte, length: 1)

        } else if ptr < end {                    // 3
          var value = ptr.memory
          ptr = ptr.advancedBy(1)

          for _ in 0 ..< byte - 191 {
            data.appendBytes(&value, length: 1)
          }
        }
      }
    }
    return data
  }
```

1. Again, this uses an `UnsafePointer` to read the `NSData`. Here we read the next byte; this is either a single value less than 192, or the start of a byte run.

2. If it's a single value, then it's just a matter of copying it to the output.

3. But if the byte is the start of a run, we have to first read the actual data value and then write it out repeatedly.

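The same three steps can be sketched without pointers. This version is written for current Swift and works on `[UInt8]` arrays; the free function `decompressRLE(_:)` is a hypothetical stand-in for the `NSData` method above, and the input bytes are the ones decoded by hand earlier.

```swift
// Hypothetical re-implementation for illustration; the commit's version is an
// NSData extension that walks the bytes with an UnsafePointer.
func decompressRLE(_ input: [UInt8]) -> [UInt8] {
    var output: [UInt8] = []
    var i = 0
    while i < input.count {
        let byte = input[i]                      // step 1: read the next byte
        i += 1
        if byte < 192 {
            output.append(byte)                  // step 2: a single value, copy it
        } else if i < input.count {
            let value = input[i]                 // step 3: read the value...
            i += 1
            // ...and write it (byte - 191) times.
            output.append(contentsOf: repeatElement(value, count: Int(byte) - 191))
        }
        // A run marker with no value after it (truncated input) is skipped,
        // which matches the behavior of the code above.
    }
    return output
}

let encoded: [UInt8] = [0xC4, 0x61, 0xC2, 0x62, 0x63, 0x64, 0xC6, 0x65, 0x66]
let decoded = decompressRLE(encoded)
print(String(decoding: decoded, as: UTF8.self))   // prints "aaaaabbbcdeeeeeeef"
```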
To turn the compressed data back into the original, you'd do:

```swift
let decompressed = compressed.decompressRLE()
let restoredString = String(data: decompressed, encoding: NSUTF8StringEncoding)
```

And now `originalString == restoredString` must be true!

Footnote: The original PCX implementation is slightly different. There, a byte value of 192 (0xC0) means that the following byte will be repeated 0 times. This also limits the maximum run size to 63 bytes. Because it makes no sense to store bytes that don't occur, in my implementation 192 means the next byte appears once, and the maximum run length is 64 bytes.

This was probably a trade-off when they designed the PCX format way back when. If you look at it in binary, the upper two bits indicate whether a byte is compressed. (If both bits are set then the byte value is 192 or more.) To get the run length you can simply do `byte & 0x3F`, giving you a value in the range 0 to 63.

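As a small sketch of that difference, the two ways of reading a control byte can be put side by side. Both functions are hypothetical helpers written for current Swift; neither appears in the commit, and `pcxRunLength` only follows the bit-mask description in the footnote rather than any particular PCX decoder.

```swift
// Hypothetical helper: PCX-style reading of a control byte, as described above.
// Returns nil for a literal byte, otherwise a run length in the range 0...63.
func pcxRunLength(_ byte: UInt8) -> Int? {
    // Both upper bits set (byte >= 192) marks the start of a run.
    guard (byte & 0xC0) == 0xC0 else { return nil }
    return Int(byte & 0x3F)
}

// Hypothetical helper: the reading used by this README's scheme.
// Returns nil for a literal byte, otherwise a run length in the range 1...64.
func readmeRunLength(_ byte: UInt8) -> Int? {
    guard byte >= 192 else { return nil }
    return Int(byte) - 191
}

print(pcxRunLength(0xC0) as Any, readmeRunLength(0xC0) as Any)   // Optional(0) Optional(1)
print(pcxRunLength(0xC4) as Any, readmeRunLength(0xC4) as Any)   // Optional(4) Optional(5)
print(pcxRunLength(0xFF) as Any, readmeRunLength(0xFF) as Any)   // Optional(63) Optional(64)
print(pcxRunLength(0x61) as Any, readmeRunLength(0x61) as Any)   // nil nil
```

The bit mask is cheaper to compute than a subtraction, which fits the guess above that the PCX design was a trade-off.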
*Written for Swift Algorithm Club by Matthijs Hollemans*
