| 1 | +# Minimum Edit Distance |
| 2 | + |
| 3 | +The minimum edit distance measures the similarity of two strings *w* and *u* by the smallest total cost of the operations needed to transform *w* into *u* (or vice versa). |
| 4 | + |
| 5 | +### Algorithm using Levenshtein distance |
| 6 | + |
| 7 | +A common distance measure is given by the *Levenshtein distance*, which allows the following three transformation operations: |
| 8 | + |
| 9 | +* **Insertion** (*ε→x*) of a single symbol *x* with **cost 1**, |
| 10 | +* **Deletion** (*x→ε*) of a single symbol *x* with **cost 1**, and |
| 11 | +* **Substitution** (*x→y*) of a single symbol *x* by a single symbol *y* with **cost 1** if *x≠y* and with **cost 0** otherwise. |
| 12 | + |
| 13 | +When a string is transformed by a sequence of operations, the costs of the individual operations are added up; the minimum edit distance is the smallest such total over all sequences that produce the target string. For example, the string *Door* can be transformed into the string *Dolls* by the operations *o→l*, *r→l*, *ε→s*. No cheaper sequence exists, so the minimum edit distance is 3. |
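
The three operations can be turned directly into a recursion on the last symbols of both strings. The following is only an illustrative sketch, not part of the original implementation: the function name `naiveEditDistance` and the use of `Substring` are assumptions made for this example. It tries all three operations at every step and therefore takes exponential time in the worst case.

```swift
// Hypothetical helper for illustration: a direct recursive reading of the
// three Levenshtein operations. Exponential time; only for short strings.
func naiveEditDistance(_ w: Substring, _ u: Substring) -> Int {
    // If w is empty, every symbol of u has to be inserted.
    guard let x = w.last else { return u.count }
    // If u is empty, every symbol of w has to be deleted.
    guard let y = u.last else { return w.count }

    // Substitution x→y costs 0 if the symbols are equal and 1 otherwise.
    let substitution = naiveEditDistance(w.dropLast(), u.dropLast()) + (x == y ? 0 : 1)
    // Deletion x→ε removes the last symbol of w.
    let deletion = naiveEditDistance(w.dropLast(), u) + 1
    // Insertion ε→y appends the last symbol of u.
    let insertion = naiveEditDistance(w, u.dropLast()) + 1

    return min(substitution, deletion, insertion)
}

print(naiveEditDistance(Substring("Door"), Substring("Dolls")))  // prints 3
```

The same subproblems (pairs of prefixes) are recomputed many times in this version, which is what makes it slow.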
| 14 | + |
| 15 | +To avoid exponential time complexity, the minimum edit distance of two strings is usually computed using *dynamic programming*. For strings *w* and *u* of length *m* and *n*, respectively, a matrix with *m*+1 rows and *n*+1 columns is allocated: |
| 16 | + |
| 17 | +```swift |
| 18 | +var matrix = [[Int]](repeating: [Int](repeating: 0, count: n+1), count: m+1) |
| 19 | +``` |
| 20 | + |
| 21 | +Each cell `matrix[i][j]` holds the minimum edit distance between the prefix of *w* of length *i* and the prefix of *u* of length *j*, and is computed from cells that are already filled in. In a first step the matrix is initialized by filling the first row and the first column as follows: |
| 22 | + |
| 23 | +```swift |
| 24 | +// initialize matrix |
| 25 | +for index in 0...m { |
| 26 | + // the distance of any prefix of the first string to an empty second string |
| 27 | + // (starting at 0 keeps the range valid when the first string is empty) |
| 28 | + matrix[index][0] = index |
| 29 | +} |
| 30 | +for index in 0...n { |
| 31 | + // the distance of any prefix of the second string to an empty first string |
| 32 | + matrix[0][index] = index |
| 33 | +} |
| 33 | +``` |
| 34 | +Each remaining cell is then set to the minimum over the three operations, where the cost of an insertion, deletion, or substitution is added to the already computed cost in the corresponding neighboring cell. In this way the matrix is filled iteratively: |
| 35 | + |
| 36 | +```swift |
| 37 | +// compute Levenshtein distance |
| 38 | +for (i, selfChar) in self.enumerated() { |
| 39 | + for (j, otherChar) in other.enumerated() { |
| 40 | + if otherChar == selfChar { |
| 41 | + // substitution of equal symbols with cost 0 |
| 42 | + matrix[i+1][j+1] = matrix[i][j] |
| 43 | + } else { |
| 44 | + // minimum of the cost of insertion, deletion, or substitution added |
| 45 | + // to the already computed costs in the corresponding cells |
| 46 | + matrix[i+1][j+1] = min(matrix[i][j]+1, matrix[i+1][j]+1, matrix[i][j+1]+1) |
| 47 | + } |
| 48 | + |
| 49 | + } |
| 50 | +} |
| 51 | +``` |
| 52 | + |
| 53 | +After the matrix has been filled, the minimum edit distance can be read from the bottom-right cell and is returned. |
| 54 | + |
| 55 | +```swift |
| 56 | +return matrix[m][n] |
| 57 | +``` |
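
Put together, the steps above might look like the following self-contained sketch. It is written as a free function for easy testing, whereas the snippets above use `self` and `other` inside a `String` method; the name `minimumEditDistance` and its parameter names are assumptions made for this example.

```swift
// Hypothetical free-function version of the steps described above.
func minimumEditDistance(_ w: String, _ u: String) -> Int {
    // Convert to arrays so that single symbols can be indexed directly.
    let source = Array(w)
    let target = Array(u)
    let m = source.count
    let n = target.count

    // matrix[i][j] = distance between the first i symbols of w
    // and the first j symbols of u
    var matrix = [[Int]](repeating: [Int](repeating: 0, count: n + 1), count: m + 1)

    // First column and first row: distance to the empty string.
    for i in 0...m { matrix[i][0] = i }
    for j in 0...n { matrix[0][j] = j }

    for i in 0..<m {
        for j in 0..<n {
            if source[i] == target[j] {
                // substitution of equal symbols with cost 0
                matrix[i + 1][j + 1] = matrix[i][j]
            } else {
                // cheapest of substitution, insertion, and deletion, each with cost 1
                matrix[i + 1][j + 1] = min(matrix[i][j], matrix[i + 1][j], matrix[i][j + 1]) + 1
            }
        }
    }

    // The bottom-right cell holds the distance between the full strings.
    return matrix[m][n]
}

print(minimumEditDistance("Door", "Dolls"))  // prints 3
```

Running it on the example from the beginning reproduces the distance of 3 between *Door* and *Dolls*.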
| 58 | + |
| 59 | +This algorithm has a time complexity of Θ(*mn*). Because the full matrix is stored, its space complexity is Θ(*mn*) as well. |
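
Each cell only depends on the current row and the row directly above it, so the memory use can be reduced by keeping just two rows. This is a variation on the algorithm described above, not the implementation from this article; the function name `editDistanceTwoRows` is an assumption made for this sketch. The running time stays Θ(*mn*), while only Θ(*n*) integers are stored.

```swift
// Hypothetical two-row variant: same recurrence, but only the previous
// and the current matrix row are kept in memory.
func editDistanceTwoRows(_ w: String, _ u: String) -> Int {
    let source = Array(w)
    let target = Array(u)

    // Row 0: distance from the empty prefix of w to every prefix of u.
    var previous = Array(0...target.count)

    for (i, x) in source.enumerated() {
        // First entry: distance from the prefix of length i+1 to the empty string.
        var current = [i + 1] + [Int](repeating: 0, count: target.count)
        for (j, y) in target.enumerated() {
            if x == y {
                // substitution of equal symbols with cost 0
                current[j + 1] = previous[j]
            } else {
                // previous[j]: substitution, current[j]: insertion, previous[j + 1]: deletion
                current[j + 1] = min(previous[j], current[j], previous[j + 1]) + 1
            }
        }
        previous = current
    }

    return previous[target.count]
}

print(editDistanceTwoRows("Door", "Dolls"))  // prints 3
```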
| 60 | + |
| 61 | +### Other distance measures |
| 62 | + |
| 63 | +**todo** |
| 64 | + |
| 65 | +*Written for Swift Algorithm Club by Luisa Herrmann* |