performance - Efficient implementation of Damerau-Levenshtein distance -
i'm trying implement efficient clojure function compute damerau-levenshtein distance. i've decided use this algorithm (attached source should c++) computing levenshtein distance , add lines make work dld.
here i've created in common lisp (i hope help):
(defun damerau-levenshtein (x y) (declare (type string x y) #.*std-opts*) (let* ((x-len (length x)) (y-len (length y)) (v0 (apply #'vector (mapa-b #'identity 0 y-len))) (v1 (make-array (1+ y-len) :element-type 'integer)) (v* (make-array (1+ y-len) :element-type 'integer))) (do ((i 0 (1+ i))) ((= x-len) (aref v0 y-len)) (setf (aref v1 0) (1+ i)) (do ((j 0 (1+ j))) ((= j y-len)) (let* ((x-i (char x i)) (y-j (char y j)) (cost (if (char-equal x-i y-j) 0 1))) (setf (aref v1 (1+ j)) (min (1+ (aref v1 j)) (1+ (aref v0 (1+ j))) (+ (aref v0 j) cost))) (when (and (plusp i) (plusp j)) (let ((x-i-1 (char x (1- i))) (y-j-1 (char y (1- j))) (val (+ (aref v* (1- j)) cost))) (when (and (char-equal x-i y-j-1) (char-equal x-i-1 y-j) (< val (aref v1 (1+ j)))) (setf (aref v1 (1+ j)) val)))))) (rotatef v* v0 v1))))
now, fear cannot translate efficient , idiomatic clojure code (in functional style?). appreciate suggestion , think may quite useful many future readers too.
p.s. i've found this implementation, doubt if efficient , uses obsolete contrib
functions (deep-merge-with
, bool-to-binary
):
(defn damerau-levenshtein-distance [a b] (let [m (count a) n (count b) init (apply deep-merge-with (fn [a b] b) (concat ;;deletion (for [i (range 0 (+ 1 m))] {i {0 i}}) ;;insertion (for [j (range 0 (+ 1 n))] {0 {j j}}))) table (reduce (fn [d [i j]] (deep-merge-with (fn [a b] b) d (let [cost (bool-to-binary (not (= (nth (- 1)) (nth b (- j 1))))) x (min (+ ((d (- 1)) j) 1) ;;deletion (+ ((d i) (- j 1)) 1) ;;insertion (+ ((d (- 1)) (- j 1)) cost)) ;;substitution)) val (if (and (> 1) (> j 1) (= (nth (- 1)) (nth b (- j 2))) (= (nth (- 2)) (nth b (- j 1)))) (min x (+ ((d (- 2)) (- j 2)) ;;transposition cost)) x)] {i {j val}}))) init (for [j (range 1 (+ 1 n)) (range 1 (+ 1 m))] [i j]))] ((table m) n)))
i had write efficient levenshtein distance function in clojure calculate edits between ground truth text , ocr engine result. recursive implementation wasn't performant enough calculate levenshtein distance between 2 whole pages, implementation uses dynamic programming. instead of dropping down java 2d-arrays uses core.matrix handle matrix stuff. adding transposition stuff damerau-levenshtein should not hard.
(defn lev [str1 str2] (let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2))) len1 (count str1) len2 (count str2)] (mset! mat 0 0 0) (dotimes [i lein1] (mset! mat (inc i) 0 (inc i))) (dotimes [j len2] (mset! mat 0 (inc j) (inc j))) (dotimes [dj len2] (dotimes [di len1] (let [j (inc dj) (inc di)] (mset! mat j (cond (= (.charat ^string str1 di) (.charat ^string str2 dj)) (mget mat di dj) :else (min (inc (mget mat di j)) (inc (mget mat dj)) (inc (mget mat di dj)))))))) (mget mat len1 len2))))
hope helps
Comments
Post a Comment