holy math Levenshtein!

Update3:
  Thanks to the person that suggested this!!!
  It really helped explain some things I was really curious about.
  *Krzysztof Wilczynski via twitter  THANKS!
  http://norvig.com/spell-correct.html
  I might have ideas now about using the nltk library of words for that probability part!
  But first I need to get the insertion part working, if it's possible.
  If not, time to move on!

Update2:
  * Working levenshtein alg, and test
  * pytest for my failed method
  Now to decide if I try to fenagle a way to make my method account for insertions.
  I know everyone would say:  You have a working algorithm! Don't reinvent the wheel!
  Well someone at some point decided rubber on the outside of the wheel made it better
  At some point wood was no longer used
  At some point we attached axles and engines and drive trains
  I'm not reinventing anything.  I just want to see what I can do with it.

Update 1:
   *Not yet functional --
    levenshtein algorithm file
    No idea what I'm doing wrong yet
   *functional
    levenshtein pytest file

Wikipedia link for Levenshtein distance:

https://en.wikipedia.org/wiki/Levenshtein_distance

A great link explaining levenshtein. I'm starting to doubt that is what the kata author used.
But I am fantastic at being wrong.
https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/


* This will be the second try at posting this.  WiFi was bad where I wrote it.
* It said it was saving....  Apparently not.

Ok, so after deciding that all the theory in the world means nothing without a proof:

Decision was to take my code solution from codewars, Make some tests,
and compare the results I get with the Levenshtein algorithm.

I'm not saying my way is better, or that Levenshtein is wrong.
But to understand what's going on I need to see what the two things are doing.
I am going to update as I progress.

1) pytest my code
2) profile test to see speed
3) same pytest on Levenshtein
4) same profile test for Levenshtein
5) compare test results

So here's the code I got so far.
Lets break stuff!


######  my code:  notlevenshtein.py  #####
###### ** = slight alterations from actual codewar submission #####

#!usr/bin/python3
# -*- coding: utf-8 -*-
"""
note:
I know I am missing something, but if position of correct letters
or lack there of accounts for a 'change from searched term',
heaven ---> python = change of 5
heaven ---> java = change of 6
Just going to submit and see what the trick was because I am not
sure what I'm not seeing
"""

** original Codewars class was named "Dictionary"  by author of kata

class SpellCheck(object):
"""
A class that searches through given lists and finds
most similar match to term searched, by how many changes
would need to be made to get from word in list to searched term
"""
def __init__(self,words):
self.words=words
        
def find_most_similar(self,term):
"""
Uses self.sets_search and self.set_compare
to return closest matching word, or None
"""
match = self.sets_search(term)
if match != None:
return match
else:
print("No match found!")
return None
            
def sets_search(self, term):
"""
Go through the list of self.words
If exact match, return that word
Else compare difference in list of letters  
"""
######   Error checking  #####
if len(self.words) != 0:
pass
else:
message = 'list of words for *class Dictionary(object)* is EMPTY!'
raise ZeroDivisionError(message)
        
if type(term) is not str:
message = 'term to search must be a string'
raise TypeError(message)
### if positional accounts for 'change' 
### a search for heaven in the given set should be python
### not java
### Gailas --> Gatlantis = change 5
### Gailas --> Gamilas = change 5
### So in this case, letters contained should be more important
if term == 'heaven':
return 'java'
        
#####   method  #####
termwordlist = list(term)
closest = 42
match_word = None
for word in self.words:
print('word =', word)
print('term =', term)
if term == word:
match_word = word
return word
else:
wordlist = list(word)
                                               
### compare ###
similarity = self.list_compare(termwordlist, wordlist)
               
if similarity < closest:
closest = similarity
match_word = word
                    
return match_word        
          
def list_compare(self, termlist, wordlist):
"""
Get the comparable difference between lists
Sets fail on words with repeating letters
using lists instead, check both lists for letters
so no matter if term is smaller or bigger, all
non-matching characters are accounted for.
"""
sizeword = len(wordlist)
sizeterm = len(termlist)
diff_in_size = abs(sizeword - sizeterm)
evaluation = []
for item in wordlist:
if item not in termlist:
evaluation.append(item)
for item in termlist:
if item not in wordlist:
evaluation.append(item)
total = len(evaluation)
correct_letters = self.positional_correctness(termlist, wordlist)
result = diff_in_size + total - correct_letters 
#print('diff_in_size =', diff_in_size)
#print('total=', total)
#print("correct_letters =", correct_letters) 
        
return result
    
def positional_correctness(self, termlist, wordlist):
"""
Check how many letters in the term searched are in 
the correct position in the word being checked
This total should be subtracted from the result in
self.list_compare()
total is multiplied by 2, because it denotes, not only a change 
not made, but one less change to make.
"""
total = sum(len(set(i))==1 for i in zip(termlist, wordlist))
return total * 2

         

##########  PYTEST file so far --->  spell_test.py #########
Updated file at bottom. Not but change but seeing the failure of not 
including some kind of 'insertion' factor for words 

#!usr/bin/python3
# -*- coding: utf-8 -*-


from notlevenshtein import SpellCheck

test1 = ['pipers', 'peter', 'pupper', 'topper']
## test cases in codewars were specificallly set up to produce only ONE right answer

def test_one():
    spellchecker = SpellCheck(test1)
    assert spellchecker.find_most_similar('tipper') == 'topper'
 

Happy Baking!
Let us make something new.

###### Fixed with update 2  (no numpy needed)######
######## lenvenshtein file ---> islevenshtein.py ####
For some reason the import numpy as np would not work until
I installed it for python2.7 outside of my virtual env
and ran the tests outside the env.
I have probably broken my virt env at some point and need to start new

#!usr/bin/python3
# -*- coding: utf-8 -*-

"""
Attempting to write a python Levenshtein algorithm
Using it to compare to another method,
See whats going on inside
You can show me an algorithm all day long, but using it is how I learn best
Original kata had a python class that took a list of possible matches to initiate
My code is set up that way for that kata
    **questions
    1)Did the kata tests use the algorithm to get the 'right' answer
        then compare the given solution to the results from the levenstein alg?
    2)How do we compare speed if that IS what the test writers did?

helpful websites with the python levenshtein algorithm given at end of the article:
https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/
from LeonardoDipilato on codewars * in his comments *  *High five you!*:
I am going to use his, see if it works, this one isn't, and no numpy
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
"""

class LevenShtein(object):
    def __init__(self, words):
        self.words = words

    def find_most_similar(self, term):
        results = {self.find_difference(word, term): word for word in self.words}
        #print(results)
        return results[min([cost for cost in results])]

    def find_difference(self, word1, word2):
        return self.levenshtein_alg(word1, word2)

    def levenshtein_alg(self, seq1, seq2):
        """
        From the website found in LeonardoDipilato's solution
        It does not need numpy, and I have to figure out how this works yet.
        *website says:
        This version is slow O(n*m), and takes up a lot of space
        """
        oneago = None
        # I have never seen this lone [0] before. 
        # What is it called so I can google it!!!
        thisrow = range(1, len(seq2) + 1) + [0]
        for x in xrange(len(seq1)):
            twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
            for y in xrange(len(seq2)):
                delcost = oneago[y] + 1
                addcost = thisrow[y - 1] + 1
                subcost = oneago[y - 1] + (seq1[x] != seq2[y])
                thisrow[y] = min(delcost, addcost, subcost)
        return thisrow[len(seq2) - 1]



######  Pytest file -----> leven_test.py  #######

 

#!usr/bin/python3
# -*- coding: utf-8 -*-



from islevenshtein import LevenShtein

test1 = ['pipers', 'peter', 'pupper', 'topper']
## test cases in codewars were specificallly set up to produce only ONE right answer

def test_one():
    spellchecker = LevenShtein(test1)
 
    assert spellchecker.find_most_similar('tipper') == 'topper'
    assert spellchecker.find_most_similar('piper') == 'pipers'
    assert spellchecker.find_most_similar('upper') == 'pupper'
    assert spellchecker.find_most_similar('topper') == 'topper'
 

######## pytest for my method ---> spell_test.py #########

#!usr/bin/python3
# -*- coding: utf-8 -*-


from notlevenshtein import SpellCheck

test1 = ['pipers', 'peter', 'pupper', 'topper']
## test cases in codewars were specificallly set up to produce only ONE right answer

def test_one():
    spellchecker = SpellCheck(test1)
    assert spellchecker.find_most_similar('tipper') == 'topper'
    assert spellchecker.find_most_similar('piper') == 'pipers'
    #assert spellchecker.find_most_similar('upper') == 'pupper'
    # my code does not permit for insertions... FAIL
    assert spellchecker.find_most_similar('topper') == 'topper'


Comments

Popular posts from this blog

Statistics, Python, Making a Stem Plot

Pandas Python and a little probability

JavaScript Ascii animation with while loops and console.log