FuzzyFinder - in 10 lines of Python

Introduction:

FuzzyFinder is a popular feature available in decent editors to open files. The idea is to start typing partial strings from the full path and the list of suggestions will be narrowed down to match the desired file. 

Examples: 

Vim (Ctrl-P)

Sublime Text (Cmd-P)

This is an extremely useful feature and it's quite easy to implement.

Problem Statement:

We have a collection of strings (filenames). We're trying to filter down that collection based on user input. The user input can be partial strings from the filename. Let's walk this through with an example. Here is a collection of filenames:

When the user types 'djm' we are supposed to match 'django_migrations.py' and 'django_admin_log.py'. The simplest route to achieve this is to use regular expressions. 

Solutions:

Naive Regex Matching:

Convert 'djm' into 'd.*j.*m' and try to match this regex against every item in the list. Items that match are the possible candidates.

This got us the desired results for input 'djm'. But the suggestions are not ranked in any particular order.

In fact, for the second example with user input 'mig' the best possible suggestion 'migrations.py' was listed as the last item in the result.

Ranking based on match position:

We can rank the results based on the position of the first occurrence of the matching character. For user input 'mig' the position of the matching characters are as follows:

Here's the code:

We made the list of suggestions to be tuples where the first item is the position of the match and second item is the matching filename. When this list is sorted, python will sort them based on the first item in tuple and use the second item as a tie breaker. On line 14 we use a list comprehension to iterate over the sorted list of tuples and extract just the second item which is the file name we're interested in.

This got us close to the end result, but as shown in the example, it's not perfect. We see 'main_generator.py' as the first suggestion, but the user wanted 'migration.py'.

Ranking based on compact match:

When a user starts typing a partial string they will continue to type consecutive letters in a effort to find the exact match. When someone types 'mig' they are looking for 'migrations.py' or 'django_migrations.py' not 'main_generator.py'. The key here is to find the most compact match for the user input.

Once again this is trivial to do in python. When we match a string against a regular expression, the matched string is stored in the match.group(). 

For example, if the input is 'mig', the matching group from the 'collection' defined earlier is as follows:

We can use the length of the captured group as our primary rank and use the starting position as our secondary rank. To do that we add the len(match.group()) as the first item in the tuple, match.start() as the second item in the tuple and the filename itself as the third item in the tuple. Python will sort this list based on first item in the tuple (primary rank), second item as tie-breaker (secondary rank) and the third item as the fall back tie-breaker. 

This produces the desired behavior for our input. We're not quite done yet.

Non-Greedy Matching

There is one more subtle corner case that was caught by Daniel Rocco. Consider these two items in the collection ['api_user', 'user_group']. When you enter the word 'user' the ideal suggestion should be ['user_group', 'api_user']. But the actual result is:

Looking at this output, you'll notice that api_user appears before user_group. Digging in a little, it turns out the search user expands to u.*s.*e.*r; notice that user_group has two rs, so the pattern matches user_gr instead of the expected user. The longer match length forces the ranking of this match down, which again seems counterintuitive. This is easy to change by using the non-greedy version of the regex (.*? instead of .*) on line 4. 

Now that works for all the cases we've outlines. We've just implemented a fuzzy finder in 10 lines of code.

Conclusion:

That was the design process for implementing fuzzy matching for my side project pgcli, which is a repl for Postgresql that can do auto-completion. 

I've extracted fuzzyfinder into a stand-alone python package. You can install it via 'pip install fuzzyfinder' and use it in your projects.

Thanks to Micah Zoltu and Daniel Rocco for reviewing the algorithm and fixing the corner cases.

If you found this interesting, you should follow me on twitter

Epilogue:

When I first started looking into fuzzy matching in python, I encountered this excellent library called fuzzywuzzy. But the fuzzy matching done by that library is a different kind. It uses levenshtein distance to find the closest matching string from a collection. Which is a great technique for auto-correction against spelling errors but it doesn't produce the desired results for matching long names from partial sub-strings.

Comments

Slight issue: the "solution" is overly slow. Using regexes is completely unnecessary. Try https://gist.github.com/kirbyfan64/5fad06f7a70e....
Nice article! Just a trivial question, is there some reason you use " '%s' % pattern " instead of just " pattern " when creating the regex?
@James: I should be using the pattern directly instead of '%s' % pattern. I had originally had that as '(%s)' % pattern. But later realized that I don't need the parens and took it out. But forgot to change the string interpolation.
@Ryan That's impressive how you got it in a single line. But it doesn't do what I'm doing though. It is doing straight up sub-string find. It doesn't do fuzzy matching. For eg, try this input: fuzzy_finder('djm', collection) Your solution comes back with an empty list. But with fuzzy matching you're supposed to return 'django_migration.py' and 'django_admin_log.py'. That's why I had to use regex. :)
Ah, that's what I figured, some modification in the process of working on the code. Just goes to show you why having multiple eyes on a piece of code is always a good thing! Anyways, interesting article, I like how simple your solution is!
@A posthaven user Ugh...I feel smart... :O But you still don't need regexes! I updated it again. You can just use map(s.find, p). This wouldn't be so bad had Python's regex engine not been a memory-hungry, slow beast. :)
@Ryan Could you deconstruct the snippet into more than one line? I get lost trying to track the flow.
Great! Thanks for sharing this. I'd love to have it in PyCharm, so I created this issue: https://youtrack.jetbrains.com/issue/PY-16279
@ Michał Thank you! Did you check PyCharm? I believe it might already have fuzzy matching for file open.
File open, yes. It works like a charm :-) but that's only one place I think... all others use standard matching, which sometimes is not what I want.
You should map re.escape() over the user input characters before joining them, to properly match special characters such as periods. That, or filter non-alphanumeric characters out of the input.
@Tal I do escape them in the library. I didn't want to complicate the examples in the blog post too much, so I left it out. But here's the code from the library: https://github.com/amjith/fuzzyfinder/blob/mast...
@A posthaven user The one liner solution is basically this with slight modifications: https://gist.github.com/anonymous/f61b0461ad76f...
For the same purpose you can use python standard library — difflib. It would correct mistakes and return matching ratio between two words. And it should work faster than regexps.
Great job! I'll need to give this a shot!
Awesome post! You have inspired me to try writing up my own fuzzy search algorithm. I'm curious about trying to construct an index for your pool of strings to search for in python.
@Christopher I'm glad you were inspired. What do you mean by building an index? An in memory datastructure that preprocesses the list into a Trie or something similar? I'd be very interested in what you come up with. :)
In this https://gist.github.com/anonymous/f61b0461ad76f... solution 'nohtyp' matches 'python', which seems to be wrong behavior. Each "key" array must be checked so that the elements in it were in ascending order.
Nice but far from ideal when it comes to ordering of the results... When I search for example for text "abc" in ["a?b?c?", "a??b??c??abc"] then I actually would prefer to have the second item before the first one because it actually contains the whole text. But regexp will find only the first match of "abc" in "a??b??c" and ignore the rest of the string. And because "a??b??c" is longer than "a?b?c", it will order it wrong. I think the algorithm should at least aim to find the shortest match in each string if there are more than one. Please correct me if I am wrong, regexp is not among my strengths. :) Update: I tried to improve that with sifting through all matches using re.finditer() and get the shortest one. But this will not help unfortunately, because it searches from the left and finds only non-overlapping matches. So finding "abc" in "a?b?abc" will find only one match corresponding to the full string instead of the last three letters. I think there has to be some completely different solution. I think maybe the solution should be based on the accepted answer here: https://stackoverflow.com/questions/2148700/how... But anyway, after playing a bit with this, I am more and more sceptical. Positive matching "abc" with "blahblahablahblahblahcblahblah" with just make the user ask: "WTF!? I was not searching for this!"
@Vladimir The results are not simply based on the regex matches alone. It does take into consideration the shortest match. The blog post actually goes into a variety of steps to improve the simple regex match in many ways.
@Ryan, mapping with str.find doesn't find the characters in order. If I search with "fir", it can return "rif", despite it not matching.
You forgot/didn't bother with case insensitivity, OP
@Jacob You're right I didn't add the case sensitivity to keep the blog post succinct. But I'm sure you've figured out by now that it's trivial to add support for case sensitivity. :)
@OP There is a sort of failure in here. Even my "correct" str.find() algorithm doesn't do this case properly; it doesn't find the "best" match. For example, if the search term is "fi" and you're checking "affirmative", the match becomes "ffi" instead of the better "fi". My regex-fu is lacking a little. Do you know how to fix this? My best guess is to use a regex search that finds *every* match on the item being checked, then pick the best match, which would be the first of the shortest ones. What do you think?
Sorry to be kind of spammy, but I figured it out. Here's my code: def fuzzyfinder2(user_input, collection): suggestions=[] pattern = '(?=(' + '.*?'.join(user_input.lower()) + '))' regex = re.compile(pattern) for item in collection: match = best_match(regex.findall(item.lower())) if match: suggestions.append((len(match), item.lower().index(match), item)) return [(item, length, start) for length, start, item in sorted(suggestions)] def best_match(matches): if matches: return sorted(matches, key=len)[0] else: return None This solves the problem of ignoring case as well by calling .lower() on the input when creating the pattern and on the item of the collection in the regex call. To get the best match, you have to use findall() instead of search(), but that doesn't normally do overlapping matches, so you have to put in a lookahead on the pattern, which you can see above. Then, since that just returns a list of strings (with this pattern), you sort it by length and choose the first one. Also, again, since the matches are just a list of strings instead of a list of matches, you have to get your sorting data a different way, which you see with len(match) instead of len(match.group()) and item.lower().index(match) instead of match.start(). This unfortunately doesn't fit in 10 lines, BUT you could replace the line with the call to best_match with matches = regex.findall(item.lower()) match = matches if not matches else sorted(matches)[0] which only bumps the line count up by one. Now, if you don't count the import, it's back to 10 lines. You may be able to find a different way to compress it, but I'm not really interested in doing that.
Sorry, I forgot to remove a few small changes to the code, too, that I was using for debug purposes (which actually helped me find the "best match" problem). Ignore any differences that aren't talked about.
nice work!

Add a comment