In defense of GPL

When I first started coding, GPL was the most popular license for open source projects. It was the go-to license for someone dipping their toe into the FOSS world, unless you were developing software for FreeBSD.

But over the past few years, companies have pulled off a brilliant coup d'état and convinced up-and-coming programmers that the GPL is a virus. If you release your software under anything other than an MIT or BSD license, the programming community looks down upon that contribution.

Recently I've been seeing a lot of open source maintainers complaining about companies that use their software and demand bug fixes or features, but hardly contribute any code or money towards their projects.

This is the exact problem that the GPL was designed to solve. If a company finds value in your code and decides to build upon it, they can either contribute back to the community by making their product open source or pay you for an alternative license that allows them to keep their code closed source.

OpenSource is Socialism. People do OpenSource because they enjoy doing it or because they stepped up to fill a need. In a socialist community, you're not paid for the end product, and sharing (or forking) is encouraged. But commercial companies operate on the principle of Capitalism: they're trying to maximize their shareholders' value. Trying to appeal to a capitalist entity to support a socialist endeavor will not work. This is the reason why the GPL is designed the way it is. The GPL protects the OpenSource ecosystem from exploitation by capitalist entities.

But somehow the programming community decided that permissive licenses are the way forward. This is further encouraged by commercial companies, because now they can use all this quality software without paying a penny or contributing back to the community. This erodes the OpenSource ecosystem in the long run, because we are building software on the ashes of thousands of burnt-out programmers.

This is where we arrive at a fork in our journey. 

Should we all use GPL? No.

If you're doing OpenSource because you love what you do and want to see your work used by as many people as possible, go with a more permissive license. But don't expect companies or users to pay for the product. This is never going to work.

There is nothing wrong with trying to make a living through OpenSource. Dual licensing with GPL + a paid commercial license is a fantastic option. Don't succumb to the ivory tower programmers who demand that all OpenSource software must be permissively licensed. If a company wants to build their profits on your software, there is nothing wrong with asking them to pay you a fair share.

If you do take the more permissive license route, take care of your mental health and take steps to prevent burnout. Because, whether you like it or not, it's coming.

Alphabet Race - a kid's game

TL;DR: I built a little truck "game" to teach my son the alphabet.

I started watching http://javascript30.com a couple of days ago. It's a series of 30 videos that use vanilla JavaScript to do amazing things.

I have only done the first two videos in the series. Based on what I learned in those two videos, I decided to write a small game. My son is a fan of trucks, so I cobbled together a truck race that teaches him the alphabet.

The game is written using the latest version of JavaScript, so it will not work in old browsers (sorry). It is not designed for tablets or phones; you need a keyboard to play this game.

Game: http://alphabetrace.itsybits.xyz/

Source: https://github.com/amjith/alphabet_race/

My goal was to create an educational game that is entertaining but not addictive. 

You can send me feedback about the game via twitter or email.

Nostalgic Programming

For some unknown reason, I looked up GWBasic today and downloaded an emulator. GWBasic was the first programming language I learned, and I have fond memories of it. I loved the fact that I could switch to a graphics mode and start drawing circles and squares.

After about 30 minutes of fumbling around, my muscle memory kicked in and I started to write a simple program to draw some shapes on the screen. I asked Sempi to sit with me and help me with the drawing. He wanted me to draw a truck, so I decided to give it a shot. 

He lost interest midway when I started looking up various commands in the programming manual, but I stuck with it. 

Here's the creation in all its glory.

Needless to say, I had a lot of fun. 

FuzzyFinder - in 10 lines of Python

Introduction:

FuzzyFinder is a popular feature available in decent editors to open files. The idea is that you start typing partial strings from the full path, and the list of suggestions narrows down to match the desired file.

Examples: 

Vim (Ctrl-P)

Sublime Text (Cmd-P)

This is an extremely useful feature and it's quite easy to implement.

Problem Statement:

We have a collection of strings (filenames). We're trying to filter down that collection based on user input, which can be partial strings from the filename. Let's walk through an example. Here is a sample collection of filenames:
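    # A sample collection; these are the filenames used in the examples below.
    collection = ['django_migrations.py',
                  'django_admin_log.py',
                  'main_generator.py',
                  'migrations.py']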

When the user types 'djm' we are supposed to match 'django_migrations.py' and 'django_admin_log.py'. The simplest route to achieve this is to use regular expressions. 

Solutions:

Naive Regex Matching:

Convert 'djm' into 'd.*j.*m' and try to match this regex against every item in the list. Items that match are the possible candidates.
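A minimal sketch of this naive version (the function name is mine):

    import re

    def fuzzyfinder_naive(user_input, collection):
        # Convert 'djm' into the regex 'd.*j.*m' and keep the items that match.
        pattern = '.*'.join(user_input)
        regex = re.compile(pattern)
        return [item for item in collection if regex.search(item)]

Running it on our collection:

    >>> fuzzyfinder_naive('djm', collection)
    ['django_migrations.py', 'django_admin_log.py']
    >>> fuzzyfinder_naive('mig', collection)
    ['django_migrations.py', 'django_admin_log.py', 'main_generator.py', 'migrations.py']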

This got us the desired results for input 'djm'. But the suggestions are not ranked in any particular order.

In fact, for the second example with user input 'mig' the best possible suggestion 'migrations.py' was listed as the last item in the result.

Ranking based on match position:

We can rank the results based on the position of the first occurrence of the matching character. For the input 'mig', 'migrations.py' and 'main_generator.py' match starting at position 0, while 'django_migrations.py' first matches at position 7 and 'django_admin_log.py' at position 9.

Here's a sketch of the code for this version:
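    def fuzzyfinder_v2(user_input, collection):
        suggestions = []
        pattern = '.*'.join(user_input)    # converts 'mig' to 'm.*i.*g'
        regex = re.compile(pattern)
        for item in collection:
            match = regex.search(item)
            if match:
                # Rank by the position where the match starts.
                suggestions.append((match.start(), item))
        return [x for _, x in sorted(suggestions)]

For the input 'mig' this returns:

    >>> fuzzyfinder_v2('mig', collection)
    ['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']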

We made the list of suggestions a list of tuples, where the first item is the position of the match and the second item is the matching filename. When this list is sorted, Python sorts it based on the first item in the tuple and uses the second item as a tie breaker. The final list comprehension iterates over the sorted list of tuples and extracts just the second item, which is the filename we're interested in.

This got us close to the end result, but as the example output shows, it's not perfect. We see 'main_generator.py' as the first suggestion, but the user wanted 'migrations.py'.

Ranking based on compact match:

When a user starts typing a partial string, they will continue to type consecutive letters in an effort to find the exact match. When someone types 'mig' they are looking for 'migrations.py' or 'django_migrations.py', not 'main_generator.py'. The key here is to find the most compact match for the user input.

Once again, this is trivial to do in Python. When we match a string against a regular expression, the matched string is available via match.group().

For example, if the input is 'mig', the matching group from the 'collection' defined earlier is as follows:
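    'mig'       # from 'migrations.py'
    'mig'       # from 'django_migrations.py'
    'main_g'    # from 'main_generator.py'
    'min_log'   # from 'django_admin_log.py'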

We can use the length of the captured group as our primary rank and the starting position as our secondary rank. To do that, we add len(match.group()) as the first item in the tuple, match.start() as the second item, and the filename itself as the third. Python will sort this list based on the first item in the tuple (primary rank), then the second item (secondary rank), and fall back to the third item as the final tie-breaker.
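Here's a sketch of that version:

    def fuzzyfinder_v3(user_input, collection):
        suggestions = []
        pattern = '.*'.join(user_input)
        regex = re.compile(pattern)
        for item in collection:
            match = regex.search(item)
            if match:
                # Primary rank: length of the matched substring (compactness).
                # Secondary rank: start position. Filename breaks remaining ties.
                suggestions.append((len(match.group()), match.start(), item))
        return [x for _, _, x in sorted(suggestions)]

    >>> fuzzyfinder_v3('mig', collection)
    ['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']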

This produces the desired behavior for our input. We're not quite done yet.

Non-Greedy Matching:

There is one more subtle corner case that was caught by Daniel Rocco. Consider these two items in the collection ['api_user', 'user_group']. When you enter the word 'user' the ideal suggestion should be ['user_group', 'api_user']. But the actual result is:
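    ['api_user', 'user_group']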

Looking at this output, you'll notice that api_user appears before user_group. Digging in a little, it turns out the search 'user' expands to u.*s.*e.*r; notice that user_group has two r's, so the greedy pattern matches user_gr instead of the expected user. The longer match length forces the ranking of this match down, which again seems counterintuitive. This is easy to fix by using the non-greedy version of the regex (.*? instead of .*) when we build the pattern.
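Putting it all together, the final version looks like this:

    import re

    def fuzzyfinder(user_input, collection):
        suggestions = []
        pattern = '.*?'.join(user_input)   # converts 'djm' to 'd.*?j.*?m' (non-greedy)
        regex = re.compile(pattern)
        for item in collection:
            match = regex.search(item)
            if match:
                suggestions.append((len(match.group()), match.start(), item))
        return [x for _, _, x in sorted(suggestions)]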

Now that works for all the cases we've outlined. We've just implemented a fuzzy finder in 10 lines of code.

Conclusion:

That was the design process for implementing fuzzy matching in my side project pgcli, a REPL for PostgreSQL that can do auto-completion.

I've extracted fuzzyfinder into a stand-alone Python package. You can install it via 'pip install fuzzyfinder' and use it in your projects.

Thanks to Micah Zoltu and Daniel Rocco for reviewing the algorithm and fixing the corner cases.

If you found this interesting, you should follow me on twitter.

Epilogue:

When I first started looking into fuzzy matching in Python, I encountered the excellent library fuzzywuzzy. But the fuzzy matching done by that library is of a different kind: it uses Levenshtein distance to find the closest matching string from a collection. That is a great technique for auto-correcting spelling errors, but it doesn't produce the desired results when matching long names from partial substrings.

Python Profiling - Part 1

I gave a talk on profiling python code at the 2012 Utah Open Source Conference. Here are the slides and the accompanying code.

There are three parts to this profiling talk:

  • Standard Lib Tools - cProfile, Pstats
  • Third Party Tools - line_profiler, mem_profiler
  • Commercial Tools - New Relic

This is Part 1 of that talk. It covers:

  • cProfile module - usage
  • Pstats module - usage
  • RunSnakeRun - GUI viewer

Why Profiling:

  • Identify the bottlenecks.
  • Optimize intelligently. 

In God we trust, everyone else bring data

cProfile:

cProfile is a profiling module included in Python's standard library. It instruments the code and reports the time spent in each function and the number of times each function is called.

Basic Usage:

The sample code I'm profiling finds the lowest common multiple of two numbers: lcm.py

    # lcm.py - ver1
    def lcm(arg1, arg2):
        i = max(arg1, arg2)
        while i < (arg1 * arg2):
            if i % min(arg1, arg2) == 0:
                return i
            i += max(arg1, arg2)
        return arg1 * arg2

    lcm(21498497, 3890120)

Let's run the profiler.

$ python -m cProfile lcm.py 
     7780242 function calls in 4.474 seconds
    
    Ordered by: standard name
   
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    4.474    4.474 lcm.py:3(<module>)
         1    2.713    2.713    4.474    4.474 lcm.py:3(lcm)
   3890120    0.881    0.000    0.881    0.000 {max}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   3890119    0.880    0.000    0.880    0.000 {min}

Output Columns:

  • ncalls - number of calls to a function.
  • tottime - total time spent in the function without counting calls to sub-functions.
  • percall - tottime/ncalls
  • cumtime - cumulative time spent in a function and its sub-functions.
  • percall - cumtime/ncalls

It's clear from the output that the built-in functions max() and min() are each called almost four million times. This could be optimized by saving their results in variables instead of calling them on every iteration.
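For illustration, here's one way that optimization could look (a sketch of mine, not code from the talk):

    # lcm.py - ver2 (hypothetical optimized version)
    def lcm(arg1, arg2):
        bigger = max(arg1, arg2)     # computed once, instead of on every iteration
        smaller = min(arg1, arg2)
        i = bigger
        while i < (arg1 * arg2):
            if i % smaller == 0:
                return i
            i += bigger
        return arg1 * arg2

    lcm(21498497, 3890120)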

Pstats:

Pstats, also included in the standard library, is used to analyze profiles saved by the cProfile module.

Usage:

For bigger scripts, it's not feasible to analyze the cProfile output on the command line. The solution is to save the profile to a file and use Pstats to analyze it like a database. As an example, let's analyze shorten.py.

$ python -m cProfile -o shorten.prof shorten.py   # saves the profile to shorten.prof

$ ls
shorten.py shorten.prof

Let's analyze the profiler output to list the top 5 most frequently called functions.

$ python
>>> import pstats
>>> p = pstats.Stats('shorten.prof')   # Load the profiler output
>>> p.sort_stats('calls')              # Sort the results by the ncalls column
>>> p.print_stats(5)                   # Print the top 5 items

    95665 function calls (93215 primitive calls) in 2.371 seconds

   Ordered by: call count
   List reduced from 1919 to 5 due to restriction <5>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10819/10539    0.002    0.000    0.002    0.000 {len}
           9432    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
           6061    0.003    0.000    0.003    0.000 {isinstance}
           3092    0.004    0.000    0.005    0.000 /lib/python2.7/sre_parse.py:182(__next)
           2617    0.001    0.000    0.001    0.000 {method 'endswith' of 'str' objects}

This is quite tedious and not a lot of fun. Let's introduce a GUI so we can easily drill down.

RunSnakeRun:

This cleverly named GUI written in wxPython makes life a lot easier.

Install it from PyPI (it requires wxPython):

$ pip install SquareMap RunSnakeRun
$ runsnake shorten.prof     # load the profile in the GUI

The output is displayed using squaremaps, which clearly highlight the bigger pieces of the pie that are worth optimizing.

It also lets you sort by clicking the columns or drill down by double-clicking a piece of the SquareMap.

Conclusion:

That concludes Part 1 of the profiling series. All the tools except RunSnakeRun are available as part of the standard library. It is essential to introspect the code before we start shooting in the dark in the hope of optimizing it.

We'll look at line_profiler and mem_profiler in Part 2. Stay tuned.

You are welcome to follow me on twitter (@amjithr).

Memoization Decorator

Recently I had the opportunity to give a short 10-minute presentation on the memoization decorator at our local UtahPython Users Group meeting.

Memoization:

• Every time a function is called, save the results in a cache (map).
• Next time the function is called with the exact same args, return the value from the cache instead of running the function.

The code for a memoization decorator in Python is here: http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize
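The core idea fits in a few lines. Here's a minimal sketch (the wiki version linked above is more robust, e.g. about unhashable arguments):

    def memoized(func):
        cache = {}
        def wrapper(*args):
            # Return the cached result if we've seen these args before.
            if args in cache:
                return cache[args]
            result = func(*args)
            cache[args] = result
            return result
        return wrapper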

Example:

The typical recursive implementation of the Fibonacci calculation is pretty inefficient: O(2^n).

    def fibonacci(num):
        print 'fibonacci(%d)' % num
        if num in (0, 1):
            return num
        return fibonacci(num - 1) + fibonacci(num - 2)

    >>> math_funcs.fibonacci(4)   # 9 function calls
    fibonacci(4)
    fibonacci(3)
    fibonacci(2)
    fibonacci(1)
    fibonacci(0)
    fibonacci(1)
    fibonacci(2)
    fibonacci(1)
    fibonacci(0)
    3

But the memoized version makes it ridiculously efficient, O(n), with very little effort.

    from memoized import memoized   # assuming the wiki's decorator is saved as memoized.py

    @memoized
    def fibonacci(num):
        print 'fibonacci(%d)' % num
        if num in (0, 1):
            return num
        return fibonacci(num - 1) + fibonacci(num - 2)
        
    >>> math_funcs.fibonacci(4)   # 5 function calls
    fibonacci(4)
    fibonacci(3)
    fibonacci(2)
    fibonacci(1)
    fibonacci(0)
    3

We just converted an algorithm from exponential complexity to linear complexity by simply adding the memoization decorator.

Slides:

I generated the slides using LaTeX Beamer. But instead of writing raw LaTeX, I wrote the slides in reStructuredText (rst) and used the rst2beamer script to generate the .tex file.
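The pipeline looks something like this (the filenames are made up):

    $ rst2beamer slides.rst > slides.tex    # reStructuredText -> LaTeX Beamer source
    $ pdflatex slides.tex                   # LaTeX -> PDF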

Source:

The rst and tex files are available on GitHub:

https://github.com/amjith/User-Group-Presentations/tree/master/memoization_de...

Productive Meter

A few weeks ago I decided that I should suck it up and start learning how to develop for the web. After asking around among my faithful community brethren, I decided to learn Django from its docs.

::Django documentation is awesome::

Around this time I came across this post about Waking up at 5am to code. I tried it a few times and it worked wonders. I've been working on a small project that keeps track of my productivity on the computer. The concept is really simple: just log the window that is on top and find a way to display that data in a meaningful way.

Today's 5am session got me to a milestone on my project. I am finally able to visualize the time I spend using a decent looking graph, which is a huge milestone for someone who learned how to display HTML tables 3 weeks ago.


A huge thanks to my IRC friends and random geeks who wrote awesome blog posts and SO answers on every problem I encountered.

I will be open-sourcing the app pretty soon. Stay tuned.

Picking 'k' items from a list of 'n' - Recursion

Let me preface this post by saying I suck at recursion. But that has never stopped me from trying to master it. Here is my latest (successful) attempt at an algorithm that required recursion.

Background:

You can safely skip this section if you're not interested in the back story behind why I decided to code this up.

I was listening to Khan Academy videos on probability. I was particularly intrigued by the combinatorics video. The formula to calculate the number of combinations nCr was simple, but I wanted to print all the possible combinations of nCr.

Problem Statement:

Given 'ABCD', what are the possible outcomes if you pick 3 letters from it to form a combination without repetition (i.e. 'ABC' is the same as 'BAC')?

At first I tried to solve this using an iterative method and gave up pretty quickly. It was clearly designed to be a recursive problem. After 4 hours of breaking my head, I finally got a working algorithm using recursion. I was pretty adamant about not looking it up online, but I sought some help from IRC (thanks jtolds).

Code:

    def combo(w, l):
        # Returns every combination of l characters picked from the string w.
        lst = []
        if l < 1:
            return lst
        for i in range(len(w)):
            if l == 1:
                lst.append(w[i])
            # Prefix w[i] to each (l-1)-combination of the letters after it.
            for c in combo(w[i+1:], l-1):
                lst.append(w[i] + c)
        return lst

Output:

    >>> combinations.combo('abcde', 3)
    ['abc', 'abd', 'abe', 'acd', 'ace', 'ade', 'bcd', 'bce', 'bde', 'cde']
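Incidentally, Python's standard library can already do this; itertools.combinations gives the same answer (shown here just for comparison):

    >>> from itertools import combinations
    >>> [''.join(c) for c in combinations('abcde', 3)]
    ['abc', 'abd', 'abe', 'acd', 'ace', 'ade', 'bcd', 'bce', 'bde', 'cde']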

Thoughts:

• It helps to think about recursion with the assumption that an answer for step n-1 already exists.
• If you are getting partial answers, check the condition surrounding the return statement.
• Recursion is still not clear (or easy).

I have confirmed that this works for bigger data sets, and I am quite happy with this small victory.

Python Profiling

I did a presentation at our local Python User Group meeting tonight. It was well received, but shorter than I had expected. I should've added a lot more code examples.

We talked about the usage of cProfile, pstats, RunSnakeRun and timeit.

Here are the slides from the presentation:

The slides were done using LaTeX Beamer, but I wrote the slides in reStructuredText and used rst2beamer to create the .tex file, which was then converted to PDF using pdflatex.

The source code for the slides is available on GitHub.