I have been told many times that contributing to open source projects is a great way to learn and give back to the community at the same time. Even if you cannot contribute to the code base itself, you can still contribute to a project’s documentation. So, how do you make impactful contributions with minimal effort in the least amount of time?
Make a bot! Program it to automatically scan open source repos on GitHub, check each project’s documentation for potential misspellings, and report its findings for a human to review.
# Getting the documentation
To narrow down the scope, I only selected some public repos that I am interested in contributing to. Since I’ve been working with node and npm packages a lot, I wanted to look for repos of npm packages. I saw that www.npmjs.com lists packages with links to their repos. Using a Scrapy script, I scraped the GitHub repo links of the top 200 most depended-on packages. To make my life easier, I only look for a README.md at the root directory of each project. If it’s there, I save the raw body to a CSV file.
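I used Scrapy for the actual crawl; as a simplified sketch of just the README-fetching step, here is roughly what it looks like with `requests` instead (the repo URLs, branch name, and output filename are assumptions for illustration):

```python
import csv

import requests  # pip install requests


def raw_readme_url(repo_url, branch="master"):
    """Turn a GitHub repo link into the raw URL of its root README.md.

    e.g. https://github.com/lodash/lodash ->
         https://raw.githubusercontent.com/lodash/lodash/master/README.md
    """
    owner_repo = repo_url.rstrip("/").split("github.com/")[1]
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch}/README.md"


def save_readmes(repo_urls, out_path="readmes.csv"):
    """Fetch each repo's root README.md and save the raw body to a CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["repo", "readme"])
        for url in repo_urls:
            resp = requests.get(raw_readme_url(url))
            if resp.status_code == 200:  # skip repos with no root README.md
                writer.writerow([url, resp.text])
```

Repos that keep their README elsewhere (or under a different default branch) are simply skipped, which matches the "make my life easier" constraint above.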
Before passing the text from README.md to a spell checker, I had to clean up the markdown a little bit. First, I converted the markdown to HTML. Then I removed the code block tags along with their contents. Next, I stripped all remaining HTML tags, leaving the inner text intact. Then I replaced all special (non-alphabetic) characters with whitespace, collapsed consecutive whitespace into single spaces, and split the text into words.
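The pipeline above goes through HTML; as a rough, regex-only sketch of the same cleanup (working on the markdown directly rather than converting to HTML first), it might look like this:

```python
import re


def extract_words(markdown_text):
    """Reduce a README's markdown body to a list of plain words."""
    # drop fenced code blocks with their contents
    text = re.sub(r"```.*?```", " ", markdown_text, flags=re.DOTALL)
    # drop inline code spans
    text = re.sub(r"`[^`]*`", " ", text)
    # strip any HTML tags, keeping the inner text
    text = re.sub(r"<[^>]+>", " ", text)
    # replace runs of non-alphabetic characters with a single space
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    # str.split() with no argument also collapses the whitespace
    return text.split()
```

For example, `extract_words("# Hi\n`x=1`\nworld <b>bold</b>")` keeps only the prose words and loses the code span and the tags.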
I then passed each word to a spell-checking library, Enchant. I did not look into Enchant’s specific spell-checker implementation, but it worked pretty well and was very easy to set up. There were a couple of problems, however. The misspelled words could be variable names, technical terms not in the checker’s dictionary, or simply proper names. I came up with some simple filtering rules to mitigate this problem.
# Word filtering, and usage frequency
For each misspelled word, I ask:
- Does it have more than one capital letter? If so, it’s probably a variable name.
- Does it have a capital letter that is not the first character? It’s probably a variable name too.
- Is it fewer than three characters long? This group has lots of possible false positives, so I just ignore it.
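The three rules above can be sketched as a single predicate (the function name is my own):

```python
def looks_like_false_positive(word):
    """Heuristics for flagged words that probably aren't real misspellings."""
    capitals = sum(1 for c in word if c.isupper())
    if capitals > 1:
        return True  # e.g. innerHTML: probably a variable name
    if any(c.isupper() for c in word[1:]):
        return True  # capital after the first character: also a variable name
    if len(word) < 3:
        return True  # too short, too many false positives
    return False
```

Anything that survives this filter moves on to the frequency analysis below.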
The rules were pretty helpful, but there were still a lot of false positives left. I tried to come up with better rules, but then another idea came to mind: frequently used words that are flagged as misspellings are probably not misspellings at all. The exact same misspelling of a word tends to be rare, whereas a systematic error from the spell checker on a word that is actually correct and widely used will show up again and again. Based on this idea, I calculated the usage count of each flagged word across all the repos I scraped and ranked them. I also kept a list of the repos where each word is used. For example, ‘additionaly’ and ‘aformentioned’ were each used once, ‘stringify’ was used 68 times, and ‘html’ was used 511 times. Clearly the latter two words were not misspellings.
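The counting and ranking step is straightforward with the standard library (a sketch; the input shape, a dict mapping each repo to its list of flagged words, is an assumption):

```python
from collections import Counter, defaultdict


def rank_by_usage(flagged_per_repo):
    """Count how often each flagged word appears across all repos,
    and remember which repos used it.

    flagged_per_repo: dict mapping repo name -> list of flagged words.
    """
    counts = Counter()
    repos_using = defaultdict(set)
    for repo, words in flagged_per_repo.items():
        for w in words:
            counts[w] += 1
            repos_using[w].add(repo)
    # most common first: widely used "misspellings" are probably real words
    return counts.most_common(), repos_using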
The graph below shows each word’s rank by usage count on the x-axis, and the word’s usage count on the y-axis.
The next figure shows the histogram of word usage count.
As you can see, most of the misspelled words are only used a couple of times throughout. So I decided to lower the bin size and look closer at the single-occurrence end.
Now we see that almost 1,000 words labeled as misspelled appear only once throughout the scrape. This is a great starting point, and from here I hand-check each misspelled word to be sure. Here’s a portion of the sorted output table.
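Pulling out those single-occurrence words for hand-checking is one more small helper (a sketch; `counts` maps each flagged word to its usage count, and `repos_using` maps it to the set of repos where it appears, as described above):

```python
def hand_check_candidates(counts, repos_using):
    """Words flagged exactly once across the whole scrape: the best
    candidates for a genuine, fixable misspelling.

    Returns (word, repo) pairs sorted by word, ready for a human pass.
    """
    return sorted(
        (word, next(iter(repos_using[word])))
        for word, n in counts.items()
        if n == 1
    )
```

Each pair points a human reviewer straight at the one repo whose README needs the fix.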