Gordon P. Hemsley

Linguist by day. Web developer by night.

Archive for the ‘Open Source’ Category

Posts about open source software.

Linguistics and the Open Web

Posted by Gordon P. Hemsley on May 15, 2011

Last Saturday, I gave a talk entitled Linguistics and the Open Web at HULLS 2011, the first-annual Hunter [College] Undergraduate Linguistics and Language Studies Conference, organized by the Hunter College Linguistics Club. My talk kicked off the event, at which undergraduate students presented their research. (I later joked that it perhaps should have been called HULA, for “Hunter Undergraduate Linguistics and Activism”, because many of the student talks—including my own—were more of a call to action than particularly academic research… but I digress.) The keynote speakers were Doug Bigham and Ben Zimmer. Click through to the Linguistics Club site to see the full program of talks.

This talk was the first step in my attempt to somehow tie together my open-source/Mozilla life with the linguistics that I have really come to love over these past two years. I think it’s a good first step.

And it was nice to finally meet in person a lot of the people that I had previously known only through Twitter. And, actually, it’s good to know that my rate of meeting such people is increasing. I’ve only known most of them for about a year, as opposed to the 7 or so years it took to meet the Mozilla and phpBB folks last year.

My talk was only allotted 15 minutes, so it’s rather brief, but I think the slides I’ve made available get across much of what my talk did. It tries to answer these three questions:

  • What is the Open Web?
  • How does the Open Web relate to linguistics?
  • What can I do to participate in the Open Web?

I hope to be able to expand and improve this talk in the future. (It already includes a separate print stylesheet, so if you want to print it out, it comes out pretty.)

In fact, I’ve released it under CC-BY-NC-SA and I plan to put it on GitHub or something so that maybe we could even get it translated into a bunch of languages! And if you want to give the talk yourself, feel free. (Just drop me a line to let me know.)

Your feedback is greatly appreciated!

Posted in Linguistics, Mozilla, Open Source

From Universal Subtitles to formant graphs

Posted by Gordon P. Hemsley on March 11, 2011

A few weeks ago, I watched Watson win on Jeopardy! and did a couple of Skype interviews for graduate school. It really got me thinking about machine learning and natural language processing, as I’ve been looking for a way to tie in my web development and programming skills with linguistics in a way that will benefit both fields, as well as “regular” people.

After coming up with and discarding a number of ideas, a shower thinking session led me to a realization: Universal Subtitles is, in effect, building up a corpus of matched video, audio, transcription, and translation! And, because it’s all open, you can remix and play with the data all you want. So, to do just that, I selected the most translated video (which also happened to be the Universal Subtitles introductory video) and began exploring. At first, I looked at a couple of the translations, thinking about doing something with automatic, statistical translation (which is pretty much what Google Translate does, I believe). But I have very little knowledge of that area, so I hit upon another idea: Extracting the text and the audio and matching it up in Praat—automatically.

To do this, I needed the timing information along with the text, and I needed to separate the audio from the video. Doing the latter wound up being the easy part, once I found OggSplit (a part of Ogg Video Tools). I just ran the open Ogg Video file through OggSplit and used Audacity to convert the resulting Ogg Audio file into a WAVE file for Praat to read. But the next part required some work. I remembered from a while back that Hixie and the WHATWG were working on an open subtitling standard, called WebSRT. When I went looking for that, I found that it had been renamed to WebVTT. Universal Subtitles exports its subtitles into many formats, but not yet to WebVTT. Luckily, the renaming of the WebSRT standard to WebVTT was mostly to avoid having to overcome some rather obscure processing issues; the majority of SRT files can become WebVTT files with very little effort. (Namely, adding a “WEBVTT” header and converting commas to periods in the timestamps.) I then wrote a script which read the newly-minted WebVTT files and converted them to Praat TextGrids. (This involved “reverse-engineering” the format from TextGrids output by Praat, as I couldn’t find any documentation on the file format itself.)
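As a rough illustration of the SRT-to-WebVTT step described above, here is a minimal Python sketch. It is not the PHP script mentioned in this post; the function name `srt_to_webvtt` and the sample cue are made up for the example, and it only handles the two changes named above (the header line and the comma-to-period swap in timestamps).

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT (header plus timestamp fix only)."""
    out = ["WEBVTT", ""]  # the file signature, followed by a blank line
    for line in srt_text.splitlines():
        match = TIMING.match(line.strip())
        if match:
            # Only timing lines are rewritten, so commas in cue text survive.
            start = f"{match.group(1)}.{match.group(2)}"
            end = f"{match.group(3)}.{match.group(4)}"
            line = f"{start} --> {end}{match.group(5)}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:01,500 --> 00:00:04,000\nHello, world\n"
    print(srt_to_webvtt(sample))
```

Restricting the replacement to timing lines is the one design choice worth noting: a blanket comma-to-period replacement would also mangle the subtitle text itself.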

By the end of this, I had PHP code that could convert unformatted WebVTT files into single-tier TextGrids. I plan to open-source the code, but I can’t decide on a few things: (1) what to call it, (2) where to put it (SourceForge or GitHub), and (3) whether to separate the two out into different projects. (Naturally, the WebVTT parser and the TextGrid generator are in separate modules, but there remain some assumption-based dependencies between them.)

Pairing the WAVE file and the TextGrid revealed some areas for improvement of both Praat and Universal Subtitles. As is inherent in the whole idea of subtitles, dead spaces in the audio (parts without speech) do not get subtitled; thus, the subsequently-generated TextGrid does not fully specify the sections of the tier. Praat doesn’t really like that, as a regular TextGrid specifies even the empty portions of the tier, having positive data from start to finish. Loading a TextGrid that has pieces missing does not make Praat happy, but it still works. (You can practically see the grimace on its face as it sucks it up and makes do.) On the flipside, when matching audio and subtitles up in Praat, you can see how imprecise crowd-sourced subtitling can be. Though it might seem OK when using the web interface to subtitle a video as it plays, when you see the text lined up next to the waveform, it becomes clear that things could be better aligned. (It wouldn’t be that big of a deal, but there doesn’t seem to be any “advanced” interface on Universal Subtitles. There’s not even a place to upload an existing subtitle format, so that you can make corrections locally and upload the new file back to Universal Subtitles.)
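One way to give Praat a fully specified tier is to pad the silent stretches with empty intervals before writing the TextGrid. The following Python sketch shows that idea only; `fill_gaps`, the `Interval` tuple layout, and the sample timings are assumptions for illustration, and overlapping cues are not handled.

```python
from typing import List, Tuple

# (start in seconds, end in seconds, text); a hypothetical layout for this sketch.
Interval = Tuple[float, float, str]


def fill_gaps(cues: List[Interval], total_duration: float) -> List[Interval]:
    """Return intervals covering 0..total_duration, with "" in unsubtitled gaps."""
    filled: List[Interval] = []
    cursor = 0.0  # end of the last interval emitted so far
    for start, end, text in sorted(cues):
        if start > cursor:
            # Silence between the previous cue and this one.
            filled.append((cursor, start, ""))
        filled.append((start, end, text))
        cursor = max(cursor, end)
    if cursor < total_duration:
        # Trailing silence after the last cue.
        filled.append((cursor, total_duration, ""))
    return filled


if __name__ == "__main__":
    for interval in fill_gaps([(1.0, 2.0, "hi"), (3.5, 4.0, "there")], 5.0):
        print(interval)
```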

Once I had the text and audio lined up, I started extracting formant info from multiple instances of the word “video”. (It appears a total of 11 times in the introductory video, including once in the compound “videomakers” and once in the plural “videos”.) I later found out about AntConc, and the concept of concordance in general, which helped to identify other words and combinations of words that appear multiple times in the full text of the transcription. (AntConc contains an N-gram viewer, as well.) Long story short, I recorded the formants of the 11 instances of the word “video” and made them available in this Google Docs spreadsheet.
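The kind of counting that a concordance tool automates can be approximated in a few lines of Python. This is only a sketch of the general idea, not how AntConc works internally; the helper names and the sample sentence are invented for the example and are not taken from the actual transcript.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def forms_containing(tokens: List[str], stem: str) -> Counter:
    """Count each distinct token that contains the given stem."""
    return Counter(token for token in tokens if stem in token)


if __name__ == "__main__":
    # Invented sample text, used only to exercise the functions.
    text = "Video is everywhere. Videos need subtitles, and videomakers need video tools."
    tokens = tokenize(text)
    print(forms_containing(tokens, "video"))
    print(ngram_counts(tokens, 2).most_common(3))
```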

And that’s where the fun started. As you can see in that spreadsheet, I attempted to graph these data points so that I could look at them all together.
[Figure: Formant Frequencies of "video"]
Unfortunately, all of the spreadsheet programs I tried (basically everything besides Excel, because I don’t have that) could not graph the data points on top of each other, so that one could fully see the similarities and differences among the formant frequencies. I thought I was going to have to settle for that, but then Michael Newman reminded me that R Can Do Anything™. So I finally had my task to help me learn R.

After a great deal of fiddling around and reading the docs, to not much avail, I headed over to see if the folks in #R could help. It turns out they were more than willing to demonstrate how awesome R is; I am particularly indebted to mrflick, who answered all of my silly questions, one after another. (And, more importantly, and to my amazement, had an answer for all my questions, proving that R really can do anything.) With some more playing around, I was finally able to issue the right commands in the right order and come up with this:
[Figure: Formant Frequencies of "video" (all)]
To construct that graph, I still had to somewhat fudge the data by adding an artificial “meter” column so that each instance of the word turned out to be the same width on the graph, but at least R didn’t complain that I was Doing It Wrong™. I also excluded the two instances where the word wasn’t plain ol’ “video”, to avoid creating misleading patterns. I haven’t yet figured out how to automatically calculate the average wave/frequency patterns, but at least now the graph allows you to see them visually. (I suspect I would have to format my data differently to give R enough information to do that properly—right now it doesn’t know that the data is from nine separate instances of the word.)
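One possible approach to the averaging problem is to resample each instance onto the same number of evenly spaced points (the role the artificial “meter” column plays in the graph) and then average point by point. The Python sketch below illustrates that idea under stated assumptions: it is not the R script from this post, the function names are hypothetical, each track is assumed to be non-empty, `points` is assumed to be at least 2, and the numbers are invented rather than taken from the spreadsheet.

```python
from typing import List, Sequence


def resample(track: Sequence[float], points: int) -> List[float]:
    """Linearly interpolate a track onto `points` evenly spaced positions."""
    if len(track) == 1:
        return [float(track[0])] * points
    out = []
    for i in range(points):
        # Position of output point i, measured in input-sample units.
        pos = i * (len(track) - 1) / (points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def average_tracks(tracks: Sequence[Sequence[float]], points: int = 5) -> List[float]:
    """Average several tracks of different lengths on a shared time scale."""
    resampled = [resample(track, points) for track in tracks]
    return [sum(column) / len(column) for column in zip(*resampled)]


if __name__ == "__main__":
    # Invented frequency values in Hz, one list per instance of a word.
    print(average_tracks([[100, 200], [300, 300, 300]], points=3))
```

Keeping one list per instance is what tells the code which measurements belong together, which is the information the post suspects R was missing.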

To top things off, the R script that I used to generate the data is available as a gist on GitHub (released under CC-BY-SA), so feel free to remix the data or improve the script. Just be sure to leave me a comment here telling me what you did! Also, any ideas on where to go from here are welcome!

Posted in Linguistics, Mozilla, Open Source, SourceForge, Web Development

Catching up with myself

Posted by Gordon P. Hemsley on January 23, 2011

Oh, hello Internet. Long time, no see. (That is, if this is the only way you keep track of me. I’ve been tweeting a bit more than I blog.)

This post is basically to bring you up to speed on what’s been going on since my last post, back in July. (I never was a very good blogger, you know. This is actually pretty good for me.)

A lot has happened since then, actually.

First off, I’m no longer working with the Bespin folks; I’m not sure I ever mentioned that. Though I felt a bit guilty about it, I made the decision around the time of the Summit, and I wound up not spending a whole lot of time with them while I was there. (I was running out of things I could help with, anyway, with my JavaScript skills being as poor as they are.) During the Summit, it was announced that Bespin would be changing its name to Skywriter. It was a bit of an insider secret until it was officially announced a few months later, but that doesn’t even matter now. Mozilla decided to change direction slightly and focus more on developer tools as a whole. This decision eventually led up to what happened just the other day: Skywriter has merged with the Ajax.org Cloud9 Editor (ACE). This is the best of both worlds, as it puts the project in the hands of developers better equipped to take care of it, while also ensuring that the original Bespin/Skywriter work does not go to waste.

I also haven’t been much involved with Ubiquity since the release of 0.6. I do believe satyr continues to maintain it, but I don’t know if it will ever see another “official” release. (Satyr has always made snapshot releases directly from the repository, though.) It also doesn’t seem like Taskfox will emerge any time soon. It’s certainly not on the agenda (nobody’s working on it), and the new Panorama (formerly TabCandy) is the primary focus of Mitcho, Aza, and others. If all goes according to plan, that will likely be my favorite feature of Firefox 4. (Of course, by the time Firefox 4 comes out, I’ll probably be using Firefox 4.next. I’ve been running 4.0 nightlies for a while now. Probably ever since TabCandy was merged to trunk, now that I think about it.) So I spend some of my days bothering the folks in #tabcandy, complaining about things they usually already know about.

But I do try to make myself useful, too. I’ve attempted to increase my involvement with the Mozilla.org team, as at least there I have the relevant skillset. Unfortunately, it’s been somewhat slow-going. I spent a lot of time at the Summit chasing Reed around trying to get reviews. But Reed is always super busy—thus, I’m still waiting on those reviews. (And I’m not the only one.) So I’ve offered to try to help carry some of the load, in terms of reviewing patches for the Mozilla.org website(s). So, I finally applied for (albeit very limited) commit access—some 6 and a half years since I filed my first Mozilla-related bug. I faxed my Committer Agreement in about a week ago, and hopefully the rest will be handled in the next week or so. I’m quite excited to be able to make a contribution that’s more than removing unused variables or adding half-working tab support.

But my life, unfortunately, has not completely revolved around Mozilla in this past half a year. I finished another semester of school, and the final semester of my undergraduate career (well, the first one, at least) begins on the 31st. On June 2nd, I will finally have a Bachelor’s Degree—in Linguistics. What happens after that, I’m not sure. These past two months have been hectic, as I’ve been applying to graduate schools for linguistics. Though I continue to be torn as to whether I really want to spend the next five years doing more linguistics (what does one do with a Ph.D. in linguistics, besides more linguistics?), my biggest annoyance thus far has been the cost. Between the application fees, the GRE score fees, and transcript fees(!), this process has cost me hundreds and hundreds of dollars! (Oh, and for a procrastinator like me, having to rely on—and worry about—other people’s schedules has been very difficult. There’s no turning an application in the night before if you also need recommendation letters from three other people.)

On the bright side, I have been gathering a lot of linguistics-related ideas that I want to blog about. I haven’t yet figured out how I’m going to do that—some of them are not more than a couple of sentences, so I may spew a bunch out at a time. I’ve also gotten involved with a new project designed to bring linguistics to the masses, à la Scientific American or Popular Mechanics: Popular Linguistics Online. I’ll be writing some things for them, as well as helping out with some of the technical stuff behind the scenes. Everything is very much in the early stages over there, but there is an issue out already, so I encourage you to check it out!

P.S. Please forgive the overuse of the word “so”. It’s 4:30 in the morning.

Posted in Linguistics, Mozilla

Ubiquity 0.6 Released!

Posted by Gordon P. Hemsley on July 21, 2010

About a week and a half ago, the Ubiquity team (I’m the one in red) had a little meeting at the 2010 Mozilla Summit and we discussed the past, present, and future of Ubiquity.

One of the main goals of this meeting, in my mind, was to get a new release of Ubiquity out, so that the greater masses could be exposed to all the wonderful work satyr has been doing over the past many months. After being reprimanded by the hotel staff no less than twice, we finally were able to get down and discuss the logistics of that. We had a couple of issues to deal with. For one, there were still a number of users on the 0.1.x branch of Ubiquity, despite the 0.5.x branch being available for quite a while, and the reasons for this included a lot of backwards compatibility issues: the 0.5.x branch used a new parser that could break some of the older, 0.1.x commands; the 0.5.x branch didn’t properly support Firefox 3.6; etc. And there was also the issue of the 0.5.x branch never being released on AMO, leaving many users unaware that it even existed.

So, originally, the idea was to release satyr’s code as 0.5.5—simply an extension of the 0.5.x branch. However, a number of people on the team felt it best to bump the version up to 0.6, and I didn’t disagree, given the aforementioned issues. And since we were attempting to clean the slate as best as possible with regard to backwards compatibility, I also suggested that we bump the minVersion up to Firefox 3.6 for Ubiquity 0.6, from 3.5 (which probably still works). This had the added benefit of allowing people stuck on Firefox 3.5 to keep plodding happily along with the 0.1.x branch (which has now—finally—been discontinued).

Before I continue, let me just point you to Ubiquity 0.6 on AMO so that you can download it if you don’t already have it.

If you allow me to briefly jump ahead a bit, it was soon discovered that Jono (who had access to the AMO account, and who was charged with packaging the release) could not remember his Hg password. I don’t know if that has since been rectified, but the bottom line was that all the changes he made in order to package up the release could not be committed to the Ubiquity repository. So that left the repo and the released 0.6 package as differing from each other. (I think satyr has mostly restored those changes to the repo, but that was only within the past few days.) So releasing Ubiquity 0.6 was quite the event—and I haven’t even mentioned the fact that we completely sprung it on satyr! (I’d told him a few weeks earlier that I was gunning for it to happen, but he had no idea the meeting was even going down.)

Now back to the meeting, where we also discussed the future of Ubiquity. One of the foremost targets, I think, would be rewriting Ubiquity as a JetPack (or at least with a JetPack wrapper). That would allow much more uniformity across the Ubiquity codebase, as well as giving Ubiquity access to all JetPack has to offer. Mitcho and cers attempted to take the first step towards that goal (that being the wrapper) during the JetPack Hack-A-Thon at the Summit, but ran out of time. So that work still needs to be done.

At the meeting, we also discussed resurrecting the effort to make Ubiquity more ubiquitous (Aza’s pun) by getting it incorporated into Firefox as Taskfox. I don’t recall what the first steps for getting that done are, but I think it’d be a worthy task.

So, at the end of the meeting, I (and, I think, the others) came out seeing the future of Ubiquity as brighter than we previously thought. All we need now are some brilliant, dedicated developers to make it happen. Unfortunately, many of said developers are spending their time with more high-priority tasks: Jono is working on Test Pilot; Mitcho and Aza are working on TabCandy; Atul and cers are working on JetPack. And these are all extremely worthy tasks. But if you want to help out with Ubiquity, don’t hesitate to drop by the #ubiquity channel on the Mozilla IRC server!

Posted in Linguistics, Mozilla

Do you use Ubiquity?

Posted by Gordon P. Hemsley on June 16, 2010

As you may or may not know, Ubiquity is officially “on hiatus”. That means that the official Mozilla Labs team is not currently working on it. Unfortunately, when they made that decision, the latest released version of Ubiquity (0.5.4) was not compatible with Firefox 3.6.

Luckily, community member Satyr Murky (satyr) decided to keep maintaining Ubiquity (all alone!) and was able to bring it to a state where it works in Firefox 3.6 and even the latest trunk builds off mozilla-central (mostly). Satyr also fixed a number of bugs that were present, beyond support for the latest versions of Firefox. Unfortunately, none of Satyr’s fixes have been released officially: Ubiquity has been wallowing in dev-only land in an Hg repository, downloadable only from a BitBucket attachment.

But now Ubiquity 0.5.5 is just about ready (see bug 528417), and I’d like to see it get released. Who’s with me?

Do you use Ubiquity? Which version? (The older 0.1.x line works fine on Firefox 3.6—did you downgrade your Ubiquity?) Did you know about the developmental version? (Your add-on updater didn’t tell you about it, after all.) Or were you too scared to install it? Let me know in the comments.

Posted in Mozilla

Calling all HTML5 and Bugzilla enthusiasts!

Posted by Gordon P. Hemsley on February 20, 2010

Earlier in the week, I went through the process of filing and fixing bugs 546338 and 546340, both related to fixing <a name> problems in Bugzilla. Once that was successful, I got the idea to do a major overhaul of the Bugzilla templates in order to upgrade them from HTML4 code to HTML5 code (sans presentational markup, which Bugzilla has a ton of). I’ve filed bugs 546838, 547171, 546353, 547311, and 547389 for this purpose.

After spending a few days attempting to accomplish something, under the very helpful and reassuring guidance of Max Kanat-Alexander, I realized that it was a bit much for one person to take on. The sheer number of instances of presentational markup (and I only got so far as looking at @align, @cellspacing, and @cellpadding) is quite overwhelming.
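To get a feel for the scale of a cleanup like this, a rough count of the three attributes mentioned above can be scripted. The Python sketch below is only an illustration: the function name and sample markup are invented, it is not part of Bugzilla, and a regular expression is a blunt instrument that will miscount in some edge cases (for example, it would also match a custom attribute such as `data-align`).

```python
import re
from collections import Counter

# The three presentational attributes named above; `valign` is deliberately
# not matched because \b requires a word boundary before the attribute name.
PRESENTATIONAL = re.compile(r"\b(align|cellspacing|cellpadding)\s*=", re.IGNORECASE)


def count_presentational(markup: str) -> Counter:
    """Count occurrences of each presentational attribute in a template string."""
    return Counter(m.group(1).lower() for m in PRESENTATIONAL.finditer(markup))


if __name__ == "__main__":
    # Invented sample markup, not taken from a real Bugzilla template.
    sample = '<table cellspacing="0" cellpadding="4"><tr><td align="left" valign="top">x</td></tr></table>'
    print(count_presentational(sample))
```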

But then I thought: This would be a perfect series of bugs for ‘student-project’; that is, the keyword used to attract open source students to specific bugs that they can tackle during a semester. If we can get a group of students together, along with myself and Max, we can probably accomplish this much more quickly.

If you’re interested in helping out, or you know a student who may fit that description, drop by #mozwebtools on irc.mozilla.org and ping GPHemsley or mkanat.

Posted in Mozilla, Web Development

PHP, MySQL, and the BIT field type

Posted by Gordon P. Hemsley on February 8, 2010

As Dave Humphrey once taught me:

When you do a search, and it comes back with no results, it’s a sign that you need to write something.

This is an issue that I came across while testing SASHA (which is available for you to try out, by the way), and I didn’t know if it was a bug or a feature. I could find no mention of it anywhere, and the people in the #mysql IRC channel on Freenode weren’t especially helpful in getting to the bottom of it.

What is the issue, you ask? Well, even that in and of itself is a question, because I don’t know whether it’s a bug (or feature) in PHP or MySQL. However, I’m inclined to think it’s the latter, and I’ll get to why in a moment.

But first, some background. The table that SASHA uses to store schedules uses the BIT field type for keeping track of which days of the week a schedule occurs on. I figured it’d be easiest to use a 7-bit field and just flip a bit for each day of the week. And that worked fine for me on my local test server. But then I had a colleague test SASHA out on his test server, and things went a little wacky.
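The “flip a bit for each day” idea can be sketched as follows. This is a Python illustration rather than SASHA’s PHP, and the bit order (Sunday as the lowest bit) and the function names are assumptions made for the example; the post does not say which order SASHA uses.

```python
from typing import Iterable, List

# Assumed bit order for this sketch: Sunday is bit 0, Saturday is bit 6.
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def encode_days(days: Iterable[str]) -> int:
    """Pack a collection of day names into a 7-bit integer mask."""
    mask = 0
    for day in days:
        mask |= 1 << DAYS.index(day)
    return mask


def decode_days(mask: int) -> List[str]:
    """Unpack a 7-bit integer mask into the list of day names it contains."""
    return [day for i, day in enumerate(DAYS) if mask & (1 << i)]


if __name__ == "__main__":
    mask = encode_days(["Mon", "Wed", "Fri"])
    print(mask, format(mask, "07b"), decode_days(mask))
```

Under this layout, a Monday/Wednesday/Friday schedule is the integer 42. That same value could reach the application either as a single raw byte or as the decimal text “42”, which is the discrepancy the rest of this post deals with.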

It took a little while to figure out what was causing our problem, and we finally got to the bottom of it: I was using MySQL 5.0 and he was using MySQL 5.1! Apparently, between 5.0 and 5.1, the return format of a BIT field changed from the literal binary data (output in the browser as a character, because the browser didn’t know it wasn’t) to a decimal representation of that data.

The first problem with that was that I had no idea there was a possibility of getting anything but the raw binary data I was getting on my server. The second problem was coming up with a straightforward way to detect whether the database was feeding us raw binary data or converted decimal data. There was no direct way to do this, but I figured out the next best thing. A simple way to check what kind of data we’re getting is to find out whether it converts cleanly to an actual character. Here’s an excerpt from SASHA that demonstrates:

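// MySQL 5.0 returns bit as binary, while MySQL 5.1 returns decimal.
// A raw one-byte value survives the ord()/chr() round trip unchanged; a
// multi-digit decimal string such as "65" does not, because ord() only
// looks at the first character.
// Caveat: a single-digit decimal string ("0" through "9") also survives the
// round trip, so it is reported as binary.
if( $days == chr( ord( $days ) ) )
{
	$input = 'binary';
}
else
{
	$input = 'decimal';
}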

That seems to do the trick when it comes to handling unpredictable BIT field data.

(Again, I can’t guarantee that this isn’t actually a PHP issue, but I seem to recall us both being around the same version of PHP.)

If you have any insight into the matter, please do leave a comment.

Posted in Mozilla, Open Source, SourceForge

GPHemsley.org

Posted by Gordon P. Hemsley on October 13, 2009

I know it’s been a while since I last wrote something here, but I wanted to pop my head in to make a long-overdue announcement. I’ve finally gotten myself an official, centralized place on the Internet: the aptly-named GPHemsley.org. (The .org part means that all donations are accepted—just don’t expect them to be tax deductible.)

I still haven’t gotten it to the point where it contains everything you might want to know about me, but my goal is to eventually make it a one-stop shop for everything I’ve ever done on the Internet. Ever. Right now, though, it just has a list of my blogs and notable papers I’ve written in my college (i.e. adult) career.

I felt this announcement was especially important to make now because there are two linguistics-related blogs writing posts about topics I’ve brought up, and I wouldn’t want to poop the party and have you find out about my new website from them. Perhaps I’ll have made more progress on my website by a week tomorrow.

Posted in Linguistics, Open Source, Web Development

SVN Support in Bespin

Posted by Gordon P. Hemsley on August 10, 2009

A couple of weeks ago, I kicked off the addition of SVN support to Bespin (bug 493038). This required two things: One was the actual ability to choose which VCS you’re using, as it defaulted to Hg and the auto-detection was primitive and long since nonfunctional. (There were rumors that it had even been missing from the code for a while already.) But that was the relatively easy part, as it was mostly just manipulating HTML.

The (relatively) harder part was writing the code that would do the actual work with SVN. (VCS support in Bespin is powered by UVC.) A factor in this difficulty was that the backend code is written in Python, which I’m not especially familiar with. Nevertheless, the process was actually simplified by the way things are set up, because I was able to copy the Hg code and just modify it to fit the SVN commands. I was able to add basic checkout, commit, and update support, as well as username/password authentication. Kevin later came in and finished up the push/commit differentiation, among other things. I believe SSH support still needs to be done, but we’re looking for a method to use to do it. (Kevin has suggested using environment variables, as SVN does not have the ability to pass SSH details via command line, like Hg does.)
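To make the shape of that work concrete, here is a minimal Python sketch of assembling `svn` command lines with optional username/password authentication. It is not UVC’s actual code or API; `svn_command` is a hypothetical helper and the URL and credentials are placeholders. The `--username`, `--password`, and `--non-interactive` options are standard `svn` client options.

```python
from typing import List, Optional, Sequence


def svn_command(
    action: str,
    args: Sequence[str] = (),
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the svn client (hypothetical helper)."""
    # --non-interactive keeps svn from prompting when run by a server process.
    cmd = ["svn", action, "--non-interactive"]
    if username is not None:
        cmd += ["--username", username]
    if password is not None:
        cmd += ["--password", password]
    return cmd + list(args)


if __name__ == "__main__":
    # Placeholder repository URL and credentials, for illustration only.
    print(svn_command("checkout", ["https://example.org/repo/trunk", "project"],
                      username="alice", password="secret"))
    print(svn_command("commit", ["-m", "Fix typo"]))
    print(svn_command("update"))
```

The resulting list could be handed to something like `subprocess.run`. Building a list rather than a single shell string avoids quoting problems with commit messages.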

Kevin and the other Bespin folks are in the process of getting the 0.4.0 release out the door today or in the next couple of days, and that will include this support for SVN, as well as collaboration.

Posted in Mozilla, Open Source

SASHA 0.1.0-RC1 Released

Posted by Gordon P. Hemsley on July 12, 2009

It’s been a while since I’ve updated here, but I’d like to take the opportunity now to announce the first ever release of the Student Assignment, Scheduling, and Homework Assistant. SASHA 0.1.0-RC1 was released a few weeks ago for testing purposes, before 0.1.0 final is released.

Here is the official announcement:

This marks the first ever release of the Student Assignment, Scheduling, and Homework Assistant. SASHA is a tool designed to help students keep track of their assignments, tests, and other time-sensitive items. This release candidate includes features related to courses in educational institutions, including schedule items, assignments, and tests.

This release candidate is also a call for testers to help ensure that the release is relatively bug-free, and in a usable state. If no major defects are found, then the 0.1.0-RC1 code will be re-released as 0.1.0; else, another release candidate will be forthcoming.

Since SASHA requires institution-specific packages to operate efficiently, this release is mostly geared towards students at the University of Vermont (UVM), as that is the only package that is currently available. Packages must be downloaded separately.

SASHA Website: http://sasha.sourceforge.net/
SASHA Project Page: https://sourceforge.net/projects/sasha/
SASHA 0.1.0-RC1 Files: https://sourceforge.net/project/showfiles.php?group_id=163392&release_id=632770
SASHA 0.1.0-RC1 Release Notes: https://sourceforge.net/project/shownotes.php?group_id=163392&release_id=632770
SASHA Institution Packages: https://sourceforge.net/projects/sasha-pkg/

(This announcement was also submitted to the Slashdot Firehose.)

I’m currently looking for testers to give SASHA a beating to ensure that I haven’t missed any bugs. I’ve been using everything personally (i.e. dogfooding it) since I began writing it, but that also means that I may be too close to it to find the kinks. You may attempt to set up your own instance of SASHA, if you’d like, but be sure to read the instructions set forth in docs/INSTALL. You’ll be lost otherwise.

If you’re worried about setting SASHA up yourself, but are willing to try to help anyway, I’ve also set up a demo at YourSASHA.com for testing (and demonstration) purposes. If you have any comments, questions, suggestions, or any other feedback at all, please do not hesitate to join me in #SASHA-dev on Freenode. I’ll help you out.

Please forgive me for tagging this with Mozilla. While it’s not directly Mozilla-related, it is education-related, so I hope you’ll let it slide. More importantly, I hope SASHA will be able to help you at some point in the future.

Posted in Mozilla, Open Source, SourceForge