Gordon P. Hemsley

Linguist by day. Web developer by night.

Archive for the ‘SourceForge’ Category

Posts related to SourceForge projects.

From Universal Subtitles to formant graphs

Posted by Gordon P. Hemsley on March 11, 2011

A few weeks ago, I watched Watson win on Jeopardy! and did a couple of Skype interviews for graduate school. It really got me thinking about machine learning and natural language processing, as I’ve been looking for a way to tie in my web development and programming skills with linguistics in a way that will benefit both fields, as well as “regular” people.

After coming up with and discarding a number of ideas, a shower thinking session led me to a realization: Universal Subtitles is, in effect, building up a corpus of matched video, audio, transcription, and translation! And, because it’s all open, you can remix and play with the data all you want. So, to do just that, I selected the most translated video (which also happened to be the Universal Subtitles introductory video) and began exploring. At first, I looked at a couple of the translations, thinking about doing something with automatic, statistical translation (which is pretty much what Google Translate does, I believe). But I have very little knowledge of that area, so I hit upon another idea: Extracting the text and the audio and matching it up in Praat—automatically.

To do this, I needed the timing information along with the text, and I needed to separate the audio from the video. Doing the latter wound up being the easy part, once I found OggSplit (a part of Ogg Video Tools). I just ran the open Ogg Video file through OggSplit and used Audacity to convert the resulting Ogg Audio file into a WAVE file for Praat to read. But the next part required some work. I remembered from a while back that Hixie and the WHATWG were working on an open subtitling standard, called WebSRT. When I went looking for that, I found that it had been renamed to WebVTT. Universal Subtitles exports its subtitles into many formats, but not yet to WebVTT. Luckily, the renaming of the WebSRT standard to WebVTT was mostly to avoid having to overcome some rather obscure processing issues; the majority of SRT files can become WebVTT files with very little effort. (Namely, adding a “WEBVTT” header and converting commas to periods in the timestamps.) I then wrote a script which read the newly minted WebVTT files and converted them to Praat TextGrids. (This involved “reverse-engineering” the format from TextGrids output by Praat, as I couldn’t find any documentation on the file format itself.)
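For the curious, the SRT-to-WebVTT step can be sketched in a few lines of PHP. This is a minimal illustration rather than the script described in this post: the function name `srtToWebVtt` is made up for the example, and it assumes a plain SRT file with timing lines of the form `00:00:01,500 --> 00:00:04,000`.

```php
<?php
// Hypothetical helper (not the code described in the post): convert SRT text
// to WebVTT using only a header line and a comma-to-period swap.
function srtToWebVtt(string $srt): string
{
    // Normalize line endings so the pattern below sees one line at a time.
    $srt = str_replace(["\r\n", "\r"], "\n", $srt);

    // Swap the comma for a period, but only inside timing lines, so that
    // commas in the subtitle text itself are left alone.
    $body = preg_replace(
        '/^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})/m',
        '$1.$2 --> $3.$4',
        $srt
    );

    // The file signature, followed by a blank line, goes first.
    return "WEBVTT\n\n" . ltrim($body);
}

$srt = "1\n00:00:01,500 --> 00:00:04,000\nHello, world!\n";
echo srtToWebVtt($srt);
```

The numeric SRT counters are left in place, since WebVTT accepts an identifier line before each cue's timing line. Files that use SRT-specific markup in the text would need more care than this sketch takes.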

By the end of this, I had PHP code that could convert unformatted WebVTT files into single-tier TextGrids. I plan to open-source the code, but I can’t decide on a few things: (1) what to call it, (2) where to put it (SourceForge or GitHub), and (3) whether to separate the two out into different projects. (Naturally, the WebVTT parser and the TextGrid generator are in separate modules, but there remain some assumption-based dependencies between them.)
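To give a rough idea of what such a conversion involves, here is a minimal sketch of a cue parser and a TextGrid writer. It is not the code described above: the function names, the tier name `subtitles`, and the need to pass in the audio duration are all assumptions made for the example, and the output follows the long text format that Praat itself writes for a single IntervalTier.

```php
<?php
// Hypothetical sketch: WebVTT cues -> one Praat IntervalTier.

// Convert "HH:MM:SS.mmm" (or "MM:SS.mmm") to seconds.
function vttTimeToSeconds(string $stamp): float
{
    $seconds = 0.0;
    foreach (array_map('floatval', explode(':', $stamp)) as $part) {
        $seconds = $seconds * 60 + $part;
    }
    return $seconds;
}

// Extract [start, end, text] triples from unformatted WebVTT text.
function parseVttCues(string $vtt): array
{
    $cues = [];
    $vtt = str_replace(["\r\n", "\r"], "\n", $vtt);
    foreach (preg_split('/\n{2,}/', trim($vtt)) as $block) {
        $lines = explode("\n", $block);
        foreach ($lines as $i => $line) {
            // Blocks without a timing line (such as the header) are skipped.
            if (preg_match('/^([\d:.]+) --> ([\d:.]+)/', $line, $m)) {
                $cues[] = [
                    vttTimeToSeconds($m[1]),
                    vttTimeToSeconds($m[2]),
                    implode(' ', array_slice($lines, $i + 1)),
                ];
                break;
            }
        }
    }
    return $cues;
}

// Render the cues as a single-tier TextGrid in Praat's long text format.
function cuesToTextGrid(array $cues, float $duration, string $tier = 'subtitles'): string
{
    $out = "File type = \"ooTextFile\"\nObject class = \"TextGrid\"\n\n"
         . "xmin = 0\nxmax = $duration\ntiers? <exists>\nsize = 1\nitem []:\n"
         . "    item [1]:\n        class = \"IntervalTier\"\n        name = \"$tier\"\n"
         . "        xmin = 0\n        xmax = $duration\n"
         . "        intervals: size = " . count($cues) . "\n";
    foreach ($cues as $i => [$start, $end, $text]) {
        $text = str_replace('"', '""', $text); // Praat doubles embedded quotes
        $out .= "        intervals [" . ($i + 1) . "]:\n"
              . "            xmin = $start\n            xmax = $end\n"
              . "            text = \"$text\"\n";
    }
    return $out;
}

$vtt = "WEBVTT\n\n00:00:01.500 --> 00:00:04.000\nHello, world!\n";
echo cuesToTextGrid(parseVttCues($vtt), 5.0);
```

Note that this sketch writes only the subtitled stretches, so any silence between cues is left out of the tier.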

Pairing the WAVE file and the TextGrid revealed some areas for improvement of both Praat and Universal Subtitles. As is inherent in the whole idea of subtitles, dead spaces in the audio (parts without speech) do not get subtitled; thus, the subsequently-generated TextGrid does not fully specify the sections of the tier. Praat doesn’t really like that, as a regular TextGrid specifies even the empty portions of the tier, having positive data from start to finish. Loading a TextGrid that has pieces missing does not make Praat happy, but it still works. (You can practically see the grimace on its face as it sucks it up and makes do.) On the flipside, when matching audio and subtitles up in Praat, you can see how imprecise crowd-sourced subtitling can be. Though it might seem OK when using the web interface to subtitle a video as it plays, when you see the text lined up next to the waveform, it becomes clear that things could be better aligned. (It wouldn’t be that big of a deal, but there doesn’t seem to be any “advanced” interface on Universal Subtitles. There’s not even a place to upload an existing subtitle format, so that you can make corrections locally and upload the new file back to Universal Subtitles.)
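One way to keep Praat happier would be to pad the cue list with empty intervals before writing the TextGrid, so that the tier runs unbroken from start to finish. The following is a sketch under stated assumptions, not something the code described here does: `fillGaps` is a made-up name, cues are assumed to be `[start, end, text]` triples in seconds, and the audio duration is assumed to be known.

```php
<?php
// Hypothetical sketch: pad [start, end, text] cues with empty intervals so
// the tier covers everything from 0 to the end of the audio.
function fillGaps(array $cues, float $duration): array
{
    // Sort by start time in case the subtitle file is out of order.
    usort($cues, function ($a, $b) {
        return $a[0] <=> $b[0];
    });

    $filled = [];
    $cursor = 0.0; // end time of the last subtitled interval emitted so far
    foreach ($cues as [$start, $end, $text]) {
        if ($start > $cursor) {
            $filled[] = [$cursor, $start, '']; // silence before this cue
        }
        // Clamp overlapping cues so intervals never run backwards.
        $start = max($start, $cursor);
        if ($end > $start) {
            $filled[] = [$start, $end, $text];
            $cursor = $end;
        }
    }
    if ($cursor < $duration) {
        $filled[] = [$cursor, $duration, '']; // trailing silence
    }
    return $filled;
}

print_r(fillGaps([[1.5, 4.0, 'Hello'], [6.0, 7.0, 'world']], 10.0));
```

With two cues and a ten-second clip, this yields five intervals: silence, "Hello", silence, "world", silence. Clamping overlapping cues is a design choice of the sketch; a real tool might prefer to warn about them instead.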

Once I had the text and audio lined up, I started extracting formant info from multiple instances of the word “video”. (It appears a total of 11 times in the introductory video, including once in the compound “videomakers” and once in the plural “videos”.) I later found out about AntConc, and the concept of concordance in general, which helped to identify other words and combinations of words that appear multiple times in the full text of the transcription. (AntConc contains an N-gram viewer, as well.) Long story short, I recorded the formants of the 11 instances of the word “video” and made them available in this Google Docs spreadsheet.

And that’s where the fun started. As you can see in that spreadsheet, I attempted to graph these data points so that I could look at them all together.
[Figure: Formant Frequencies of "video"]
Unfortunately, all of the spreadsheet programs I tried (basically everything besides Excel, because I don’t have that) could not graph the data points on top of each other, so that one could fully see the similarities and differences among the formant frequencies. I thought I was going to have to settle for that, but then Michael Newman reminded me that R Can Do Anything™. So I finally had my task to help me learn R.

After a great deal of fiddling around and reading the docs, to not much avail, I headed over to see if the folks in #R could help. It turns out they were more than willing to demonstrate how awesome R is; I am particularly indebted to mrflick, who answered all of my silly questions, one after another. (And, more importantly, and to my amazement, had an answer for all my questions, proving that R really can do anything.) With some more playing around, I was finally able to issue the right commands in the right order and come up with this:
[Figure: Formant Frequencies of "video" (all)]
To construct that graph, I still had to somewhat fudge the data by adding an artificial “meter” column so that each instance of the word turned out to be the same width on the graph, but at least R didn’t complain that I was Doing It Wrong™. I also excluded the two instances where the word wasn’t plain ol’ “video”, to avoid creating misleading patterns. I haven’t yet figured out how to automatically calculate the average wave/frequency patterns, but at least now the graph allows you to see them visually. (I suspect I would have to format my data differently to give R enough information to do that properly—right now it doesn’t know that the data is from nine separate instances of the word.)

To top things off, the R script that I used to generate the data is available as a gist on GitHub (released under CC-BY-SA), so feel free to remix the data or improve the script. Just be sure to leave me a comment here telling me what you did! Also, any ideas on where to go from here are welcome!

Posted in Linguistics, Mozilla, Open Source, SourceForge, Web Development

PHP, MySQL, and the BIT field type

Posted by Gordon P. Hemsley on February 8, 2010

As Dave Humphrey once taught me:

When you do a search, and it comes back with no results, it’s a sign that you need to write something.

This is an issue that I came across while testing SASHA (which is available for you to try out, by the way), and I didn’t know if it was a bug or a feature. I could find no mention of it anywhere, and the people in the #mysql IRC channel on Freenode weren’t much help in getting to the bottom of it.

What is the issue, you ask? Well, even that in and of itself is a question, because I don’t know whether it’s a bug (or feature) in PHP or MySQL. However, I’m inclined to think it’s the latter, and I’ll get to why in a moment.

But first, some background. The table that SASHA uses to store schedules uses the BIT field type for keeping track of which days of the week a schedule occurs on. I figured it’d be easiest to use a 7-bit field and just flip a bit for each day of the week. And that worked fine for me on my local test server. But then I had a colleague test SASHA out on his test server, and things went a little wacky.
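As an illustration of the one-bit-per-day idea, here is a minimal sketch. It is not taken from SASHA: the bit order (bit 0 for Sunday through bit 6 for Saturday), the day abbreviations, and the function names are all assumptions made for the example.

```php
<?php
// Hypothetical bit layout: bit 0 = Sunday ... bit 6 = Saturday.
const DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];

// Combine a list of day names into a 7-bit mask.
function daysToMask(array $days): int
{
    $mask = 0;
    foreach ($days as $day) {
        $bit = array_search($day, DAYS, true);
        if ($bit === false) {
            throw new InvalidArgumentException("Unknown day: $day");
        }
        $mask |= 1 << $bit; // flip the bit for this day
    }
    return $mask;
}

// List the day names whose bits are set in the mask.
function maskToDays(int $mask): array
{
    $days = [];
    foreach (DAYS as $bit => $day) {
        if ($mask & (1 << $bit)) {
            $days[] = $day;
        }
    }
    return $days;
}

$mask = daysToMask(['Mon', 'Wed', 'Fri']);
printf("%07b\n", $mask);      // 0101010
print_r(maskToDays($mask));
```

A Monday/Wednesday/Friday schedule comes out as the integer 42 under this layout, which fits comfortably in a 7-bit field.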

It took a little while to figure out what was causing our problem, and we finally got to the bottom of it: I was using MySQL 5.0 and he was using MySQL 5.1! Apparently, between 5.0 and 5.1, the return format of a BIT field changed from the literal binary data (output in the browser as a character, because the browser didn’t know it wasn’t) to a decimal representation of that data.

The first problem with that was that I had no idea there was a possibility of getting anything but the raw binary data I was getting on my server. The second problem was coming up with a straightforward solution to detecting whether the database was feeding us raw binary data or converted decimal data. There was no direct way to do this, but I figured out the next best thing. A simple way to check what kind of data we’re getting is to find out whether it converts cleanly to an actual character. Here’s an excerpt from SASHA that demonstrates:

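// MySQL 5.0 returns bit as binary, while MySQL 5.1 returns decimal
// A raw 7-bit value is a single byte, so it survives an ord()/chr() round trip;
// multi-digit decimal text such as "42" does not. Strict comparison avoids
// PHP's numeric-string juggling. Caveat: a one-digit decimal ("0" to "9") is
// indistinguishable from a raw byte and is still reported as binary.
if( $days === chr( ord( $days ) ) )
{
	$input = 'binary';
}
else
{
	$input = 'decimal';
}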

That seems to do the trick when it comes to handling unpredictable BIT field data.
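A variation on the same idea is to normalize whatever comes back into an integer right away, so the rest of the code never has to care which format it received. The sketch below is an illustration, not an excerpt from SASHA: `bitFieldToInt` is a made-up name, and it assumes the value arrives either as an integer, as decimal text, or as raw bytes.

```php
<?php
// Hypothetical helper: turn whatever the driver hands back for a BIT column
// into an integer bitmask.
function bitFieldToInt($value): int
{
    if (is_int($value)) {
        return $value; // already numeric
    }
    $value = (string) $value;
    if ($value !== '' && ctype_digit($value)) {
        // Decimal text such as "42". A lone raw byte in the range "0" to "9"
        // (byte values 48-57) would be misread here; that ambiguity is
        // inherent in guessing the format from the data alone.
        return (int) $value;
    }
    // Raw binary: fold the bytes together, most significant byte first.
    $int = 0;
    foreach (str_split($value) as $byte) {
        $int = ($int << 8) | ord($byte);
    }
    return $int;
}

var_dump(bitFieldToInt('42'));   // decimal text
var_dump(bitFieldToInt('*'));    // raw byte with value 42
```

Unlike the excerpt above, this sketch treats a single digit as decimal rather than binary; neither guess can be right in every case. A way to sidestep the guessing, if changing the query is an option, is to ask MySQL for a number directly, for example by selecting the column in a numeric context (`days + 0`, with `days` being a hypothetical column name).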

(Again, I can’t guarantee that this isn’t actually a PHP issue, but I seem to recall us both being around the same version of PHP.)

If you have any insight into the matter, please do leave a comment.

Posted in Mozilla, Open Source, SourceForge

SASHA 0.1.0-RC1 Released

Posted by Gordon P. Hemsley on July 12, 2009

It’s been a while since I’ve updated here, but I’d like to take the opportunity now to announce the first ever release of the Student Assignment, Scheduling, and Homework Assistant. SASHA 0.1.0-RC1 was released a few weeks ago for testing purposes, before 0.1.0 final is released.

Here is the official announcement:

This marks the first ever release of the Student Assignment, Scheduling, and Homework Assistant. SASHA is a tool designed to help students keep track of their assignments, tests, and other time-sensitive items. This release candidate includes features related to courses in educational institutions, including schedule items, assignments, and tests.

This release candidate is also a call for testers to help ensure that the release is relatively bug-free, and in a usable state. If no major defects are found, then the 0.1.0-RC1 code will be re-released as 0.1.0; else, another release candidate will be forthcoming.

Since SASHA requires institution-specific packages to operate efficiently, this release is mostly geared towards students at the University of Vermont (UVM), as that is the only package that is currently available. Packages must be downloaded separately.

SASHA Website: http://sasha.sourceforge.net/
SASHA Project Page: https://sourceforge.net/projects/sasha/
SASHA 0.1.0-RC1 Files: https://sourceforge.net/project/showfiles.php?group_id=163392&release_id=632770
SASHA 0.1.0-RC1 Release Notes: https://sourceforge.net/project/shownotes.php?group_id=163392&release_id=632770
SASHA Institution Packages: https://sourceforge.net/projects/sasha-pkg/

(This announcement was also submitted to the Slashdot Firehose.)

I’m currently looking for testers to give SASHA a beating to ensure that I haven’t missed any bugs. I’ve been using everything personally (i.e. dogfooding it) since I began writing it, but that also means that I may be too close to it to find the kinks. You may attempt to set up your own instance of SASHA, if you’d like, but be sure to read the instructions set forth in docs/INSTALL. You’ll be lost otherwise.

If you’re worried about setting SASHA up yourself, but are willing to try to help anyway, I’ve also set up a demo at YourSASHA.com for testing (and demonstration) purposes. If you have any comments, questions, suggestions, or any other feedback at all, please do not hesitate to join me in #SASHA-dev on Freenode. I’ll help you out.

Please forgive me for tagging this with Mozilla. While it’s not directly Mozilla-related, it is education-related, so I hope you’ll let it slide. More importantly, I hope SASHA will be able to help you at some point in the future.

Posted in Mozilla, Open Source, SourceForge