Gordon P. Hemsley

Linguist by day. Web developer by night.

Archive for the ‘Web Development’ Category

Posts about web development.

From Universal Subtitles to formant graphs

Posted by Gordon P. Hemsley on March 11, 2011

A few weeks ago, I watched Watson win on Jeopardy! and did a couple of Skype interviews for graduate school. It really got me thinking about machine learning and natural language processing, as I’ve been looking for a way to tie in my web development and programming skills with linguistics in a way that will benefit both fields, as well as “regular” people.

After coming up with and discarding a number of ideas, a shower thinking session led me to a realization: Universal Subtitles is, in effect, building up a corpus of matched video, audio, transcription, and translation! And, because it’s all open, you can remix and play with the data all you want. So, to do just that, I selected the most translated video (which also happened to be the Universal Subtitles introductory video) and began exploring. At first, I looked at a couple of the translations, thinking about doing something with automatic, statistical translation (which is pretty much what Google Translate does, I believe). But I have very little knowledge of that area, so I hit upon another idea: Extracting the text and the audio and matching it up in Praat—automatically.

To do this, I needed the timing information along with the text, and I needed to separate the audio from the video. Doing the latter wound up being the easy part, once I found OggSplit (a part of Ogg Video Tools). I just ran the open Ogg Video file through OggSplit and used Audacity to convert the resulting Ogg Audio file into a WAVE file for Praat to read. But the next part required some work. I remembered from a while back that Hixie and the WHATWG were working on an open subtitling standard, called WebSRT. When I went looking for that, I found that it had been renamed to WebVTT. Universal Subtitles exports its subtitles into many formats, but not yet to WebVTT. Luckily, the renaming of the WebSRT standard to WebVTT was mostly to avoid having to overcome some rather obscure processing issues; the majority of SRT files can become WebVTT files with very little effort. (Namely, adding a “WEBVTT” header and converting commas to periods in the timestamps.) I then wrote a script which read the newly-minted WebVTT files and converted them to Praat TextGrids. (This involved “reverse-engineering” the format from TextGrids output by Praat, as I couldn’t find any documentation on the file format itself.)
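The SRT-to-WebVTT step described above can be sketched in a few lines. This is a Python illustration, not the PHP code from the post; the function name `srt_to_webvtt` and the sample cue are made up for the example, and it assumes a plain, well-formed SRT file with no styling.

```python
import re

# Matches SRT-style timestamps such as 00:00:01,500 (comma before milliseconds).
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT: add the header, fix the timestamps."""
    # Normalize line endings and drop a byte-order mark if one is present.
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    # Only rewrite lines containing the cue arrow, so commas in dialogue survive.
    lines = [
        SRT_TIMESTAMP.sub(r"\1.\2", line) if "-->" in line else line
        for line in body.split("\n")
    ]
    # A WebVTT file starts with the string WEBVTT followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


# Hypothetical one-cue SRT file.
sample = "1\n00:00:01,500 --> 00:00:04,000\nHello, world.\n"
print(srt_to_webvtt(sample))
```

Restricting the substitution to timing lines is a deliberate choice: a blanket comma-to-period replacement would also mangle any subtitle text that happens to look like a timestamp.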

By the end of this, I had PHP code that could convert unformatted WebVTT files into single-tier TextGrids. I plan to open-source the code, but I can’t decide on a few things: (1) what to call it, (2) where to put it (SourceForge or GitHub), and (3) whether to separate the two out into different projects. (Naturally, the WebVTT parser and the TextGrid generator are in separate modules, but there remain some assumption-based dependencies between them.)

Pairing the WAVE file and the TextGrid revealed some areas for improvement of both Praat and Universal Subtitles. As is inherent in the whole idea of subtitles, dead spaces in the audio (parts without speech) do not get subtitled; thus, the subsequently-generated TextGrid does not fully specify the sections of the tier. Praat doesn’t really like that, as a regular TextGrid specifies even the empty portions of the tier, having positive data from start to finish. Loading a TextGrid that has pieces missing does not make Praat happy, but it still works. (You can practically see the grimace on its face as it sucks it up and makes do.) On the flipside, when matching audio and subtitles up in Praat, you can see how imprecise crowd-sourced subtitling can be. Though it might seem OK when using the web interface to subtitle a video as it plays, when you see the text lined up next to the waveform, it becomes clear that things could be better aligned. (It wouldn’t be that big of a deal, but there doesn’t seem to be any “advanced” interface on Universal Subtitles. There’s not even a place to upload an existing subtitle format, so that you can make corrections locally and upload the new file back to Universal Subtitles.)
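One way to keep Praat happy would be to pad the unsubtitled stretches with empty intervals before writing the tier. The following Python sketch shows the idea; it is not the PHP generator from the post. The function names, the tier name "subtitles", and the sample cues are all assumptions, and the output layout is modeled on the long text format that Praat itself writes.

```python
def fill_gaps(cues, duration):
    """Return intervals covering 0..duration, adding empty ones for silence.

    `cues` is a list of (start, end, text) tuples in seconds.
    """
    intervals, cursor = [], 0.0
    for start, end, text in sorted(cues):
        start = max(start, cursor)  # clamp cues that overlap the previous one
        if end <= start:
            continue  # nothing left of this cue after clamping
        if start > cursor:
            intervals.append((cursor, start, ""))  # unsubtitled stretch
        intervals.append((start, end, text))
        cursor = end
    if cursor < duration:
        intervals.append((cursor, duration, ""))  # trailing silence
    return intervals


def to_textgrid(intervals, tier_name="subtitles"):
    """Serialize gap-free intervals as a single-tier TextGrid (long format)."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, text) in enumerate(intervals, 1):
        escaped = text.replace('"', '""')  # double quotes are escaped by doubling
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Hypothetical cues for a 10-second clip.
cues = [(1.5, 4.0, "Hello"), (6.0, 8.0, 'the word "video"')]
intervals = fill_gaps(cues, 10.0)
grid = to_textgrid(intervals)
print(grid)
```

With the two sample cues, the tier ends up with five intervals: three empty ones around and between the two subtitled ones.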

Once I had the text and audio lined up, I started extracting formant info from multiple instances of the word “video”. (It appears a total of 11 times in the introductory video, including once in the compound “videomakers” and once in the plural “videos”.) I later found out about AntConc, and the concept of concordance in general, which helped to identify other words and combinations of words that appear multiple times in the full text of the transcription. (AntConc contains an N-gram viewer, as well.) Long story short, I recorded the formants of the 11 instances of the word “video” and made them available in this Google Docs spreadsheet.
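For readers unfamiliar with concordances, the basic idea fits in a short Python sketch. This is not how AntConc works internally; the tokenizer is deliberately crude, the helper names are invented, and the sample sentence is made up rather than taken from the actual transcript.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=3):
    """List each token starting with `keyword`, with `width` words of context."""
    hits = []
    for i, token in enumerate(tokens):
        if token.startswith(keyword):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{token}] {right}".strip())
    return hits


# Hypothetical transcript text, not the real subtitles.
transcript = (
    "Video is everywhere. We want every video to be subtitled, "
    "so videomakers can reach more viewers."
)
tokens = tokenize(transcript)
hits = concordance(tokens, "video")
for line in hits:
    print(line)
print(ngram_counts(tokens, 2).most_common(3))
```

Matching on the prefix rather than the exact word is what catches forms like “videomakers” and “videos” alongside plain “video”.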

And that’s where the fun started. As you can see in that spreadsheet, I attempted to graph these data points so that I could look at them all together.
Formant Frequencies of "video"
Unfortunately, all of the spreadsheet programs I tried (basically everything besides Excel, because I don’t have that) could not graph the data points on top of each other, so that one could fully see the similarities and differences among the formant frequencies. I thought I was going to have to settle for that, but then Michael Newman reminded me that R Can Do Anything™. So I finally had my task to help me learn R.

After a great deal of fiddling around and reading the docs, to not much avail, I headed over to see if the folks in #R could help. It turns out they were more than willing to demonstrate how awesome R is; I am particularly indebted to mrflick, who answered all of my silly questions, one after another. (And, more importantly, and to my amazement, had an answer for all my questions, proving that R really can do anything.) With some more playing around, I was finally able to issue the right commands in the right order and come up with this:
Formant Frequencies of "video" (all)
To construct that graph, I still had to somewhat fudge the data by adding an artificial “meter” column so that each instance of the word turned out to be the same width on the graph, but at least R didn’t complain that I was Doing It Wrong™. I also excluded the two instances where the word wasn’t plain ol’ “video”, to avoid creating misleading patterns. I haven’t yet figured out how to automatically calculate the average wave/frequency patterns, but at least now the graph allows you to see them visually. (I suspect I would have to format my data differently to give R enough information to do that properly—right now it doesn’t know that the data is from nine separate instances of the word.)
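One possible approach to the averaging problem is to resample every instance onto the same number of evenly spaced points (much like the artificial “meter” column) and then average point by point. The sketch below is in Python rather than R, uses plain linear interpolation, and runs on invented F1 values, not the numbers from the spreadsheet; whether a simple mean is the right summary for formant tracks is an open question.

```python
def resample(track, points):
    """Linearly interpolate `track` onto `points` evenly spaced positions.

    Assumes points >= 2 and a non-empty track.
    """
    if len(track) == 1:
        return [float(track[0])] * points
    out = []
    for k in range(points):
        # Position of the k-th output point on the original index scale.
        pos = k * (len(track) - 1) / (points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def average_tracks(tracks, points=5):
    """Average several tracks of different lengths on a common time scale."""
    resampled = [resample(track, points) for track in tracks]
    return [sum(column) / len(column) for column in zip(*resampled)]


# Hypothetical F1 measurements (Hz) for two instances of a word.
tracks = [
    [300, 400, 500],
    [320, 380, 440, 500, 560],
]
mean_track = average_tracks(tracks, points=5)
print(mean_track)
```

Keeping each instance as its own list is also the kind of restructuring that tells the software which measurements belong together.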

To top things off, the R script that I used to generate the data is available as a gist on GitHub (released under CC-BY-SA), so feel free to remix the data or improve the script. Just be sure to leave me a comment here telling me what you did! Also, any ideas on where to go from here are welcome!

Posted in Linguistics, Mozilla, Open Source, SourceForge, Web Development

wh-movement and T→C movement in English interrogatives

Posted by Gordon P. Hemsley on June 9, 2010

While I was doing my take-home syntax final exam (why do I feel like the modifier order is off in that phrase?) a couple of weeks ago, one of the questions got me thinking. The section of the exam was testing our knowledge of wh-movement and T→C movement in questions, and one particular sentence was giving me a little bit of trouble. To try to figure out where things were supposed to move to, I wound up creating what I’m calling a trace table. That is, a table comparing various related sentences and demonstrating the motivation for various movements. (It’s called a trace table because it allows for an easy comparison of the locations of the tracers and the tracees. And yes, I did just make up those words; and no, I didn’t bother to figure out which is which.)

The particular sentences I used for this trace table all had to do with a man, a cat, and the act of stealing.

I haven’t mentioned yet precisely what about the test question was giving me trouble. It was the fact that, in certain situations, T→C movement does not occur in interrogatives. (Questions in English are normally formed using T→C movement, otherwise known as subject–auxiliary inversion.) So, I decided to figure out exactly what that environment was. We’d (accidentally) referenced the situation in class before, but we never went into detail. (Someone happened to ask about a sentence where T→C movement did not occur, and the instructor admitted that she’d been trying to avoid those sentences, so as to avoid overly complicating the lesson.) Beyond that, though, I don’t know what research, if any, has been done regarding these situations. (I assume there has been research, but my extremely brief search did not turn up any.)

Anyway, once the semester was over, I decided to formalize and prettify my trace table and put it up on the Web for all to see.

wh-movement and T→C movement in English interrogatives

The dedicated page goes into more detail, but what it seems to boil down to is this: T→C movement does not occur when there is a trace in the subject position (SpecTP) of the main clause.

I greatly encourage feedback about this, but please read the whole page first, as it has much explanation and background, as well as a more in-depth description of my conclusions. (And please pardon my extensive use of parentheticals in this post; I’m rather tired at the moment, and my brain is wandering all over the place.)

Posted in Linguistics, Web Development

Calling all HTML5 and Bugzilla enthusiasts!

Posted by Gordon P. Hemsley on February 20, 2010

Earlier in the week, I went through the process of filing and fixing bugs 546338 and 546340, both related to fixing <a name> problems in Bugzilla. Once that was successful, I got the idea to do a major overhaul of the Bugzilla templates in order to upgrade them from HTML4 code to HTML5 code (sans presentational markup, which Bugzilla has a ton of). I’ve filed bugs 546838, 547171, 546353, 547311, and 547389 for this purpose.

After spending a few days attempting to accomplish something, under the very helpful and reassuring guidance of Max Kanat-Alexander, I realized that it was a bit much for one person to take on. The sheer number of instances of presentational markup (and I only got so far as looking at @align, @cellspacing, and @cellpadding) is quite overwhelming.
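To get a sense of the scale of such an audit, one could tally the presentational attributes across the templates. The Python sketch below is only an illustration: it is not part of Bugzilla, the regular expression is a rough heuristic that does not parse HTML properly, and the markup sample is invented.

```python
import re
from collections import Counter

# The three attributes mentioned above; extend the tuple to audit more.
PRESENTATIONAL = ("align", "cellspacing", "cellpadding")

# \b keeps "valign" from being counted as "align".
ATTRIBUTE = re.compile(
    r"\b(" + "|".join(PRESENTATIONAL) + r")\s*=", re.IGNORECASE
)


def count_presentational(markup):
    """Count occurrences of each presentational attribute in a string."""
    return Counter(m.group(1).lower() for m in ATTRIBUTE.finditer(markup))


# Hypothetical template fragment.
sample = (
    '<table cellspacing="0" cellpadding="4">'
    '<tr><td ALIGN="right" valign="top">12345</td></tr>'
    "</table>"
)
counts = count_presentational(sample)
print(counts)
```

Running a counter like this over every template file would give a rough per-attribute total, which could help when splitting the work into bugs of manageable size.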

But then I thought: This would be a perfect series of bugs for ‘student-project’; that is, the keyword used to attract open source students to specific bugs that they can tackle during a semester. If we can get a group of students together, along with myself and Max, we can probably accomplish this much more quickly.

If you’re interested in helping out, or you know a student who may fit that description, drop by #mozwebtools on irc.mozilla.org and ping GPHemsley or mkanat.

Posted in Mozilla, Web Development

GPHemsley.org

Posted by Gordon P. Hemsley on October 13, 2009

I know it’s been a while since I last wrote something here, but I wanted to pop my head in to make a long-overdue announcement. I’ve finally gotten myself an official, centralized place on the Internet: the aptly-named GPHemsley.org. (The .org part means that all donations are accepted—just don’t expect them to be tax deductible.)

I still haven’t gotten it to the point where it contains everything you might want to know about me, but my goal is to eventually make it a one-stop shop for everything I’ve ever done on the Internet. Ever. Right now, though, it just has a list of my blogs and notable papers I’ve written in my college (i.e. adult) career.

I felt this announcement was especially important to make now because there are two linguistics-related blogs writing posts about topics I’ve brought up, and I wouldn’t want to poop the party and have you find out about my new website from them. Perhaps I’ll have made more progress on my website by a week tomorrow.

Posted in Linguistics, Open Source, Web Development

Online Video Editing Using HTML5 <video>?

Posted by Gordon P. Hemsley on April 30, 2009

This thought just popped into my head a couple of seconds ago, so I thought I’d throw it out there. Has anyone considered (or is anyone actively developing) an online video editing service that takes advantage of all the nice features afforded by HTML5’s <video> tag? It just seems like it would be the perfect thing to do, especially with support in the upcoming Firefox 3.5 and Safari 4 releases.

Any thoughts?

Update: WTF? WordPress doesn’t automatically escape HTML symbols in post titles?!

Update 2: Nor does it support <small> tags in its posts?!

Posted in Mozilla, Open Source, Web Development

 