Gordon P. Hemsley

Linguist by day. Web developer by night.

Posts Tagged ‘Linguistics’

Linguistics and the Open Web

Posted by Gordon P. Hemsley on May 15, 2011

Last Saturday, I gave a talk entitled Linguistics and the Open Web at HULLS 2011, the first-annual Hunter [College] Undergraduate Linguistics and Language Studies Conference, organized by the Hunter College Linguistics Club. My talk kicked off the event, at which undergraduate students presented their research. (I later joked that it perhaps should have been called HULA, for “Hunter Undergraduate Linguistics and Activism”, because many of the student talks—including my own—were more of a call to action than particularly academic research… but I digress.) The keynote speakers were Doug Bigham and Ben Zimmer. Click through to the Linguistics Club site to see the full program of talks.

This talk was the first step in my attempt to somehow tie together my open-source/Mozilla life with the linguistics that I have really come to love over these past two years. I think it’s a good first step.

And it was nice to finally meet in person a lot of the people that I had previously known only through Twitter. And, actually, it’s good to know that my rate of meeting such people is increasing. I’ve only known most of them for about a year, as opposed to the 7 or so years it took to meet the Mozilla and phpBB folks last year.

My talk was only allotted 15 minutes, so it’s rather brief, but I think the slides I’ve made available get across much of what my talk did. It tries to answer these three questions:

  • What is the Open Web?
  • How does the Open Web relate to linguistics?
  • What can I do to participate in the Open Web?

I hope to be able to expand and improve this talk in the future. (It already includes a separate print stylesheet, so if you want to print it out, it comes out pretty.)

In fact, I’ve released it under CC-BY-NC-SA and I plan to put it on GitHub or something so that maybe we could even get it translated into a bunch of languages! And if you want to give the talk yourself, feel free. (Just drop me a line to let me know.)

Your feedback is greatly appreciated!

Posted in Linguistics, Mozilla, Open Source

From Universal Subtitles to formant graphs

Posted by Gordon P. Hemsley on March 11, 2011

A few weeks ago, I watched Watson win on Jeopardy! and did a couple of Skype interviews for graduate school. It really got me thinking about machine learning and natural language processing, as I’ve been looking for a way to tie in my web development and programming skills with linguistics in a way that will benefit both fields, as well as “regular” people.

After coming up with and discarding a number of ideas, a shower thinking session led me to a realization: Universal Subtitles is, in effect, building up a corpus of matched video, audio, transcription, and translation! And, because it’s all open, you can remix and play with the data all you want. So, to do just that, I selected the most translated video (which also happened to be the Universal Subtitles introductory video) and began exploring. At first, I looked at a couple of the translations, thinking about doing something with automatic, statistical translation (which is pretty much what Google Translate does, I believe). But I have very little knowledge of that area, so I hit upon another idea: Extracting the text and the audio and matching it up in Praat—automatically.

To do this, I needed the timing information along with the text, and I needed to separate the audio from the video. Doing the latter wound up being the easy part, once I found OggSplit (a part of Ogg Video Tools). I just ran the open Ogg Video file through OggSplit and used Audacity to convert the resulting Ogg Audio file into a WAVE file for Praat to read. But the next part required some work. I remembered from a while back that Hixie and the WHATWG were working on an open subtitling standard, called WebSRT. When I went looking for that, I found that it had been renamed to WebVTT. Universal Subtitles exports its subtitles into many formats, but not yet to WebVTT. Luckily, the renaming of the WebSRT standard to WebVTT was mostly to avoid having to overcome some rather obscure processing issues—the majority of SRT files can become WebVTT files with very little effort. (Namely, adding a “WEBVTT” header and converting commas to periods in the timestamps.) I then wrote a script which read the newly-minted WebVTT files and converted them to Praat TextGrids. (This involved “reverse-engineering” the format from TextGrids output by Praat, as I couldn’t find any documentation on the file format itself.)
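
The SRT-to-WebVTT step described above can be sketched in a few lines. This is only an illustration in Python, not the PHP code the post describes; the function name `srt_to_webvtt` and the `TIMING` pattern are hypothetical, and it assumes simple, unformatted SRT input. It adds the `WEBVTT` header line that the WebVTT format requires and swaps the comma for a period in timing lines only, so commas in the subtitle text itself are left alone.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,250".
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT: add the header, fix the timestamps."""
    lines = []
    for line in srt_text.replace("\r\n", "\n").split("\n"):
        match = TIMING.match(line)
        if match:
            start, start_ms, end, end_ms, rest = match.groups()
            # Only the millisecond separator changes: comma becomes period.
            line = f"{start}.{start_ms} --> {end}.{end_ms}{rest}"
        lines.append(line)
    # The file must begin with the header, followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(lines).strip() + "\n"
```

The numeric SRT cue counters are left in place, since WebVTT allows an identifier line before each cue's timing line.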

By the end of this, I had PHP code that could convert unformatted WebVTT files into single-tier TextGrids. I plan to open-source the code, but I can’t decide on a few things: (1) what to call it, (2) where to put it (SourceForge or GitHub), and (3) whether to separate the two out into different projects. (Naturally, the WebVTT parser and the TextGrid generator are in separate modules, but there remain some assumption-based dependencies between them.)

Pairing the WAVE file and the TextGrid revealed some areas for improvement of both Praat and Universal Subtitles. As is inherent in the whole idea of subtitles, dead spaces in the audio (parts without speech) do not get subtitled; thus, the subsequently-generated TextGrid does not fully specify the sections of the tier. Praat doesn’t really like that, as a regular TextGrid specifies even the empty portions of the tier, having positive data from start to finish. Loading a TextGrid that has pieces missing does not make Praat happy, but it still works. (You can practically see the grimace on its face as it sucks it up and makes do.) On the flipside, when matching audio and subtitles up in Praat, you can see how imprecise crowd-sourced subtitling can be. Though it might seem OK when using the web interface to subtitle a video as it plays, when you see the text lined up next to the waveform, it becomes clear that things could be better aligned. (It wouldn’t be that big of a deal, but there doesn’t seem to be any “advanced” interface on Universal Subtitles. There’s not even a place to upload an existing subtitle format, so that you can make corrections locally and upload the new file back to Universal Subtitles.)
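
One way to avoid handing Praat a tier with holes in it is to pad the subtitle cues with empty intervals before writing the TextGrid. The sketch below is a hypothetical Python illustration, not the PHP code mentioned above: `fill_gaps` and `to_textgrid` are made-up names, cues are assumed to be `(start, end, text)` tuples in seconds, and the output follows the long text format that Praat itself writes for a single interval tier.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start seconds, end seconds, text)


def fill_gaps(cues: List[Interval], duration: float) -> List[Interval]:
    """Return intervals covering 0..duration, with "" for unsubtitled spans."""
    filled: List[Interval] = []
    cursor = 0.0
    for start, end, text in sorted(cues):
        if end <= cursor:
            continue  # cue lies entirely inside an earlier one; skip it
        if start > cursor:
            filled.append((cursor, start, ""))  # silence before this cue
        filled.append((max(start, cursor), end, text))
        cursor = end
    if cursor < duration:
        filled.append((cursor, duration, ""))  # trailing silence
    return filled


def to_textgrid(intervals: List[Interval], tier_name: str = "subtitles") -> str:
    """Serialize gap-free intervals as a single-tier TextGrid (long format)."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, text) in enumerate(intervals, 1):
        escaped = text.replace('"', '""')  # quotes are doubled inside strings
        out += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(out) + "\n"
```

For example, `to_textgrid(fill_gaps(cues, duration))` would give a tier that has an interval for every stretch of the recording, where `duration` is the length of the WAVE file in seconds.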

Once I had the text and audio lined up, I started extracting formant info from multiple instances of the word “video”. (It appears a total of 11 times in the introductory video, including once in the compound “videomakers” and once in the plural “videos”.) I later found out about AntConc, and the concept of concordance in general, which helped to identify other words and combinations of words that appear multiple times in the full text of the transcription. (AntConc contains an N-gram viewer, as well.) Long story short, I recorded the formants of the 11 instances of the word “video” and made them available in this Google Docs spreadsheet.
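
The idea behind a concordance and an N-gram count is simple enough to sketch. The Python below is a toy illustration, not a description of how AntConc works; the function names are hypothetical, and any text passed in is up to the reader (the transcript itself is not reproduced here). The keyword search is a prefix match, so looking for "video" also finds "videos" and "videomakers".

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """List each token starting with keyword, with `width` words of context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.startswith(keyword):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{tok}] {right}".strip())
    return hits
```

Calling `ngram_counts(tokens, 2).most_common(10)` on a tokenized transcript would list the ten most frequent two-word combinations, which is one way to pick other candidates for formant comparison.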

And that’s where the fun started. As you can see in that spreadsheet, I attempted to graph these data points so that I could look at them all together.
[Figure: Formant Frequencies of "video"]
Unfortunately, all of the spreadsheet programs I tried (basically everything besides Excel, because I don’t have that) could not graph the data points on top of each other, so that one could fully see the similarities and differences among the formant frequencies. I thought I was going to have to settle for that, but then Michael Newman reminded me that R Can Do Anything™. So I finally had my task to help me learn R.

After a great deal of fiddling around and reading the docs, to not much avail, I headed over to see if the folks in #R could help. It turns out they were more than willing to demonstrate how awesome R is; I am particularly indebted to mrflick, who answered all of my silly questions, one after another. (And, more importantly, and to my amazement, had an answer for all my questions, proving that R really can do anything.) With some more playing around, I was finally able to issue the right commands in the right order and come up with this:
[Figure: Formant Frequencies of "video" (all)]
To construct that graph, I still had to somewhat fudge the data by adding an artificial “meter” column so that each instance of the word turned out to be the same width on the graph, but at least R didn’t complain that I was Doing It Wrong™. I also excluded the two instances where the word wasn’t plain ol’ “video”, to avoid creating misleading patterns. I haven’t yet figured out how to automatically calculate the average wave/frequency patterns, but at least now the graph allows you to see them visually. (I suspect I would have to format my data differently to give R enough information to do that properly—right now it doesn’t know that the data is from nine separate instances of the word.)
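
One possible route to an average pattern is to put every instance on the same relative time scale (which is roughly what the artificial “meter” column does) and then average point by point. The sketch below is a hypothetical Python illustration, not the R script from the gist: `resample` and `average_tracks` are made-up names, each track is assumed to be a list of measurements of one formant for one instance of the word, and linear interpolation is just one reasonable choice.

```python
from typing import List, Sequence


def resample(track: Sequence[float], points: int) -> List[float]:
    """Linearly interpolate a track onto `points` (>= 2) evenly spaced positions."""
    if len(track) == 1:
        return [float(track[0])] * points
    out = []
    for k in range(points):
        pos = k * (len(track) - 1) / (points - 1)  # fractional index into track
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def average_tracks(tracks: Sequence[Sequence[float]], points: int = 5) -> List[float]:
    """Average several tracks of different lengths on a shared 0..1 time axis."""
    resampled = [resample(t, points) for t in tracks]
    return [sum(column) / len(column) for column in zip(*resampled)]
```

Keeping one track per instance, rather than one long column of values, is also what tells the code that the data comes from separate tokens of the word.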

To top things off, the R script that I used to generate the data is available as a gist on GitHub (released under CC-BY-SA), so feel free to remix the data or improve the script. Just be sure to leave me a comment here telling me what you did! Also, any ideas on where to go from here are welcome!

Posted in Linguistics, Mozilla, Open Source, SourceForge, Web Development

Catching up with myself

Posted by Gordon P. Hemsley on January 23, 2011

Oh, hello Internet. Long time, no see. (That is, if this is the only way you keep track of me. I’ve been tweeting a bit more than I blog.)

This post is basically to bring you up to speed on what’s been going on since my last post, back in July. (I never was a very good blogger, you know. This is actually pretty good for me.)

A lot has happened since then, actually.

First off, I’m no longer working with the Bespin folks—I’m not sure I ever mentioned that. Though I felt a bit guilty about it, I made the decision around the time of the Summit, and I wound up not spending a whole lot of time with them while I was there. (I was running out of things I could help with, anyway, with my JavaScript skills being as poor as they are.) During the Summit, it was announced that Bespin would be changing its name to Skywriter. It was a bit of an insider secret until it was officially announced a few months later, but that doesn’t even matter now. Mozilla decided to change direction slightly and focus more on developer tools as a whole. This decision eventually led up to what happened just the other day: Skywriter has merged with the Ajax.org Cloud9 Editor (ACE). This is the best of both worlds, as it puts the project in the hands of developers better equipped to take care of it, while also ensuring that the original Bespin/Skywriter work does not go to waste.

I also haven’t been much involved with Ubiquity since the release of 0.6. I do believe satyr continues to maintain it, but I don’t know if it will ever see another “official” release. (Satyr has always made snapshot releases directly from the repository, though.) It also doesn’t seem like Taskfox will emerge any time soon. It’s certainly not on the agenda (nobody’s working on it), and the new Panorama (formerly TabCandy) is the primary focus of Mitcho, Aza, and others. If all goes according to plan, that will likely be my favorite feature of Firefox 4. (Of course, by the time Firefox 4 comes out, I’ll probably be using Firefox 4.next. I’ve been running 4.0 nightlies for a while now. Probably ever since TabCandy was merged to trunk, now that I think about it.) So I spend some of my days bothering the folks in #tabcandy, complaining about things they usually already know about.

But I do try to make myself useful, too. I’ve attempted to increase my involvement with the Mozilla.org team, as at least there I have the relevant skillset. Unfortunately, it’s been somewhat slow-going. I spent a lot of time at the Summit chasing Reed around trying to get reviews. But Reed is always super busy—thus, I’m still waiting on those reviews. (And I’m not the only one.) So I’ve offered to try to help carry some of the load, in terms of reviewing patches for the Mozilla.org website(s). So, I finally applied for (albeit very limited) commit access—some 6 and a half years since I filed my first Mozilla-related bug. I faxed my Committer Agreement in about a week ago, and hopefully the rest will be handled in the next week or so. I’m quite excited to be able to make a contribution that’s more than removing unused variables or adding half-working tab support.

But my life, unfortunately, has not completely revolved around Mozilla in this past half a year. I finished another semester of school, and the final semester of my undergraduate career (well, the first one, at least) begins on the 31st. On June 2nd, I will finally have a Bachelor’s Degree—in Linguistics. What happens after that, I’m not sure. These past two months have been hectic, as I’ve been applying to graduate schools for linguistics. Though I continue to be torn as to whether I really want to spend the next five years doing more linguistics (what does one do with a Ph.D. in linguistics, besides more linguistics?), my biggest annoyance thus far has been the cost. Between the application fees, the GRE score fees, and transcript fees(!), this process has cost me hundreds and hundreds of dollars! (Oh, and for a procrastinator like me, having to rely on—and worry about—other people’s schedules has been very difficult. There’s no turning an application in the night before if you also need recommendation letters from three other people.)

On the bright side, I have been gathering a lot of linguistics-related ideas that I want to blog about. I haven’t yet figured out how I’m going to do that—some of them are not more than a couple of sentences, so I may spew a bunch out at a time. I’ve also gotten involved with a new project designed to bring linguistics to the masses, à la Scientific American or Popular Mechanics: Popular Linguistics Online. I’ll be writing some things for them, as well as helping out with some of the technical stuff behind the scenes. Everything is very much in the early stages over there, but there is an issue out already, so I encourage you to check it out!

P.S. Please forgive the overuse of the word “so”. It’s 4:30 in the morning.

Posted in Linguistics, Mozilla

wh-movement and T→C movement in English interrogatives

Posted by Gordon P. Hemsley on June 9, 2010

While I was doing my take-home syntax final exam (why do I feel like the modifier order is off in that phrase?) a couple of weeks ago, one of the questions got me thinking. The section of the exam was testing our knowledge of wh-movement and T→C movement in questions, and one particular sentence was giving me a little bit of trouble. To try to figure out where things were supposed to move to, I wound up creating what I’m calling a trace table. That is, a table comparing various related sentences and demonstrating the motivation for various movements. (It’s called a trace table because it allows for an easy comparison of the locations of the tracers and the tracees. And yes, I did just make up those words; and no, I didn’t bother to figure out which is which.)

The particular sentences I used for this trace table all had to do with a man, a cat, and the act of stealing.

I haven’t mentioned yet precisely what about the test question was giving me trouble. It was the fact that, in certain situations, T→C movement does not occur in interrogatives. (Questions in English are normally formed using T→C movement, otherwise known as subject–auxiliary inversion.) So, I decided to figure out exactly what that environment was. We’d previously (accidentally) referenced the situation in class, but we never went into detail. (Someone happened to ask about a sentence where T→C movement did not occur, and the instructor admitted that she’d been trying to avoid those sentences, so as to avoid overly complicating the lesson.) Beyond that, though, I don’t know what research, if any, has been done regarding these situations. (I assume there has been research, but my extremely brief search did not turn up any.)

Anyway, once the semester was over, I decided to formalize and prettify my trace table and put it up on the Web for all to see.

wh-movement and T→C movement in English interrogatives

The dedicated page goes into more detail, but what it seems to boil down to is this: T→C movement does not occur when there is a trace in the subject position (SpecTP) of the main clause.

I greatly encourage feedback about this, but please read the whole page first, as it has much explanation and background, as well as a more in-depth description of my conclusions. (And please pardon my extensive use of parentheticals in this post; I’m rather tired at the moment, and my brain is wandering all over the place.)

Posted in Linguistics, Web Development

“Patton and I”—Object or Subject?

Posted by Gordon P. Hemsley on February 2, 2010

[Note: I know I haven’t posted in a while. That’s the kind of relationship I have with my blogs. I also know that, when I do post, I post about computer stuff. But this is my blog, about my life. And my life also involves linguistics stuff. So here’s the first of what will likely be a number of posts relating to linguistics. If that bothers you… deal with it.]

Last night, while at the Grammys, Al Yankovic (you know, Weird Al) tweeted a picture with a caption that read:

Patton and I having a last-minute brawl before the show.

Knowing that he is usually as much of a stickler for grammar as I am (perhaps even more so), I tweeted to him:

@alyankovic Patton and *me*, Al. Come on! You know better!

I was hoping to get a response from him, but I instead got a response from Jacinta of New Hampshire. (Not having much evidence to go on, I’m going to assume this person is female for the remainder of this post.) Here’s what she said:

@GPHemsley sorry, but Al is right… it’s Patton and *I*

To that, I replied with:

@Jacinta716 Not it’s not. “Patton and I” is the object of the untensed sentence fragment, so it should be “Patton and me”. #linguistics

As an aside (and another attempt to get Al to weigh in), I also tweeted:

I may have just started a grammar war on Twitter about a simple caption for a photo @alyankovic took. #linguistics

After that, Jacinta really let me have it. She devoted five tweets in a row to supporting her claim:

@GPHemsley “Patton and I” is the SUBJECT of this sentence; Al is correct.

@GPHemsley You use the same pronoun as you would if you had a singular subject in the sentence.

@GPHemsley Patton is texting like a 12 year-old girl. I am texting like a 12 year-old girl. Patton and I are texting like 12 year-old girls.

(I realized later that this was referencing another tweet that Al had made afterwards.)

@GPHemsley “Patton and I” is not the object. If you said “someone is throwing incorrect grammar rules at Patton and me” then you’d be right.

@GPHemsley “Patton and I” is not part of a sentence fragment. Although this is. And so is this. Which is why Al is right. And you are not.

And then she added:

@GPHemsley I wouldn’t call it a war. It’s…an educated discussion. It’s a lot better than most of the crap that people put on Twitter! 😉

Originally, I started tweeting back to her:

@Jacinta716 “Patton and I” is not the subject of the sentence. The subject of the sentence is implied; it refers to the picture.

@Jacinta716 The difference with these examples is that they are tensed. In that last sentence, “Patton and I” is indeed correct.

I was going to attempt to diagram the sentence using bracket notation and go on to further support my claim. But when I went to phpSyntaxTree to diagram it for real, I realized I had a problem. The way I was diagramming it did indeed put “Patton and I” in the subject position of the subordinate sentence (which is still untensed):
[S [NP This] [Aux ] [VP [V is] [NP [NP a picture] [PP [P of] [S [NP Patton and me] [VP [V having] [NP a last-minute brawl]] [PP before the show]]]]]]
However, the point of view I was arguing was that the picture itself was the subject and “Patton and I/me” was the object. (I originally tweeted the sentence that included the implied part, but I later deleted it. I used that sentence in the diagram.)
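
The structural point can be checked mechanically by parsing the bracket notation. The following is a small Python sketch, unrelated to how phpSyntaxTree works internally; `parse` and `find` are hypothetical helper names, and `DIAGRAM` is the bracketed string from this post. It reads the notation into nested `(label, children)` tuples and then looks at the first constituent of the embedded S.

```python
import re

DIAGRAM = (
    "[S [NP This] [Aux ] [VP [V is] [NP [NP a picture] [PP [P of] "
    "[S [NP Patton and me] [VP [V having] [NP a last-minute brawl]] "
    "[PP before the show]]]]]]"
)


def parse(text):
    """Parse labelled bracket notation into (label, children) tuples."""
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word of leaf text
                pos += 1
        pos += 1  # consume "]"
        return (label, children)

    return node()


def find(tree, label):
    """Depth-first search for the first descendant node with this label."""
    for child in tree[1]:
        if isinstance(child, tuple):
            if child[0] == label:
                return child
            hit = find(child, label)
            if hit is not None:
                return hit
    return None


embedded = find(parse(DIAGRAM), "S")
print(embedded[1][0])  # first constituent of the embedded clause
```

In this diagram the first constituent of the embedded S is the NP “Patton and me”, which is the subject position of the untensed clause, even though the whole clause sits inside the PP headed by “of”.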

Now, here’s the problem. I still think I’m right in saying that “Patton and I/me” is the object of the sentence and that it should be “me”, not “I”. But right now I’m at a loss to explain why. It doesn’t help that my diagram doesn’t take full advantage of X-Bar Theory and its extensions/improvements (and, thus, uses ternary branching to attach an adjunct), nor that I haven’t drawn the semantic relationships between words. But I wanted to get this down in a format longer than 140 characters so that a proper discussion could be had.

So… Is “Patton and I/me” the object or the subject? Is it both? Is there a different implication that could be had that could change the answer to those questions? What is the grammar of picture captions, specifically, and sentence fragments, in general?

This post seems to raise more questions than it answers, but it’s quite likely that I’ve made a mistake somewhere in my diagram that would lead me down this path. Please correct me if you can. Otherwise, let the discussion begin!

Posted in Linguistics | 3 Comments


Posted by Gordon P. Hemsley on October 13, 2009

I know it’s been a while since I last wrote something here, but I wanted to pop my head in to make a long-overdue announcement. I’ve finally gotten myself an official, centralized place on the Internet: the aptly-named GPHemsley.org. (The .org part means that all donations are accepted—just don’t expect them to be tax deductible.)

I still haven’t gotten it to the point where it contains everything you might want to know about me, but my goal is to eventually make it a one-stop shop for everything I’ve ever done on the Internet. Ever. Right now, though, it just has a list of my blogs and notable papers I’ve written in my college (i.e. adult) career.

I felt this announcement was especially important to make now because there are two linguistics-related blogs writing posts about topics I’ve brought up, and I wouldn’t want to poop the party and have you find out about my new website from them. Perhaps I’ll have made more progress on my website by a week tomorrow.

Posted in Linguistics, Open Source, Web Development

My Foray into Mozilla Education

Posted by Gordon P. Hemsley on February 14, 2009

I’ve recently rebooted this blog based on a suggestion by David Humphrey (humph) of Mozilla Education (wiki) during a discussion in #education. I’ll now be using it to blog about my endeavors across the Internet related to software development (particularly the open source kind), as well as any other coding experiences I may have (including website development).

I’m excited to get involved with Mozilla Education, because that means I’ll be able to put my new Linguistics major to work (I’m currently attending the University of Vermont), while also building upon my 10 years of web development—not to mention being able to contribute to a community that I’ve been following and wanting to get involved with for those same 10 years. (I used Netscape 4 back in the day, was thrilled when Netscape 6 came out, soon switched to Mozilla Application Suite, and was finally convinced to switch to Firefox, where I’ve been ever since.)

In the coming days and weeks, I’ll be working with humph and others to decide where I’ll best fit. He suggested that I start with getting familiar with Ubiquity, so that I can perhaps help with the development of their natural language processing engine. He also mentioned the possibility of improving the tools that the localization team uses to translate Mozilla products into languages other than English, particularly via the Web. In the meantime, though, I have to brush up on my JavaScript, because it is an integral part of most Mozilla products, especially Ubiquity.

So I hope this will be a good experience for me, and I hope that I will be able to contribute something that other people will consider useful in the course of their using (or developing) Mozilla products.

Posted in Mozilla