Gordon P. Hemsley

Linguist by day. Web developer by night.

Linguistics and the Open Web

Posted by Gordon P. Hemsley on May 15, 2011

Last Saturday, I gave a talk entitled Linguistics and the Open Web at HULLS 2011, the first-annual Hunter [College] Undergraduate Linguistics and Language Studies Conference, organized by the Hunter College Linguistics Club. My talk kicked off the event, at which undergraduate students presented their research. (I later joked that it perhaps should have been called HULA, for “Hunter Undergraduate Linguistics and Activism”, because many of the student talks—including my own—were more of a call to action than particularly academic research… but I digress.) The keynote speakers were Doug Bigham and Ben Zimmer. Click through to the Linguistics Club site to see the full program of talks.

This talk was the first step in my attempt to somehow tie together my open-source/Mozilla life with the linguistics that I have really come to love over these past two years. I think it’s a good first step.

And it was nice to finally meet in person a lot of the people that I had previously known only through Twitter. And, actually, it’s good to know that my rate of meeting such people is increasing. I’ve only known most of them for about a year, as opposed to the 7 or so years it took to meet the Mozilla and phpBB folks last year.

My talk was allotted only 15 minutes, so it’s rather brief, but I think the slides I’ve made available get across much of what it covered. The talk tries to answer these three questions:

  • What is the Open Web?
  • How does the Open Web relate to linguistics?
  • What can I do to participate in the Open Web?

I hope to be able to expand and improve this talk in the future. (It already includes a separate print stylesheet, so if you want to print it out, it comes out pretty.)

In fact, I’ve released it under CC-BY-NC-SA and I plan to put it on GitHub or something so that maybe we could even get it translated into a bunch of languages! And if you want to give the talk yourself, feel free. (Just drop me a line to let me know.)

Your feedback is greatly appreciated!


Posted in Linguistics, Mozilla, Open Source

From Universal Subtitles to formant graphs

Posted by Gordon P. Hemsley on March 11, 2011

A few weeks ago, I watched Watson win on Jeopardy! and did a couple of Skype interviews for graduate school. It really got me thinking about machine learning and natural language processing, as I’ve been looking for a way to tie in my web development and programming skills with linguistics in a way that will benefit both fields, as well as “regular” people.

After coming up with and discarding a number of ideas, a shower thinking session led me to a realization: Universal Subtitles is, in effect, building up a corpus of matched video, audio, transcription, and translation! And, because it’s all open, you can remix and play with the data all you want. So, to do just that, I selected the most translated video (which also happened to be the Universal Subtitles introductory video) and began exploring. At first, I looked at a couple of the translations, thinking about doing something with automatic, statistical translation (which is pretty much what Google Translate does, I believe). But I have very little knowledge of that area, so I hit upon another idea: Extracting the text and the audio and matching it up in Praat—automatically.

To do this, I needed the timing information along with the text, and I needed to separate the audio from the video. Doing the latter wound up being the easy part, once I found OggSplit (a part of Ogg Video Tools). I just ran the open Ogg Video file through OggSplit and used Audacity to convert the resulting Ogg Audio file into a WAVE file for Praat to read. But the next part required some work. I remembered from a while back that Hixie and the WHATWG were working on an open subtitling standard, called WebSRT. When I went looking for that, I found that it had been renamed to WebVTT. Universal Subtitles exports its subtitles into many formats, but not yet to WebVTT. Luckily, the renaming of the WebSRT standard to WebVTT was mostly to avoid having to overcome some rather obscure processing issues—the majority of SRT files can become WebVTT files with very little effort. (Namely, adding a “WebVTT” header and converting commas to periods in the timestamps.) I then wrote a script which read the newly-minted WebVTT files and converted them to Praat TextGrids. (This involved “reverse-engineering” the format from TextGrids output by Praat, as I couldn’t find any documentation on the file format itself.)
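For the curious, the SRT-to-WebVTT step really is that small. Here is a rough sketch of it in Python (purely for illustration — the function name is invented, and real-world SRT files can need more care):

```python
import re

def srt_to_webvtt(srt_text):
    """Minimal SRT -> WebVTT conversion: prepend the "WEBVTT"
    signature line and change the decimal separator in timestamps
    from a comma to a period (00:00:01,000 -> 00:00:01.000)."""
    fixed = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + fixed
```

This only handles plain, unformatted cues — which, as noted above, covers the majority of SRT files.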

By the end of this, I had PHP code that could convert unformatted WebVTT files into single-tier TextGrids. I plan to open-source the code, but I can’t decide on a few things: (1) what to call it, (2) where to put it (SourceForge or GitHub), and (3) whether to separate the two out into different projects. (Naturally, the WebVTT parser and the TextGrid generator are in separate modules, but there remain some assumption-based dependencies between them.)
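The generation half boils down to something like this — a rough Python sketch (my actual code was PHP) of Praat’s long TextGrid text format, reconstructed from Praat’s own output. I’ve also filled un-subtitled gaps with empty intervals here, which my original script didn’t do:

```python
def intervals_to_textgrid(intervals, xmax):
    """Render (start, end, text) cues as a single-IntervalTier Praat
    TextGrid in the long text format, padding the stretches between
    cues with empty intervals so the tier covers 0..xmax."""
    filled, cursor = [], 0.0
    for start, end, text in sorted(intervals):
        if start > cursor:                  # un-subtitled dead space
            filled.append((cursor, start, ""))
        filled.append((start, end, text))
        cursor = end
    if cursor < xmax:
        filled.append((cursor, xmax, ""))
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        '        name = "subtitles"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(filled)}",
    ]
    for i, (start, end, text) in enumerate(filled, 1):
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{text}"',
        ]
    return "\n".join(lines) + "\n"
```

A single half-second cue in a two-second file comes out as three intervals: the gap before it, the cue itself, and the gap after it.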

Pairing the WAVE file and the TextGrid revealed some areas for improvement of both Praat and Universal Subtitles. As is inherent in the whole idea of subtitles, dead spaces in the audio (parts without speech) do not get subtitled; thus, the subsequently-generated TextGrid does not fully specify the sections of the tier. Praat doesn’t really like that, as a regular TextGrid specifies even the empty portions of the tier, having positive data from start to finish. Loading a TextGrid that has pieces missing does not make Praat happy, but it still works. (You can practically see the grimace on its face as it sucks it up and makes do.) On the flipside, when matching audio and subtitles up in Praat, you can see how imprecise crowd-sourced subtitling can be. Though it might seem OK when using the web interface to subtitle a video as it plays, when you see the text lined up next to the waveform, it becomes clear that things could be better aligned. (It wouldn’t be that big of a deal, but there doesn’t seem to be any “advanced” interface on Universal Subtitles. There’s not even a place to upload an existing subtitle format, so that you can make corrections locally and upload the new file back to Universal Subtitles.)

Once I had the text and audio lined up, I started extracting formant info from multiple instances of the word “video”. (It appears a total of 11 times in the introductory video, including once in the compound “videomakers” and once in the plural “videos”.) I later found out about AntConc, and the concept of concordance in general, which helped me identify other words and combinations of words that appear multiple times in the full text of the transcription. (AntConc contains an N-gram viewer, as well.) Long story short, I recorded the formants of the 11 instances of the word “video” and made them available in this Google Docs spreadsheet.

And that’s where the fun started. As you can see in that spreadsheet, I attempted to graph these data points so that I could look at them all together.
[Chart: Formant Frequencies of "video"]
Unfortunately, all of the spreadsheet programs I tried (basically everything besides Excel, because I don’t have that) could not graph the data points on top of each other, so that one could fully see the similarities and differences among the formant frequencies. I thought I was going to have to settle for that, but then Michael Newman reminded me that R Can Do Anything™. So I finally had my task to help me learn R.

After a great deal of fiddling around and reading the docs, to not much avail, I headed over to see if the folks in #R could help. It turns out they were more than willing to demonstrate how awesome R is; I am particularly indebted to mrflick, who answered all of my silly questions, one after another. (More importantly, and to my amazement, he had an answer for all of them, proving that R really can do anything.) With some more playing around, I was finally able to issue the right commands in the right order and come up with this:
[Chart: Formant Frequencies of "video" (all)]
To construct that graph, I still had to somewhat fudge the data by adding an artificial “meter” column so that each instance of the word turned out to be the same width on the graph, but at least R didn’t complain that I was Doing It Wrong™. I also excluded the two instances where the word wasn’t plain ol’ “video”, to avoid creating misleading patterns. I haven’t yet figured out how to automatically calculate the average wave/frequency patterns, but at least now the graph allows you to see them visually. (I suspect I would have to format my data differently to give R enough information to do that properly—right now it doesn’t know that the data is from nine separate instances of the word.)
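The “meter” fudge is really standing in for time normalization: resampling each token’s formant track onto a fixed number of evenly spaced points, so that tokens of different durations line up at the same width. A rough Python sketch of that idea (hypothetical function name; my actual processing was in R):

```python
def resample(track, n=10):
    """Linearly resample a list of (time, value) measurements onto
    n evenly spaced points (n >= 2) between the first and last
    timestamp, so tokens of different durations become comparable."""
    t0, t1 = track[0][0], track[-1][0]
    out = []
    for i in range(n):
        t = t0 + (t1 - t0) * i / (n - 1)
        # find the segment [ta, tb] containing t and interpolate
        for (ta, va), (tb, vb) in zip(track, track[1:]):
            if ta <= t <= tb:
                frac = 0.0 if tb == ta else (t - ta) / (tb - ta)
                out.append(va + frac * (vb - va))
                break
    return out
```

For example, `resample([(0.0, 100.0), (1.0, 200.0)], n=3)` gives `[100.0, 150.0, 200.0]` — every token ends up with the same number of points, which is exactly the extra information R would need to average across instances.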

To top things off, the R script that I used to generate the data is available as a gist on GitHub (released under CC-BY-SA), so feel free to remix the data or improve the script. Just be sure to leave me a comment here telling me what you did! Also, any ideas on where to go from here are welcome!

Posted in Linguistics, Mozilla, Open Source, SourceForge, Web Development

My thoughts on Standard English

Posted by Gordon P. Hemsley on February 27, 2011

This post began as a comment on Facebook in response to Clark Whelton’s What Happens in Vagueness Stays in Vagueness and a follow-up to Language Log’s rebuttal, The curious specificity of speechwriters. But it quickly evolved into something that begins to express my feelings about the whole issue of Standard English and how it is taught in (U.S.) schools.

I was discussing these two articles with one of my former English teachers. While conceding that there is no set year (neither the 1985 that Whelton claims nor the 1977 that happens to come up in the LL research) in which the “decline” in writing began, he argued that the decline was indeed upon us. In particular, he claimed that “the idea of the decline in precise, informative, and effectively communicated language is well-founded”. I agreed, but only by reiterating what Mark Liberman wrote on LL:

In fact, the narratives of real children are typically full of detail. The use of appropriate summarizing abstractions develops later, as I understand it; and the ability to speak …at length without saying anything concrete at all is mastered fully only by mature politicians and their speechwriters.

I was claiming that language use is more detailed when we’re children, and it gets more and more simplified as we grow older, as body language and other supra-linguistic cues come into play.

My former English teacher sees the supposed increase of speaking (and writing) without saying anything as being “exacerbated by the gradual infusion of more and more text/IM language into our daily discourse”.

These are the points I brought up in response:

(1) With the advent of the Internet, young people are writing more every day than they used to. Writing has become an integral part of all facets of life, not just the “educated” parts. The difference is, Standard English is not always used online. Because it is not required to be. There are no grades for your writing online. There is nothing stopping you from just pounding the keyboard with your fist and publishing “cfgghjkhgfdjhyhendhxcb” for all the world to see. But that leads to my second point:

(2) Life has become less and less formal over the years. Equality and civil rights have improved, and there is less reason to worry about oppression for who you are or what you say. Thus, the distinction between spoken English and written English, which was once starkly contrasting, has been greatly diminished. People write what they speak, the way they speak. There is no longer any artificial restriction on the process, no translation necessary.

(3) And then there’s the education system. The way English and writing are taught in schools (and I’m not by any means singling any particular teacher out), it’s as if these arbitrary rules are still the only thing out there. The five-paragraph essay, for example. Or the idea that you somehow have to use big words and long sentences to get your point across. Not ending a sentence with a preposition. The list goes on. These are all things that are taught in school, implicitly or explicitly, as if they are the be-all, end-all way to write. Thus, the idea of Standard English that a student has in their head by the time they reach high school and college is so completely skewed from what good writing is that it all becomes incoherent.

So time should not be spent blaming students for their poor writing. It should be spent reforming the system, one step at a time, to eliminate the inconsistencies, dispel the myths, and create better writers. And it should do so by working with the reality that students experience outside of school, not against it, as if school was some alternate dimension where all the rules are different. (This, incidentally, was also the argument behind the ill-fated “Ebonics” debate way-back-when. Teach students in the language and life they already know, and they’ll be better able to see—and take advantage of—the connection between that life and what you’re trying to teach them.)

Then maybe we can knock off all this complaining about bad writing and get back to actually producing something worth reading for a change.

Note: This post was more or less complete by the evening of February 27, but I didn’t get around to publishing it until March 6.

Posted in Linguistics

You may not read this post about cake and ice cream

Posted by Gordon P. Hemsley on February 15, 2011

Let me start by saying that I’ve had it up to here with logic. I learned it just fine during Math A back in 9th grade, and the “Discrete Mathematics for Computer Science” course (a requirement for my ill-fated computational linguistics minor) I took last year was plenty of a refresher. Now I’m taking two logic courses this semester, and I’m relearning the same thing—twice! Well, actually, one of them is supposed to be Semantics and Pragmatics, but it’s pretty much all been logic up until this point. We don’t actually have a Semantics textbook; we’re using Logic in Linguistics. But I’m also taking Modern Logic (PHIL 109), at the insistence of my “advisor”. And that’s where my story starts.

Unfortunately, perhaps, this Modern Logic course is a philosophy course, rather than a linguistics or computer science course. That means it’s run out of the philosophy department and, in my case, taught by a philosophy grad student. This particular grad student has rubbed me the wrong way since the first day of class, as he puts on an air of “I know everything and you’re all idiots”. (Unless, of course, he’s saying something about philosophy or philosophers; then you’re expected to know what he’s talking about. Which I don’t.) But there are at least a couple of students in the class (besides me) who have brains that function just fine and are able to understand and discuss the subject matter we are learning. And the discussion does come up—though it doesn’t last.

And that leads me to my story. Today, we had a quiz in class. Once the quiz ended, we went over it as a group. The first question on the quiz had you create, basically, this:

  1. You have cake or ice cream for dessert.
  2. If you have cake for dessert, you may not have ice cream.
  3. You do not have ice cream.
  4. You have cake.

The second question had you translate that into formal notation. With P = “You have cake for dessert” and Q = “You have ice cream for dessert”, it comes out like this:

  1. P∨Q
  2. P→¬Q
  3. ¬Q
  4. P

All of this is fine and dandy, and I wasn’t arguing with it. (It was, after all, what I put on my quiz.) However, when the instructor wrote the English version of (2) on the board, he wrote “cannot” instead of “may not”. I don’t remember who commented on the difference, but I raised my hand and stated that it would have been better (i.e. less ambiguous) if “cannot” had been written on the quiz instead of “may not”, because “may not” has the possibility of being interpreted in two different ways. One is the way he intended, specifically ¬Q. But the other is one that would have made (2) essentially equivalent to (1). That is, because of “may”, there would have still remained the possibility of having ice cream for dessert, even if you had cake for dessert. Formally, there would be a possibility, however remote, to translate English (2) into P→(Q∨¬Q).
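The difference between the two readings is easy to verify mechanically. Under the strict reading, premise (2) rules out having both cake and ice cream; under the loose reading, P→(Q∨¬Q) is a tautology and rules out nothing at all. A quick truth-table check (purely illustrative Python):

```python
from itertools import product

def implies(a, b):
    # Material conditional: a -> b is false only when a is true and b is false.
    return (not a) or b

rows = list(product([False, True], repeat=2))  # every (P, Q) assignment

# Strict reading of "you may not have ice cream": P -> not-Q
strict = [(p, q) for p, q in rows if implies(p, not q)]

# Loose reading: P -> (Q or not-Q), a tautology
loose = [(p, q) for p, q in rows if implies(p, q or not q)]

print(len(strict))  # 3 rows survive: the strict reading excludes P-and-Q
print(len(loose))   # 4 rows survive: the loose reading excludes nothing
```

The one assignment the two readings disagree on is exactly cake-and-ice-cream: forbidden under the strict reading, permitted under the loose one.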

He didn’t particularly care for this assertion. This surprised me, as I was taking it for granted—the reason I raised my hand was to point it out and suggest that “cannot” was the better option; I didn’t expect him to disagree that the possibility existed. But he did. And he was quite firm about it. He wasn’t at all open to the possibility that I could be right. He didn’t even consider it, except to try to explain to me how I was wrong.

But I knew I was right. I’ve had plenty of experience in this area. In addition to my pet peeve of people writing “can not” when they mean “cannot” (they don’t mean the same thing!), words such as MAY, SHOULD, and MUST are important in the world of Internet standards. Most standards out there today, including those by the W3C, start off by saying that they intend to use such words as defined by RFC 2119 (BCP 14). Here’s what it says for the definition of MAY:

MAY   This word, or the adjective "OPTIONAL", mean that an item is
truly optional.  One vendor may choose to include the item because a
particular marketplace requires it or because the vendor feels that
it enhances the product while another vendor may omit the same item.
An implementation which does not include a particular option MUST be
prepared to interoperate with another implementation which does
include the option, though perhaps with reduced functionality. In the
same vein an implementation which does include a particular option
MUST be prepared to interoperate with another implementation which
does not include the option (except, of course, for the feature the
option provides.)

Thus, the word “may” indicates that something is optional. In the case of “may not have ice cream”, it means that, according to the standard, it is optional for you to not have ice cream. When combined with the common meaning of “may not” that forbids, you have two possible options: [ [ may ] [ not have ice cream ] ] or [ [ may not ] [ have ice cream ] ]. The former means that there is a possibility that you will have ice cream (and also that you will not have ice cream); the latter means you are forbidden from having ice cream.

To combine this issue with the “cannot” vs. “can not” one, you have these four options:

  1. You can have ice cream.
  2. You cannot have ice cream.
  3. You can not have ice cream.
  4. You cannot not have ice cream.

Of these, sentence (1) means you are allowed to (or have the ability to) have ice cream; (2) means you are not allowed to (or do not have the ability to) have ice cream. Sentence (3) is letting you know that no one is forcing you to have ice cream; sentence (4) tells you the exact opposite (or someone is trying to express that they really want you to have ice cream). Thus, sentence (2) here corresponds with the colloquial usage of “may not” that forbids (¬Q); sentence (3) corresponds with the other possibility I suggested (P→(Q∨¬Q)).

My raising of this matter caused quite a twitter on Twitter among my colleagues, and Twitoaster has attempted to keep track of it for you here. As my tweeps have noted, this is an issue of both prosody and scope.

This type of ambiguity and confusion has gotten other people in trouble before, too. Take a look at Ben Zimmer’s On Language column about the issue, as well as some corresponding Language Log articles here and here.

Posted in Linguistics

Catching up with myself

Posted by Gordon P. Hemsley on January 23, 2011

Oh, hello Internet. Long time, no see. (That is, if this is the only way you keep track of me. I’ve been tweeting a bit more than I blog.)

This post is basically to bring you up to speed on what’s been going on since my last post, back in July. (I never was a very good blogger, you know. This is actually pretty good for me.)

A lot has happened since then, actually.

First off, I’m no longer working with the Bespin folks—I’m not sure I ever mentioned that. Though I felt a bit guilty about it, I made the decision around the time of the Summit, and I wound up not spending a whole lot of time with them while I was there. (I was running out of things I could help with, anyway, with my JavaScript skills being as poor as they are.) During the Summit, it was announced that Bespin would be changing its name to Skywriter. It was a bit of an insider secret until it was officially announced a few months later, but that doesn’t even matter now. Mozilla decided to change direction slightly and focus more on developer tools as a whole. This decision eventually led up to what happened just the other day: Skywriter has merged with the Ajax.org Cloud9 Editor (ACE). This is the best of both worlds, as it puts the project in the hands of developers better equipped to take care of it, while also ensuring that the original Bespin/Skywriter work does not go to waste.

I also haven’t been much involved with Ubiquity since the release of 0.6. I do believe satyr continues to maintain it, but I don’t know if it will ever see another “official” release. (Satyr has always made snapshot releases directly from the repository, though.) It also doesn’t seem like Taskfox will emerge any time soon. It’s certainly not on the agenda (nobody’s working on it), and the new Panorama (formerly TabCandy) is the primary focus of Mitcho, Aza, and others. If all goes according to plan, that will likely be my favorite feature of Firefox 4. (Of course, by the time Firefox 4 comes out, I’ll probably be using Firefox 4.next. I’ve been running 4.0 nightlies for a while now. Probably ever since TabCandy was merged to trunk, now that I think about it.) So I spend some of my days bothering the folks in #tabcandy, complaining about things they usually already know about.

But I do try to make myself useful, too. I’ve attempted to increase my involvement with the Mozilla.org team, as at least there I have the relevant skillset. Unfortunately, it’s been somewhat slow-going. I spent a lot of time at the Summit chasing Reed around trying to get reviews. But Reed is always super busy—thus, I’m still waiting on those reviews. (And I’m not the only one.) So I’ve offered to try to help carry some of the load, in terms of reviewing patches for the Mozilla.org website(s). So, I finally applied for (albeit very limited) commit access—some 6 and a half years since I filed my first Mozilla-related bug. I faxed my Committer Agreement in about a week ago, and hopefully the rest will be handled in the next week or so. I’m quite excited to be able to make a contribution that’s more than removing unused variables or adding half-working tab support.

But my life, unfortunately, has not completely revolved around Mozilla in this past half a year. I finished another semester of school, and the final semester of my undergraduate career (well, the first one, at least) begins on the 31st. On June 2nd, I will finally have a Bachelor’s Degree—in Linguistics. What happens after that, I’m not sure. These past two months have been hectic, as I’ve been applying to graduate schools for linguistics. Though I continue to be torn as to whether I really want to spend the next five years doing more linguistics (what does one do with a Ph.D. in linguistics, besides more linguistics?), my biggest annoyance thus far has been the cost. Between the application fees, the GRE score fees, and transcript fees(!), this process has cost me hundreds and hundreds of dollars! (Oh, and for a procrastinator like me, having to rely on—and worry about—other people’s schedules has been very difficult. There’s no turning an application in the night before if you also need recommendation letters from three other people.)

On the bright side, I have been gathering a lot of linguistics-related ideas that I want to blog about. I haven’t yet figured out how I’m going to do that—some of them are not more than a couple of sentences, so I may spew a bunch out at a time. I’ve also gotten involved with a new project designed to bring linguistics to the masses, à la Scientific American or Popular Mechanics: Popular Linguistics Online. I’ll be writing some things for them, as well as helping out with some of the technical stuff behind the scenes. Everything is very much in the early stages over there, but there is an issue out already, so I encourage you to check it out!

P.S. Please forgive the overuse of the word “so”. It’s 4:30 in the morning.

Posted in Linguistics, Mozilla

Ubiquity 0.6 Released!

Posted by Gordon P. Hemsley on July 21, 2010

About a week and a half ago, the Ubiquity team (I’m the one in red) had a little meeting at the 2010 Mozilla Summit and we discussed the past, present, and future of Ubiquity.

One of the main goals of this meeting, in my mind, was to get a new release of Ubiquity out, so that the greater masses could be exposed to all the wonderful work satyr has been doing over the past many months. After being reprimanded by the hotel staff no less than twice, we finally were able to get down and discuss the logistics of that. We had a couple of issues to deal with. For one, there were still a number of users on the 0.1.x branch of Ubiquity, despite the 0.5.x branch being available for quite a while, and the reasons for this included a lot of backwards compatibility issues: the 0.5.x branch used a new parser that could break some of the older, 0.1.x commands; the 0.5.x branch didn’t properly support Firefox 3.6; etc. And there was also the issue of the 0.5.x branch never being released on AMO, leaving many users unaware that it even existed.

So, originally, the idea was to release satyr’s code as 0.5.5—simply an extension of the 0.5.x branch. However, a number of people on the team felt it best to bump the version up to 0.6, and I didn’t disagree, given the aforementioned issues. And since we were attempting to clean the slate as best as possible with regard to backwards compatibility, I also suggested that we bump the minVersion up to Firefox 3.6 for Ubiquity 0.6, from 3.5 (which probably still works). This had the added benefit of allowing people stuck on Firefox 3.5 to keep plodding happily along with the 0.1.x branch (which has now—finally—been discontinued).

Before I continue, let me just point you to Ubiquity 0.6 on AMO so that you can download if you don’t already have it.

If you allow me to briefly jump ahead a bit, it was soon discovered that Jono (who had access to the AMO account, and who was charged with packaging the release) could not remember his Hg password. I don’t know if that has since been rectified, but the bottom line was that all the changes he made in order to package up the release could not be committed to the Ubiquity repository. So that left the repo and the released 0.6 package as differing from each other. (I think satyr has mostly restored those changes to the repo, but that was only within the past few days.) So releasing Ubiquity 0.6 was quite the event—and I haven’t even mentioned the fact that we completely sprung it on satyr! (I’d told him a few weeks earlier that I was gunning for it to happen, but he had no idea the meeting was even going down.)

Now back to the meeting, where we also discussed the future of Ubiquity. One of the foremost targets, I think, would be rewriting Ubiquity as a JetPack (or at least with a JetPack wrapper). That would allow much more uniformity across the Ubiquity codebase, as well as give Ubiquity access to all JetPack has to offer. Mitcho and cers attempted to take the first step towards that goal (that being the wrapper) during the JetPack Hack-A-Thon at the Summit, but ran out of time. So that work still needs to be done.

At the meeting, we also discussed resurrecting the effort to make Ubiquity more ubiquitous (Aza’s pun) by getting it incorporated into Firefox as Taskfox. I don’t recall what the first steps for getting that done are, but I think it’d be a worthy task.

So, at the end of the meeting, I (and, I think, the others) came out seeing the future of Ubiquity as brighter than we previously thought. All we need now are some brilliant, dedicated developers to make it happen. Unfortunately, many of said developers are spending their time with more high-priority tasks: Jono is working on Test Pilot; Mitcho and Aza are working on TabCandy; Atul and cers are working on JetPack. And these are all extremely worthy tasks. But if you want to help out with Ubiquity, don’t hesitate to drop by the #ubiquity channel on the Mozilla IRC server!

Posted in Linguistics, Mozilla

Do you use Ubiquity?

Posted by Gordon P. Hemsley on June 16, 2010

As you may or may not know, Ubiquity is officially “on hiatus”. That means that the official Mozilla Labs team is not working on it at the moment. Unfortunately, when they made that decision, the latest released version of Ubiquity (0.5.4) was not compatible with Firefox 3.6.

Luckily, community member Satyr Murky (satyr) decided to keep maintaining Ubiquity (all alone!) and was able to bring it to a state where it works in Firefox 3.6 and even the latest trunk builds off mozilla-central (mostly). Satyr also fixed a number of bugs that were present, beyond support for the latest versions of Firefox. Unfortunately, none of Satyr’s fixes have been made officially: Ubiquity has been wallowing in dev-only land in an Hg repository, downloadable only from a BitBucket attachment.

But now Ubiquity 0.5.5 is just about ready (see bug 528417), and I’d like to see it get released. Who’s with me?

Do you use Ubiquity? Which version? (The older 0.1.x line works fine on Firefox 3.6—did you downgrade your Ubiquity?) Did you know about the developmental version? (Your add-on updater didn’t tell you about it, after all.) Or were you too scared to install it? Let me know in the comments.

Posted in Mozilla

wh-movement and T→C movement in English interrogatives

Posted by Gordon P. Hemsley on June 9, 2010

While I was doing my take-home syntax final exam (why do I feel like the modifier order is off in that phrase?) a couple of weeks ago, one of the questions got me thinking. The section of the exam was testing our knowledge of wh-movement and T→C movement in questions, and one particular sentence was giving me a little bit of trouble. To try to figure out where things were supposed to move to, I wound up creating what I’m calling a trace table. That is, a table comparing various related sentences and demonstrating the motivation for various movements. (It’s called a trace table because it allows for an easy comparison of the locations of the tracers and the tracees. And yes, I did just make up those words; and no, I didn’t bother to figure out which is which.)

The particular sentences I used for this trace table all had to do with a man, a cat, and the act of stealing.

I haven’t mentioned yet precisely what about the test question was giving me trouble. It was the fact that, in certain situations, T→C movement does not occur in interrogatives. (Questions in English are normally formed using T→C movement, otherwise known as subject–auxiliary inversion.) So, I decided to figure out exactly what that environment was. We’d (accidentally) referenced the situation in class before, but we never went into detail. (Someone happened to ask about a sentence where T→C movement did not occur, and the instructor admitted that she’d been trying to avoid those sentences, so as to avoid overly complicating the lesson.) Beyond that, though, I don’t know what research, if any, has been done regarding these situations. (I assume there has been research, but my extremely brief search did not turn up any.)

Anyway, once the semester was over, I decided to formalize and prettify my trace table and put it up on the Web for all to see.

wh-movement and T→C movement in English interrogatives

The dedicated page goes into more detail, but what it seems to boil down to is this: T→C movement does not occur when there is a trace in the subject position (SpecTP) of the main clause. (Compare “What did the man steal?”, where did moves to C, with “Who stole the cat?”, where the wh-subject leaves a trace in SpecTP and no inversion occurs.)

I greatly encourage feedback about this, but please read the whole page first, as it has much explanation and background, as well as a more in-depth description of my conclusions. (And please pardon my extensive use of parentheticals in this post; I’m rather tired at the moment, and my brain is wandering all over the place.)

Posted in Linguistics, Web Development

Calling all HTML5 and Bugzilla enthusiasts!

Posted by Gordon P. Hemsley on February 20, 2010

Earlier in the week, I went through the process of filing and fixing bugs 546338 and 546340, both related to fixing <a name> problems in Bugzilla. Once that was successful, I got the idea to do a major overhaul of the Bugzilla templates in order to upgrade them from HTML4 code to HTML5 code (sans presentational markup, which Bugzilla has a ton of). I’ve filed bugs 546838, 547171, 546353, 547311, and 547389 for this purpose.

After spending a few days attempting to accomplish something, under the very helpful and reassuring guidance of Max Kanat-Alexander, I realized that it was a bit much for one person to take on. The sheer number of instances of presentational markup (and I only got so far as looking at @align, @cellspacing, and @cellpadding) is quite overwhelming.

But then I thought: This would be a perfect series of bugs for ‘student-project’; that is, the keyword used to attract open source students to specific bugs that they can tackle during a semester. If we can get a group of students together, along with myself and Max, we could probably accomplish this much more quickly.

If you’re interested in helping out, or you know a student who may fit that description, drop by #mozwebtools on irc.mozilla.org and ping GPHemsley or mkanat.

Posted in Mozilla, Web Development

PHP, MySQL, and the BIT field type

Posted by Gordon P. Hemsley on February 8, 2010

As Dave Humphrey once taught me:

When you do a search, and it comes back with no results, it’s a sign that you need to write something.

This is an issue that I came across while testing SASHA (which is available for you to try out, by the way), and I didn’t know if it was a bug or a feature. I could find no mention of it anywhere, and the people in the #mysql IRC channel on FreeNode weren’t much help in getting to the bottom of it.

What is the issue, you ask? Well, even that in and of itself is a question, because I don’t know whether it’s a bug (or feature) in PHP or MySQL. However, I’m inclined to think it’s the latter, and I’ll get to why in a moment.

But first, some background. The table that SASHA uses to store schedules uses the BIT field type for keeping track of which days of the week a schedule occurs on. I figured it’d be easiest to use a 7-bit field and just flip a bit for each day of the week. And that worked fine for me on my local test server. But then I had a colleague test SASHA out on his test server, and things went a little wacky.
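The one-bit-per-day idea can be sketched like so. (This is a hypothetical illustration, not SASHA’s actual code; the variable names and the bit order are my own.)

```php
<?php
// One bit per day of the week, with Sunday in the least significant
// position. (Hypothetical constants; SASHA's actual bit order may differ.)
$DAY_SUN = 1 << 0; // 0000001
$DAY_MON = 1 << 1; // 0000010
$DAY_TUE = 1 << 2; // 0000100
$DAY_WED = 1 << 3; // 0001000
$DAY_THU = 1 << 4; // 0010000
$DAY_FRI = 1 << 5; // 0100000
$DAY_SAT = 1 << 6; // 1000000

// A Monday/Wednesday/Friday schedule:
$schedule = $DAY_MON | $DAY_WED | $DAY_FRI; // 0101010 binary = 42 decimal

// Does the schedule include Wednesday?
$meets_wednesday = ( $schedule & $DAY_WED ) != 0; // true
```

The whole week then fits neatly in a single 7-bit value, which is exactly what a BIT(7) column stores.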

It took a little while to figure out what was causing our problem, and we finally got to the bottom of it: I was using MySQL 5.0 and he was using MySQL 5.1! Apparently, between 5.0 and 5.1, the return format of a BIT field changed from the literal binary data (output in the browser as a character, because the browser didn’t know it wasn’t) to a decimal representation of that data.

The first problem with that was that I had no idea there was a possibility of getting anything but the raw binary data I was getting on my server. The second problem was coming up with a straightforward solution for detecting whether the database was feeding us raw binary data or converted decimal data. There was no direct way to do this, but I figured out the next best thing. A simple way to check what kind of data we’re getting is to find out whether it converts cleanly to an actual character. Here’s an excerpt from SASHA that demonstrates:

// MySQL 5.0 returns bit as binary, while MySQL 5.1 returns decimal
if( $days == chr( ord( $days ) ) ) {
	$input = 'binary';
} else {
	$input = 'decimal';
}

That seems to do the trick when it comes to handling unpredictable BIT field data.
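Building on that check, one way to smooth over the version difference is to normalize either representation to a plain integer before doing any bit arithmetic. This is only a sketch under the same assumptions, not SASHA’s actual code, and the function name is made up:

```php
<?php
// Normalize a BIT column value to an integer bitmask, whether MySQL
// handed us a raw binary byte (5.0) or a decimal string (5.1).
// (Hypothetical helper for illustration.)
function normalize_bit_days( $days )
{
	if ( $days === chr( ord( $days ) ) ) {
		// Raw binary: the byte's value is the bitmask.
		// Caveat: like the check above, this misreads a single-digit
		// decimal string (e.g. "5") as binary, since chr(ord("5")) is "5".
		return ord( $days );
	}

	// Decimal string: a plain cast does the job.
	return (int) $days;
}
```

With the value normalized up front, the rest of the code can test and flip day bits without caring which MySQL version produced the data.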

(Again, I can’t guarantee that this isn’t actually a PHP issue, but I seem to recall us both being around the same version of PHP.)

If you have any insight into the matter, please do leave a comment.

Posted in Mozilla, Open Source, SourceForge