One of the questions that comes up often regarding is:

How do you represent the timed transcript?

The representation of a transcript is probably one of the most contentious pieces of the Hyperaudio architecture and can often invoke the dreaded XML vs JSON debate.

What we actually use is HTML to represent the transcript. The way I think of it, is that we are creating Hypertranscript so why not use Hypertext. Of course Hypertext isn't HTML, as I swiftly found out when I met the inventor of Hypertext and Hypermedia, Ted Nelson at the Internet Archive earlier this month. Turns out Ted is not the world's biggest HTML fan, which I soon found out when I showed him what our transcripts looked like!

Thankfully he did have some good things to say about Hyperaudio and of course I have to slip it into a blog post somewhere:

It's delightful to see what you're doing at
--very much the same kind of
referential editing that Xanadu is based on.

Ted Nelson, The Inventor of Hypertext and Hypermedia

Format Freedom and Readability

Actually we're not using HTML for all transcript representations we also bring back timing in JSON from our API, something that collaborators like the BBC and Popup Archive are happy about, as they can use our data to help train their speech to text algorithms and JSON is a convenient format.

The thing I like about our HTML based transcripts is that if you add to a web page that doesn't include the Hyperaudio libs, it just displays - with all it's formatting intact.

Speaking of formatting, as long as you adhere to a few simple rules, you can format your transcript anyway you want.

Let's take a quick look at a simple example:

<a data-m="123456">Hello </a>
<a data-m="234567">world!</a>

The core piece of this is the data-m attribute, we plan to make our system work with any tag as long as you use the appropriate attribute.

So you might prefer

<span data-m="123456">Hello </span>
<span data-m="234567">world!</span>
or even
<i data-m="123456">Hello </i>
<i data-m="234567">world!</i>
Semantically things are left up to you.

The nice thing is, if you wish to format your transcript you are welcome to surround parts with <p>s or <div>s or whatever you want, it will still work.

One Time

In order to keep things nice and minimal, we decided early on that we would just expose the start time of the word. This doesn't mean we won't consider adding end time or duration, just that the Hyperaudio system will work with just a start time specified in the transcript.

A start time tells you just enough so that if you click on a word, you can go straight to the part of the media where it is spoken. When highlighting a block of text for sharing or editing however it may be desirable to know the end time. Usually we default to one second after the start-time of the last word. In the Hyperaudio Pad, we provide a trim function for adjusting the end time manually.


We use the new semantic tags that were introduced with HTML5 to add meaning and structure to our transcripts.

Image courtesy Michel Dumontier . Creative Commons Licensed

The whole transcript is an <article> which can contain a number of <section>s. An <article> also has a <header> and <footer> as does a <section>.

In the case of the original transcript the <article> contains only one <section> but in the case of a remix it can contain more than one. Transitions are also specified in <sections>s.

The <header>s will be used to wrap any information that would usually go in a header such as title, author, date-stamp whereas footers will be used to store credits.

We feel that crediting your source is so important that we aim to generate them automatically from the content of a mix and have them scroll on at the end by default.

Here's an example of a valid Hypertranscript we're using now (view source to see the structure).


I mentioned that the effects and transitions are stored in the hypertranscript, actually everything we need to know to interpret and playback the transcript is stored within the HTML. This makes a hypertranscript a truly portable document format - further, it makes the format readable by machine and human alike.

In the Hyperaudio Pad, instead of storing state adjacently to the transcript, the state is stored in the transcript, meaning when we cut, paste and move things around in the transcript, as long as we respect the structure, it all just works as it's interpreted each time you hit play.

Also a mix created from various transcripts is represented in the same format as a transcript, sections can hold effects as well as references to media and associated transcript snippets. Actually a source Hypertranscript can be thought of a mix containing one source, the complete transcript for that source and no effects.

Alternative Representations

People often suggest alternative representations from JSON, Microformats and even WebVTT. My gut feeling is that they won't be as flexible or readable, but what does my gut know? I'm really happy to discuss ways of improving our approach and alternative approaches to representing word time-aligned transcripts.

If you have any comments on this please reply to the Google Group topic I've started on the subject. Oh and if you want to be amongst the first to know when you can start using Hyperaudio in anger (we're nearly there!) drop your email address here and for general updates follow @hyperaud_io.

Mark Boas