Rick's Tech Talk

All Tech Talk, Most of the Time

Hiding HTML Using Data URIs and base64

I've been spending the past few evenings noodling with the HTML5 <canvas> element. I often bring up the HTML Canvas 2D Context Working Draft specification. The spec shows what functions are available to the canvas' context, and the examples are instructive. However, the spec also introduced me to a method for "hiding" HTML text directly inside another HTML file, using the "data URI" technique.

In the spec, one of the examples shows a very nice graphics hack to display pretty glowing lines.


The spec does show the example "code" that renders this display. However, if you look at the HTML source of the spec page (or of this page, for that matter), you'll see a link whose markup begins like this:

<a href="data:text/html;charset=utf-8;base64,PCFET0NUWVBFIEhUTUw%2BDQo8aHRtbCBs

"Whoa," I thought. "Here's a way to mask HTML!"

Intrigued, I learned that a "data URI" (NOTE: the URI is the string that the "href" attribute is set to) can carry a resource of any MIME type inline. In this case, the media type is "text/html" with the parameter "charset=utf-8", which means that the encoded data is an HTML page (using the UTF-8 character set). The same technique can also be used for embedding small images such as logos. Specifying "base64" says that the data after the comma is encoded in Base64.
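To make the parts of a data URI concrete, here is a minimal Python sketch that builds one from a snippet of HTML and splits it back into its pieces. The function names make_data_uri and split_data_uri are hypothetical, and percent-encoding the whole payload is an assumption on my part; the only thing the link above confirms is that "+" appears as "%2B".

```python
import base64
from urllib.parse import quote


def make_data_uri(html: str, charset: str = "utf-8") -> str:
    """Build a base64 data URI for an HTML string (illustrative helper)."""
    payload = base64.b64encode(html.encode(charset)).decode("ascii")
    # Assumption: percent-encode every reserved character in the payload.
    # This turns '+' into %2B, matching what the link in the spec shows.
    return f"data:text/html;charset={charset};base64,{quote(payload, safe='')}"


def split_data_uri(uri: str):
    """Split a data URI into (media type with parameters, is_base64, payload)."""
    header, _, payload = uri.partition(",")
    fields = header[len("data:"):].split(";")
    is_base64 = fields[-1] == "base64"
    media = ";".join(fields[:-1] if is_base64 else fields)
    return media, is_base64, payload


if __name__ == "__main__":
    # The first few bytes of the page hidden in the link quoted above.
    uri = make_data_uri("<!DOCTYPE HTML>\r\n<html l")
    print(uri)
    print(split_data_uri(uri))
```

Run on the first few characters of the hidden page, this should reproduce the start of the href quoted earlier.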

Because I knew that the base64 data in the above link was just HTML, I spent a little time trying to find out how to translate it back into HTML. I stumbled on a page that showed how to decode base64 back into readable text on the UNIX command line. The page offered some one-liners in Perl, but the comments yielded the technique I ended up using: the base64 command-line utility.

Since I have Cygwin, I already have this base64 utility. I saved the base64 data into a text file (keeping it all on one line), and ran it as follows:

base64 -d < base64stuff > base64stuff.txt

The "-d" switch says to decode the data. I did get hung up on some partially decoded files. After looking at the output of a known-good encoding, I saw that the base64 data in the "pretty glowing lines" code contained plus signs (+). These were hard to spot because they were percent-encoded (URL-encoded) as "%2B", and "%" is not a valid base64 character. The partial output suggested that the decoder was choking somewhere in the data. I ended up replacing every "%2B" in the file with the + sign.
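The same fix can be sketched in Python: undo the percent-encoding first, then decode the base64. This is only an illustration of the idea, not what I actually ran; decode_payload is a hypothetical name, and the sample string is just the visible start of the link quoted earlier.

```python
import base64
import binascii
from urllib.parse import unquote


def decode_payload(payload: str) -> bytes:
    """Decode the part of a base64 data URI that follows the comma."""
    # Undo percent-encoding (%2B -> +), then drop any whitespace left
    # over from line wrapping when the data was copied out of the page.
    cleaned = "".join(unquote(payload).split())
    # validate=True rejects stray characters instead of skipping them.
    return base64.b64decode(cleaned, validate=True)


if __name__ == "__main__":
    fragment = "PCFET0NUWVBFIEhUTUw%2BDQo8aHRtbCBs"

    # Decoding the raw fragment fails because of the literal '%'.
    try:
        base64.b64decode(fragment, validate=True)
    except binascii.Error as err:
        print("raw fragment rejected:", err)

    # After unquoting, the start of the hidden HTML page appears.
    print(decode_payload(fragment))
```

Unquoting the whole payload, rather than replacing only "%2B", also covers any other percent-encoded characters that might show up in the data.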

I did think that this base64 data URI would be a great way to "hide" the HTML (at least from non-technical people). However, while writing this post, I saw that the page the browser opens from the base64 data is already decoded! Just view the source of the "pretty glowing lines" demo page, and what you see is regular human-readable HTML. So much for trying to be secretive with data URIs!

It does lead me to ask why they (the World Wide Web Consortium) would even use this technique to begin with. The only thing I could come up with was that they wanted to keep the example with the specification. That's a good reason: anytime they send around the HTML of the spec, they send along the examples as well. Convenient!



Very interesting stuff! I saw this for the first time in a phishing e-mail I received last week (one of the most convincing I've seen). It had the entire source of a large banking account page embedded in a link, so there was no end point to be taken down (except where the form would post) and the address in the browser looked fairly innocent.