Images in text files

The latest step in Project AntiShaun is to get it to work with images. There are plenty of ways to do this. Unfortunately, most of them don’t work.

The way images are represented in the OpenDocument format is complicated. You have to change something like eight things just because there's a picture in there, and some of those are counts of how many pictures. I'm hoping to figure out how many of these things I have to do myself and how many I can depend on LibreOffice to figure out on its own. However, simply unzipping and rezipping just doesn't work, because of the mimetype file. (Silly OpenDocument spec. In fairness, these things were not meant to be unzipped and messed with.)
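For the record, the gotcha is that the ODF spec requires the mimetype file to be the very first entry in the archive, stored without compression, so a naive rezip produces a file some tools will refuse to open. Here's a sketch of repacking an unzipped .odt correctly, in Python (the project itself is C#, but the idea is the same; the paths are hypothetical):

```python
import os
import zipfile

def repack_odt(src_dir: str, dest: str) -> None:
    """Zip an unpacked .odt directory back up. The ODF spec requires
    'mimetype' to be the first entry, stored uncompressed.
    (dest should live outside src_dir, or it would zip itself.)"""
    with zipfile.ZipFile(dest, "w") as zf:
        # mimetype first, with no compression
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        # then everything else, compressed as usual
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                arcname = os.path.relpath(path, src_dir)
                if arcname == "mimetype":
                    continue
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

That first-entry rule is exactly why a generic zip tool's output can break: most of them compress everything and make no ordering promises.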

I’ll get it working though. I’m making progress, just slowly.

Code coverage

Unit tests are important for having a good application. They tell you that your code works. I'm trying to achieve 100% code coverage, which means that my tests exercise all of my code. I'm using a tool called dotCover. It's surprisingly intelligent: it knows not just which parts of my code are executed, but by which tests.


I was testing one class at a time, but some classes' methods call into other classes. I was initially surprised when testing a single class ended up covering about 25% of my code. I'd also like to track that coverage separately, so that I know which classes I meant to test and which ones I didn't. However, it probably means I'm not mocking correctly, or that my code is too tightly coupled.
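For anyone wondering what correct mocking buys you here, a minimal sketch (in Python with unittest.mock as a stand-in; the actual project is C# and would use a C# mocking library, and the Parser/Loader names are made up):

```python
from unittest.mock import Mock

# Hypothetical names: a Parser that depends on a Loader.
class Parser:
    def __init__(self, loader):
        self.loader = loader          # dependency injected, not constructed here

    def first_tag(self, path):
        xml = self.loader.load(path)
        return xml.split(">", 1)[0] + ">"

# In Parser's tests, the real Loader never runs, so its lines
# never show up in Parser's coverage numbers.
fake_loader = Mock()
fake_loader.load.return_value = "<body>hello</body>"
result = Parser(fake_loader).first_tag("ignored.xml")
```

Because the dependency is handed in through the constructor, the test can substitute a fake and the coverage report stays scoped to the class under test.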

 

I’ll deal with that later. For now, goodnight.

Time travel

If your idea of a dream career is to travel back in time to engage in a battle of wits with a younger version of yourself, software development is perfect for you. When every line of code is your best, every refactor is an attempt to top the best you could have done, even as recently as a week ago. When the codebase is small, and your improvements in knowledge since the last time you worked on the code are minimal, refactoring is at its hardest. It’s like fighting yourself from a week ago. This is where the rapid learning part of being a developer comes into play.

One of the most important abilities for a developer is being able to look at code and say, "I bet there's a better way to do this." It's second in importance only to the ability to actually find and implement said better way. Refactoring my own recently written code is pushing my capacity for both to the limit. It's hard, but it's very good practice.

Programming with ADD

I recently ran across a comment on Reddit that summarizes pretty well what it’s like having ADD. Allow me to supplement that mostly accurate explanation: ADD sucks but it’s awesome.

It sucks because my subconscious mind, rather than my conscious mind, sets the priorities for my attention. In class, it takes constant effort and discipline just to not start inspecting the instructor's desktop setup or thinking about how outdated the university's website is. (Some pages' HTML is signed by Microsoft FrontPage!) ADD makes it a constant struggle to get to the point where your problems are the same as the ones everyone else has, simply because your brain has its own priorities, which usually have to be forcibly reordered to get anything done.

It’s awesome because on those occasions when your brain’s priorities line up with your own, you can have amazing focus.

Right now, I’m listening to a song on repeat. It’s a fast-paced techno song, one that the part of my brain that usually bothers me can occupy itself keeping up with. In this way, I distract myself from distracting myself long enough to actually get some work done. Even in real life I’m implementing some of my own bugfixes. I’m still learning how to harness my own brain and make it useful, but on the way I’m having fun and getting things done.

 

Diving in

My project is undergoing a radical restructuring. By which I mean I'm taking all of the code and moving it around or rewriting it. Basically, I'm starting from the ground up, except that I have skills and resources (the existing code) available that I didn't have before.

I pretty much just jumped in, started defining new classes and obsoleting old ones, broke a bunch of stuff, and started putting it back together better. I’m almost done with the first stage. It feels good knowing that although this task seemed daunting at first, just by going at it for a while I’ve plowed through most of it.

 

Also, some of my blog posts will become shorter and more frequent, I think. Maybe 2 short ones and a long one each week.

A Good Place to Start

It works!

My project is, at a basic level, functioning.

For now it’s called the Data Injector, but I’m trying to come up with a better name. Maybe by the time you’re done reading, you’ll have thought of one. If you do, let me know, I might very well use it!

It starts with an .odt file. This is a fairly unassuming open document format, and one that both Microsoft Office and the open-source alternative LibreOffice can open, edit, and save to. If I say something about this document that doesn’t make sense, or if you’re just curious, I recommend my article on the subject, since it was written specifically in preparation for the one you’re reading now.

However, this is not just any .odt file. It has… input fields! Which are actually fairly boring. They don't really do anything interesting, except subtly let you know they're there, usually by highlighting themselves gray. They also let you name them, and they're quite easy to put in. The really interesting part is what you put inside them, and what that lets you do.

Remember when we took one of these .odt documents apart and put it back together? (Yeah, sorry I haven't played with a .docx file yet. I've been busy. Hopefully soon you'll see with what.) Well, there was that content.xml file. These input fields insert special tags that are named whatever you chose to name the input field. In our case, we named them "Template" so as to make it obvious what they're for.

But that's not all. It also has… sections and script tags! Which, like input fields, mark certain things in the XML. Sections divide the page into, well, sections; they're kind of like paragraphs, but with more visibly defined upper and lower edges. Script tags are like input fields, except they don't occupy any space in the document; they just show up as little gray rectangles that can go anywhere without moving anything else. They're intended for containing computer code to be read and executed, and we'll be using them for just that. As I'll explain later, though, in our case they go inside the sections.

Microsoft developed a tool called Razor that can automatically detect and execute code formatted in a specific way in a document. The word “document” in the computer world generally refers to any file that is intended to be read rather than executed. We’re working with XML, which is a language used to describe things, and is formatted nicely in a way that is easy for computers to read. Thus, any XML file is also an XML document.

Razor is usually used to dynamically generate content in web pages (read: whenever you need it, as opposed to once, never to change again). For this reason, it works very nicely with markup languages like HTML and XML, which are designed to contain and present that content. Since it works so well with XML, it's not terribly difficult to use it to insert things into a normal XML document (like content.xml from the .odt file) instead of a web page.

There was already a library that takes a string as a template and gets its data from a POCO (Plain Old C# Object) that you hand it as a model. Using the Razor parsing engine, it finds any Razor statements in the template and fills in the data from the model. (A string is any piece of information represented as text rather than in a form designed for the computer's convenience; the name comes from stringing a bunch of characters together.)
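In miniature, the idea looks something like this (a toy Python stand-in for the C# Razor library, handling only simple @Model.Property substitutions):

```python
import re

# A toy stand-in for the Razor engine: replace @Model.X with model["X"].
def render(template: str, model: dict) -> str:
    return re.sub(r"@Model\.(\w+)",
                  lambda m: str(model[m.group(1)]),
                  template)

greeting = render("Hello, @Model.Name! Today is @Model.Day.",
                  {"Name": "Shaun", "Day": "Monday"})
```

Real Razor does far more (control flow, C# expressions, compilation), but the template-plus-model shape is the same.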

I had a library, and I had a template. Models I could make on my own, at least for testing purposes. Most of my problems came from either not being familiar with Razor (I'm getting there, little by little) or from getting the components and services to work together. Razor relies heavily on @ signs, and we needed to be able to have them in the document without Razor trying to parse them. This is where things like input fields, script tags, and sections come in handy.

Loading up the XML file and being able to change it was fairly easy. Correctly transforming it into what we needed was harder, but doable. Control-flow statements (where sometimes we want to do the same thing multiple times with different information, or decide whether to display one piece of data based on the value of a different one) were more… interesting.

Regular insertion of information just uses input fields, which are easy to find and process. These control-flow statements live in script tags that are actually inside the section tags they control. This presents an interesting problem: how do you execute statements that affect a block of XML which itself contains those statements? The answer was actually pretty simple, but required some knowledge I didn't have at the time about how Razor works. The solution was to modify the XML so that the Razor statement was removed from inside the script tag and placed around the entire section, putting the whole block inside the Razor statement.

For anyone I lost somewhere in there, it looked like this:

<main document>
<section>
<script>
Razor expression ( )
<end script>
<end section>
<rest of document>

 

And afterward it looked like this:

<main document>
Razor expression (
<section>
<end section>
) [This marks the end of the Razor expression]
<rest of document>

which is pretty neat.
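That before-and-after can be sketched as code, too (a toy Python version with simplified, hypothetical tag names; the real content.xml uses namespaced ODF tags):

```python
import re

# Simplified, hypothetical tags; the real files use namespaced ODF
# tags like <text:section>.
SECTION = re.compile(
    r"<section>(.*?)<script>(.*?)</script>(.*?)</section>",
    re.DOTALL,
)

def lift_script(xml: str) -> str:
    """Move the Razor statement out of its <script> tag so it wraps
    the enclosing <section> instead."""
    def wrap(match: re.Match) -> str:
        statement = match.group(2).strip()               # e.g. "@if (Model.Show)"
        body = (match.group(1) + match.group(3)).strip()  # section minus the script
        return f"{statement} {{\n<section>{body}</section>\n}}"
    return SECTION.sub(wrap, xml)
```

After this rewrite, Razor sees an ordinary statement wrapping an ordinary block of markup, so conditionals and loops over whole sections just work.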

So after writing tests to ensure it was working as intended, version 1.0 was done. It still needs internal revision and better testing, both of which I'm working on. It also has a large amount of expansion coming, and will, I hope, eventually be a lot more interesting than it is now. But it already does a small part of what it's eventually supposed to do, and it does so decently well.

 

Speaking of which, I’ve explained to you how this works, but not directly shown you what it does. How about a little show for our show and tell?

(Disclaimer as before: These images probably look really fuzzy embedded in my post. Click on them with your middle mouse button/mouse wheel and they will open in a new tab, with much better quality. You won’t lose your spot on the post.)

 

[Image: 2014-06-23_10-25-31]

This is what my template document looks like before running through my Data Injector. Note the gray parts; those are input fields, and will be detected, interpreted, and filled in. The second line contains things that might normally be executed as Razor statements, but because of computer magic the library knows only to interpret things inside the input fields as Razor.

After running the template through the program, with the appropriate model, it looks like this:

[Image: 23th_of_6]

(First rule of programming: the computer will do what you tell it to, regardless of whether that was what you actually wanted it to do)

While the template and model here could use a bit of polishing, the program itself worked correctly. There are plenty of applications for this sort of thing, including quite a few that I never would have considered. How many can you think of, and what do you think is a better name for it than the rather bland Data Injector?  Let me know in the comments. Comments are good.

Hope you enjoyed this, and I'll see (read?) you next time. (Which will hopefully come after a gap closer to the one between my first two posts than the one between this post and the previous one.)

Entering the Library

I started my internship/apprenticeship program this week. Part of the process is keeping a blog about my progress. Since blogging is something I've occasionally wished I did more often, this is a pleasant opportunity. I'm sure as I start working these posts will become quite technical, but for the moment I can wax poetic.

As I write this I am embarking on a journey into a source of knowledge, a library from which I can draw valuable information. Some of it is timeless, and its wisdom will shape my actions from here forward. Other knowledge is more temporary, growing obsolete as technology advances. This is not exactly a bad thing, as it enforces a constant state of learning.

Programming is fascinating in the rate at which the available knowledge expands. To be a great programmer one must stay on top of the curve. I have merely begun to try to swim upstream in the cascade of information available, but it is not an altogether unpleasant experience.

I began reading a book as part of my apprenticeship, The Pragmatic Programmer by Andrew Hunt and David Thomas. It reads less like a cohesive book than an encyclopedia, with many independent sections connected together in a web of related topics rather than linearly. As I read I could feel my brain absorbing and processing information, a feeling I unfortunately had not experienced in a while. I had nearly forgotten what learning, the act of seeking out information that is truly interesting and useful and adding that information to the store of tools available, felt like. I would prefer not to forget again. I enjoy learning, and programming offers not only an opportunity but a necessity to learn and continue learning for the entirety of my career.

However, the title of my post has multiple meanings. You see, a library is a source of information. As any programmer knows, anything that takes place in the digital world is simply a representation of information of some kind. Hence it is no accident that any new tool added to a programmer’s arsenal is often called a library. You see, as an apprentice I also will eventually have a project of my own, an open-source tool I will develop that will then be available for all to use. I am still building knowledge in preparation for working on this project.


There is a certain beauty to starting a new project file, akin to setting foot in an open field. Perhaps the project comes with certain infrastructure pieces according to your taste, the field set already with an easily modifiable frame and some raw materials. Either way, you’re setting foot into the beginning of something. I’m looking forward to standing in that field and watching as I build up my own library from the ground up. In both senses of the word.

What’s in a Word?

The answer is a surprising amount. Actually, I’m not working with Microsoft Word but with the free alternative LibreOffice, which uses the OpenDocument .odt format, but the principle is the same. (After the OpenDocument XML-based standard caught on, Microsoft actually reworked their own file format to use XML. XML stands for eXtensible Markup Language, and is used for storing data in a format that is easy for computers to understand. You can perform this unzip trick with .docx and other .***x MSOffice files, but I believe it doesn’t work with, for instance, .doc files. [After all, the X in that extension stands for XML{Which is kind of amusing when you consider that they’re nesting abbreviations at that point<Kind of like these nested peripheral comments?|Yes, kind of like these nested peripheral comments. Oh great, here come the silly-looking closures. Maybe it’ll look like a guy with a hat.|>}])

Yup, definitely a guy with a hat.

(Incidentally, apparently on at least one occasion a criminal has been caught by using the information we are about to access. You’ll see what kind of information I mean in a minute, but for now, enough parenthetical commentary, let’s dig into this file!)

These word processors use a technique for storing their information that may seem a bit odd at first, but is completely reasonable after some thought. Every document you have ever written in one of them is actually several XML files, all zipped together. If you're curious to see the innards of a document for yourself, unzip it using your tool of choice (I used 7-Zip) and take a look inside. For this example I made a brand-new document in LibreOffice and typed "This is some text content." into it. Then I saved and closed the file, and unzipped it.
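If you'd rather let a script do the peeking, a few lines of Python will list the archive's contents (point it at any saved .odt):

```python
import zipfile

def list_odt(path: str) -> list[str]:
    """Return the names of the files packed inside an .odt archive.
    Works because an .odt is just a zip file with a different extension."""
    with zipfile.ZipFile(path) as odt:
        return odt.namelist()
```

You should see entries like mimetype, content.xml, styles.xml, settings.xml, and meta.xml, which are exactly the files we're about to walk through.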

(PSA: WordPress appears to decrease the quality of images posted here, probably so you can all load the page quickly. As we come to each image, just control+click or middle mouse click on it to open it in a new tab. This will show you the image at full resolution, and you’ll keep your place on this page.)

[Image: Content]

There's quite a lot here for a single, simple line of text in a document. Let's start with content.xml; we should be able to find our actual content there pretty easily, right?

[Image: WellMaybeNot]

I don't see it anywhere… wait… that's a single, really long line of XML. Really, really long. With no line breaks. Ew. After a quick find-and-replace (I replaced the < character with a line break followed by itself, representing the line break with \n because of computer magic), it looks a lot better:

[Image: ContentXml]


Ah, there it is, down near the bottom. But what's the rest of this "content" stuff? It's configuration data: information that tells the word processor how to format what you type into it. Things like fonts, styling, and the format of the document itself. (Yes, that top line is still absurdly long. I think it contains a bunch of namespace information and references to specifications, which, while interesting, are outside the scope of this post.)
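For the curious, that find-and-replace is trivial to reproduce in code (Python here; the XML snippet is heavily abbreviated relative to the real file):

```python
# content.xml as saved: one very long line (heavily abbreviated here).
raw = ("<office:document-content><office:body>"
       "<text:p>This is some text content.</text:p>"
       "</office:body></office:document-content>")

# The same trick: insert a line break before every "<".
readable = raw.replace("<", "\n<").lstrip("\n")
```

(A proper XML pretty-printer would handle indentation too, but the blunt replace is all it takes to make the file skimmable.)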

But now you must be thinking, "There's a fair amount of information here, but surely not enough to account for every single possibility. If my font and style are here, where are all the other settings?" Well, you'd be right, and you'd probably look no further than…

[Image: Settings]

Let's crack that one open. Whoops, another hugely long line of text. The same search-and-replace makes the file much more palatable, though still quite large (for human eyes, anyway). This one has much more in it than the last one (almost 200 lines after the reformatting, compared to 31), but it's actually fairly readable. I invite you to explore it a bit. (If you're feeling daring, modify some of the variables, rezip the files, and see how your original document changed! Just be sure to back it up first if you don't want to lose the file entirely; you might corrupt it.)

Hmm, so there's the content and the settings; what's the rest of all this, then? Well, the manifest just lists the important files, so that's not super interesting. The mimetype file is actually fascinating: it defines the format of the document, meaning the file extension is only there for the benefit of the user. But it's not very large, and that's all it does, so let's move on. How about meta.xml? That sounds fun. Let's take a look.

[Image: 2014-05-28_16-44-13]

 

Well, this doesn't have anything at all to do with the content of the document, does it? Well, yes and no. This is the metadata: information about the document itself, such as its date of creation; the version of LibreOffice used to create it; the quantities of characters, words, tables, etc. in it; and so on. This is where, in the metadata of a deleted file on a flash drive, police found the name of the perpetrator of a collar bomb hoax.
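Pulling a field out of meta.xml programmatically is straightforward, too. A small Python sketch (the namespace URI comes from the ODF spec; the sample in the test is trimmed way down from a real file):

```python
import xml.etree.ElementTree as ET

# The ODF meta namespace; meta.xml also uses Dublin Core for some fields.
NS = {"meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"}

def creation_date(meta_xml: str):
    """Pull the creation date out of an .odt's meta.xml, if present."""
    root = ET.fromstring(meta_xml)
    node = root.find(".//meta:creation-date", NS)
    return node.text if node is not None else None
```

The same pattern works for the word counts, editing time, and generator fields; only the element name changes.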

But we're still not done. The styles.xml file contains a list of possible styles (in XML format), basically defining all the styling possibilities open to you. (For more savvy people, editing this file is one way to add or modify the styles built into your word processor. It may not be the best way, but it is a way.) Thumbnails contains just one thumbnail, the one used for all your documents. META-INF just contains another manifest file, listing a different set of files. Configurations2, though, contains a bunch of subfolders.

[Image: Configs2]

Accelerator contains a file called current.xml… which is empty.

Images contains a subfolder called bitmaps… which is empty.

All the others are empty too.

 

This is because LibreOffice, like its predecessor OpenOffice, is very extensible and modifiable; these folders are where your customizations would go. I am currently using a completely unmodified version, but I understand some of the extensions are very useful, and I'm open to recommendations. Yet another reason to prefer open software over the proprietary stuff.

I’m already having quite a bit of fun with this blog, and I hope it will prove useful and/or entertaining to people both technical and not. While I do have my own priorities, this being a work-related blog, I have no qualms about updating it on my own time. Feel free to suggest topics for future posts, especially ones you don’t understand, and I will endeavor to understand and explain them to the best of my ability. Next time I think I’ll explore an actual .docx file, and look at the differences between it and the .odt format. I hope you’ve enjoyed Entering the Library with me.

The observations of an explorer and student in the world of computer science.