Data Viz 101 with Jer Thorp and Wes Grubbs

Much of this session was spent actually writing code to make a demo program, but I captured the parts that were more chatty.  The picture above is as far as I got in my Processing demo… after that I got lost and started blogging.  Jer has the presentation available if anyone wants it.

Jer – I often get asked: what data set should I use?  I say: make it meaningful to you, and don’t overlook the possibility of making your own data.  Your email, for example, with its subject lines and time stamps, is a huge source of information that’s personally relevant to you.

The data we’ll use today is clean, but very often data is not.  The reality is you’ll have to clean it – a combination of manual labor and some newly developed tools that help us normalize large data sets.

Hack the data into understandable objects.  They have a color, a weight.  How do you break the data up into little pieces – how do you parse it?  The data that comes to you usually involves objects, so let those define themselves.  This isn’t that tricky as a concept, but it’s useful.

Jer – I’m an amateur cook.  I like the concept of Mise en place.  You have the stuff all ready to go in individual ramekins, then you toss it together.  The more time you can spend getting this ready to go, the faster and better it will go when you’re cooking.  So our first job is to prepare the data into these containers.

Wes – Then once you start rendering the data, you can see the weight of things on the screen.  You can see things quickly (see Wes’ company, Pitch Interactive).

Jer — Visualization isn’t just the end product, it’s good process as well.  You have to look at what the data look like before you decide what it’s going to be in the end.

So the steps are:

  • 1: Get the data
  • 2: Parse the data into useful objects
  • 3: Render the objects on screen

Jer – I was at Strata, O’Reilly’s big data conference, and was given a paper listing objects observed by the Kepler telescope – what are supposed to be planets orbiting distant stars.  They were in a PDF (which Processing can’t read), so I began by putting them into a comma-separated file.  The idea of useful objects is very simple here: planets.  I then developed a list of things I wanted to visualize: their size, their distance from their star, their rotation speed, their temperature.  So I ended up with a list of 1236 planets which I could then go and ask to do things.  The translation was easy – temperature was mapped to color, and the other aspects mapped directly.  (Jer shows the Kepler Exoplanet Candidates project.)
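A sketch of what those planet objects might look like in plain Java – the column order and field names here are my assumptions for illustration, not Jer’s actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of step 2: parse CSV rows into Planet objects.
// The column layout (radius, distance, period, temperature) is assumed,
// not taken from the actual Kepler file.
class Planet {
    float radius;       // planet size
    float distance;     // distance from its star
    float period;       // rotation/orbital speed
    float temperature;  // temperature, later mapped to color

    Planet(String csvRow) {
        String[] cols = csvRow.split(",");
        radius = Float.parseFloat(cols[0]);
        distance = Float.parseFloat(cols[1]);
        period = Float.parseFloat(cols[2]);
        temperature = Float.parseFloat(cols[3]);
    }
}

public class ParsePlanets {
    // Turn raw CSV rows into a list of objects we can "go and ask to do things".
    static List<Planet> parse(String[] rows) {
        List<Planet> planets = new ArrayList<>();
        for (String row : rows) {
            planets.add(new Planet(row));
        }
        return planets;
    }

    public static void main(String[] args) {
        String[] rows = { "1.2,0.05,3.5,1500", "0.8,0.9,365.0,288" };
        List<Planet> planets = parse(rows);
        System.out.println(planets.size() + " planets parsed");
    }
}
```

In a Processing sketch the rows would come from `loadStrings()` on the CSV file rather than a hard-coded array.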

Wes – We’re comparing the Bible and the Quran, to see how the same words or concepts are used in each (he shows the project).

Jer – I want to talk a bit about the different formats data might come in.  Once you’ve seen each of them, you’ll know what you have to do when data arrives in any of these formats.  You want to build your system so the format of the data doesn’t matter.  You may see it in JSON, CSV, or XML – and other formats which we won’t discuss now.  This is Processing-centric.  XML is structured and easy, though not flexible.  JSON is flexible, but not structured or easy.  And CSV is easy, but not structured, and totally inflexible.

Wes – The way I try to explain it to new clients is: just give it to me in CSV; you can export it from Excel in that format.  Processing doesn’t have built-in support for CSV, so we use Java libraries like opencsv.

Jer – I feel like XML had a revolutionary effect on the world.  All data that’s stored in XML is so shareable.

Wes – Field names are included and repeated for every record, so XML can get bulky and take a lot of development time.

Jer – So you can either optimize it or put it into a leaner form.

Wes – JSON, JavaScript Object Notation, stores data as JavaScript objects; the lack of named structure can make it difficult to read.  It’s like a condensed version of XML.

Jer – But the border between JSON and Java is not easy to deal with, so it’s not always ideal to bring JSON into Processing.
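To make the trade-offs concrete, here is the same hypothetical record (invented for illustration) written out in each of the three formats:

```text
XML  (structured, verbose – tag and attribute names repeat on every record):
  <feeling city="Austin" gender="male" mood="accepted" weather="sunny"/>

JSON (leaner – keys are still named, structure comes from the nesting):
  {"city": "Austin", "gender": "male", "mood": "accepted", "weather": "sunny"}

CSV  (leanest – one header row, then bare values, no nesting at all):
  city,gender,mood,weather
  Austin,male,accepted,sunny
```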

They show Jonathan Harris’ and Sep Kamvar’s We Feel Fine, a Processing application on the web from the early 2000s.  It tracks when bloggers use the word “feel” in their posts and follows them, adds the weather, sorts by sex… the feeling is the object here.  Search for males that feel accepted: what city was it in, what was the weather in that city?  This project has an API, so we can go and query that database’s 7 years of feelings that are being scraped from LiveJournal and BlogSpot.

Let’s open Processing.  In 15–20 minutes we’d like to get our first visualization of this data.  Drag the data file from your Finder window into the sketch.

(We add some code).

You’re building an actual running computer program … not a file that gets played by some other program.  It’s actually running on your machine.  Processing takes things that are hard to do in Java and makes them easy.  That hard work is done in the background by Ben and Casey.

It’s very common to be dealing with a bunch of data points that aren’t uniform – not all of these data points will have data for gender.  You used to have to build a font object to render text; now Processing puts in a default font.  The default fill in Processing is white.
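A minimal sketch of guarding against those non-uniform records – the record layout here is an assumption for illustration, not the actual We Feel Fine schema:

```java
// Hypothetical sketch: not every record has a gender field, so fall back
// to a default value instead of crashing on the gap.
public class MissingFields {
    // Assumed layout: [feeling, gender, city]; gender may be missing or empty.
    static String genderOf(String[] record) {
        if (record.length < 2 || record[1].isEmpty()) {
            return "unknown";
        }
        return record[1];
    }

    public static void main(String[] args) {
        System.out.println(genderOf(new String[]{"accepted", "male", "Austin"}));
        System.out.println(genderOf(new String[]{"lonely", ""}));
    }
}
```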

We make a little program showing some random feelings and placing them around the screen.

If data is constantly changing, you can get it from the API of the source… in this case, the API for We Feel Fine.
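A sketch of what querying such an API could look like – the host and parameter names below are invented for illustration, not the real We Feel Fine endpoints, so check the project’s actual API documentation:

```java
// Hypothetical sketch of building a query URL for an API like We Feel Fine's.
// The host and parameter names are made up for illustration only.
public class FeelingsQuery {
    static String buildQuery(String feeling, String gender) {
        return "http://api.example.org/feelings?feeling=" + feeling
             + "&gender=" + gender;
    }

    public static void main(String[] args) {
        // The resulting URL would then be fetched with loadStrings() in
        // Processing, or an HTTP client in plain Java.
        System.out.println(buildQuery("accepted", "male"));
    }
}
```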

Wes – It’s a good idea to have as many things defined up top as you can, then just refer back to them.  Remember that index 0 is the very first element in an array; index 1 is the second value.  This always trips people up.  So if I want to see which city has happier bloggers, I can create sets of positive and negative feelings per city.  It’s a simple technique: run city 1, run city 2, and then both of those cities are done.
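A toy version of that city tally in plain Java, which also shows the zero-based indexing Wes warns about – the cities and records are made up:

```java
// Hypothetical sketch: tally positive feelings per city from fake records.
// Index 0 is the FIRST city, index 1 the second – the classic trip-up.
public class CityMood {
    // Each record is {cityIndex, isPositive}; returns positive counts per city.
    static int[] tallyPositive(int[][] records, int numCities) {
        int[] positive = new int[numCities];
        for (int[] r : records) {
            if (r[1] == 1) positive[r[0]]++;
        }
        return positive;
    }

    public static void main(String[] args) {
        String[] cities = { "Austin", "Portland" };  // cities[0] is the first city
        int[][] records = { {0, 1}, {0, 0}, {1, 1}, {1, 1} };
        int[] pos = tallyPositive(records, 2);
        System.out.println(cities[0] + ": " + pos[0]);  // Austin: 1
        System.out.println(cities[1] + ": " + pos[1]);  // Portland: 2
    }
}
```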

A float is a number with decimals, while an integer has no decimals.
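The distinction matters in code because dividing two integers silently drops the decimals:

```java
// Float vs int in practice: integer division throws away the decimals.
public class FloatVsInt {
    public static void main(String[] args) {
        int a = 7;
        int b = 2;
        System.out.println(a / b);          // 3   – integers: decimals dropped
        System.out.println(a / (float) b);  // 3.5 – floats keep the decimals
    }
}
```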

Jer – For color, I just find the minimum temperature of all my planets and the maximum temperature, and then I just fill in the hues between.  This data set is really useful – it’s been running for 7 years.  You can see trends in cities by gender and age.  We only touched the tip of the iceberg.  Next time you see something in a newspaper saying “Psychologists prove people are happiest just after Christmas” or some such thing, look at the source of the dataset – chances are, it’s from the We Feel Fine dataset.  Because it’s so big, it tends to normalize itself a little bit, even if it is just about bloggers.

Jer: There’s a great book for learning Processing and learning programming in general, by Daniel Shiffman – he’s a great teacher.  It’s called Learning Processing.

Wes: There are a ton of JavaScript libraries.  Cinder is great if you’re interested in C++.

Jer: Processing was never meant to be a DataViz language, but among the other things it’s become, we use it for that now, so we’ve added (in 2.0) a lot more ways to incorporate data.  Processing will be the place for novel data visualization going forward.  Publishing to JavaScript is instant.  It’s amazing.