I recently updated this website, and I then decided to include a wordcloud on
my start page. This wordcloud represents the titles and abstracts of all papers
I’ve saved in my Mendeley library. Here I’ll describe how I made it.
First of all, you need to locate the Mendeley SQLite database. I use a Mac,
so I can find it in
$HOME/Library/Application\ Support/Mendeley\ Desktop/<email>@www.mendeley.com.sqlite.
Just replace <email> with the email address you use for Mendeley. If you can’t
find the database, take a look at this Mendeley support page.
In R, we’ll then extract the relevant columns from the database.
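A sketch of this step using the RSQLite package; the Documents table and its title and abstract columns are assumptions about the Mendeley schema, so adjust the query if your database differs:

```r
library(RSQLite)

# Path to the Mendeley database (replace <email> as described above).
db_path <- path.expand(
  "~/Library/Application Support/Mendeley Desktop/<email>@www.mendeley.com.sqlite")

con  <- dbConnect(SQLite(), db_path)
docs <- dbGetQuery(con, "SELECT title, abstract FROM Documents")
dbDisconnect(con)

head(docs)
```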
Now we have a data frame with two columns containing the title and abstract for
every document in the library. We could create a wordcloud directly from this
data, but it would probably not look very good since it would contain variants
of the same words, e.g. gene and genes or analyzed and analysis. What we want
to do is to reduce these words to a common base form, and this
process is called stemming. I will do
this in R using the tm (text mining) package.
tm has a data structure called Corpus, but in this simple context, I find
it easier to just work with character vectors where each element is a word.
To make sure that all words look the same, we transform them to lower case
and remove punctuation and numbers, and finally stem them.
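A sketch of that preprocessing, assuming docs is the data frame from the query above:

```r
library(tm)

# Combine titles and abstracts and split them into individual words.
text  <- paste(docs$title, docs$abstract)
words <- unlist(strsplit(text, "\\s+"))

# Normalise: lower case, strip punctuation and numbers, drop empty strings.
words <- removeNumbers(removePunctuation(tolower(words)))
words <- words[words != ""]

# Stem each word (stemDocument uses the SnowballC stemmer under the hood).
stemmed <- stemDocument(words)
```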
If we look at these words now, many of them do not make sense. For example,
the words “promote”, “promoter”, “promotion” and “promoted” would all be stemmed
to the form “promot”, which is not a word (as far as I know). To avoid this
gibberish, we can use the stemCompletion function. It looks through the
stemmed words and transforms each of them into the most common original
variant of the stemmed word. This is the default behavior, and you can look at
the documentation for stemCompletion for other options.
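A sketch, using the unstemmed word vector from above as the completion dictionary:

```r
# Complete each stem back to the most frequent original variant found in the
# unstemmed word list ("prevalent" is the default completion heuristic).
completed <- stemCompletion(stemmed, dictionary = words, type = "prevalent")

# e.g. "promot" should now be completed to whichever of "promote", "promoter",
# "promotion" or "promoted" occurs most often in the library.
```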
Now we have a character vector of words that should make sense. It is possible
to create the wordcloud directly in R, but then you would first want to remove
common words such as “the”, “of” and “and”. This can be done with the
removeWords function together with stopwords, which provides lists of
stop words for a number of languages. The wordcloud package
can then be used to create the actual wordcloud.
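A sketch of that route, continuing from the completed words above:

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Drop common English stop words and the empty strings they leave behind.
cleaned <- removeWords(completed, stopwords("english"))
cleaned <- cleaned[cleaned != ""]

# Count word frequencies and draw the cloud.
counts <- sort(table(cleaned), decreasing = TRUE)
wordcloud(names(counts), freq = as.numeric(counts), max.words = 150,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```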
In my case I didn’t create the wordcloud in R, but used the online service
Wordle. Simply save the words to a text file and paste
its contents into their service. You can then choose among different color schemes
and layouts to produce a wordcloud that you like.
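For the Wordle route, the only code needed is writing the completed words out, one per line (Wordle sizes each word by how often it occurs in the pasted text):

```r
writeLines(completed, "wordcloud_words.txt")
```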
I spend the majority of my days connected to remote servers via SSH. It’s a
pleasant way of working in many respects, but since I arrived in the US I’ve
experienced my fair share of bad networks. At times I’ve been dropped from the
network as often as once a minute. In most cases, SSH is able to maintain the
connection, but sometimes I have to reconnect. Unfortunately, SSH does not detect
this right away. It just sits there, waiting for the connection to come back,
and does not respond to signals in the meantime.
Instead of closing the terminal window to get the connection restarted faster,
I have now learned that the connection can be terminated using SSH escape
sequences: press Enter and then ~. to end the session. With ~? you can see the
other available escape sequences.
So, I’m doing it again. Creating a blog that I probably won’t write anything
in, that is. Anyway, I find this a bit more interesting than my previous
attempts. This site was built using Jekyll, a static site generator.
When I first heard about it I realized this was something I had been trying to
achieve for a while; something where I could easily have a common header and
footer for a website and just alter the main content. This is of course possible
using e.g. PHP, but every time I’ve started such a project, I’ve ended up
creating all kinds of “nice to have” features that just messed up the whole
thing. I’ve also tried WordPress, but it’s just too bulky for my
needs.
And yes, it’s quite fascinating that I didn’t find out about this until now.
I’m preparing a manuscript for PLOS ONE and saw this in the figure guidelines:
Figure text must be in Arial font, between 8 and 12 point.
Easy, I thought. Just a matter of specifying a font family in the device I print to.
Think again.
Apparently, R’s standard graphics devices only support a limited set of fonts
out of the box. Of course I’m not the first person to encounter this problem.
In an excellent post, Gavin Simpson explains how to get around this, and even
how to embed fonts in PDFs printed by R. In short, take a look at the extrafont package.
It enables you to use fonts on your system in your R figures.
To get this to work properly, I had to specify a single directory to look in
when importing the fonts, since I apparently had multiple copies of Arial
on my system (Mac OS X 10.9). This is easily done with
font_import(path=c('/Library/Fonts'), recursive=FALSE), which ensures that only
one copy is used.
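Putting the pieces together, a sketch of the whole workflow (the font_import() call is kept as above, and the plot is just a placeholder figure):

```r
library(extrafont)

# One-time import of the fonts, restricted to a single directory so that
# only one copy of Arial is picked up, then register them with R.
font_import(path = c('/Library/Fonts'), recursive = FALSE)
loadfonts()

# Print a figure with 10 pt Arial text and embed the font in the PDF
# (embed_fonts() requires Ghostscript to be installed).
pdf("figure.pdf", width = 5, height = 4, family = "Arial", pointsize = 10)
plot(rnorm(100), xlab = "Index", ylab = "Value")
dev.off()
embed_fonts("figure.pdf")
```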
Oh, that's right, I have a blog. Had almost forgotten about it.
I just faced the problem of setting multiple authors in a LaTeX document together with their affiliations. On top of that, one of the authors had multiple affiliations. Google returned this excellent answer on TeX StackExchange. In short: use the authblk package. A minimal sketch (the names, affiliations and addresses are of course placeholders) could look like this:
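```latex
\documentclass{article}
\usepackage{authblk}

\title{An Example Title}

% Numbers in brackets tie each author to one or more affiliations.
\author[1]{First Author\thanks{first.author@example.org}}
\author[1,2]{Second Author\thanks{second.author@example.org}}
\affil[1]{Department of Something, Some University}
\affil[2]{Another Institute, Another University}

\begin{document}
\maketitle
\end{document}
```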
\thanks is useful for adding footnotes to the authors, in this case their email addresses.
Today I faced a task where I had to parse huge XML-files. And when I say huge,
I mean 6-14 GB. My weapon of choice is Python, since I’m comfortable with
it. However, I had never parsed XML with it before. Because of the size of the
files, it was unfeasible to load the entire file into memory, and for me that
was not necessary either.
After Googling for a while, I found that many people recommend the
ElementTree module and its C equivalent, cElementTree. The function
iterparse proved to be a real lifesaver. By iterating through the
element tree and deleting elements as you go, you will only consume small
amounts of memory. The following snippet is more or less the pattern from
the documentation; the namespace, element names and file name are placeholders.
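```python
try:
    import xml.etree.cElementTree as ET  # C implementation, much faster
except ImportError:
    import xml.etree.ElementTree as ET

# Placeholder namespace and element name; both depend on the actual file.
NS = '{http://www.example.org/schema}'


def parse_huge_xml(path):
    """Iterate over every <entry> element without loading the whole tree."""
    context = iter(ET.iterparse(path, events=('start', 'end')))
    _, root = next(context)  # grab the root element from the first event
    for event, elem in context:
        if event == 'end' and elem.tag == NS + 'entry':
            yield elem.findtext(NS + 'name')
            # Drop already-processed elements to keep memory usage low.
            root.clear()


if __name__ == '__main__':
    for name in parse_huge_xml('huge_file.xml'):
        print(name)
```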
I don’t like the way I had to specify the namespaces, but I guess there’s a
better way of doing it. When running this on a 6.4 GB XML file (161 million
rows, 81 million elements), the code above did not consume more than 15 MB of
memory. I don’t remember how long it took, but it was reasonably fast.
I've never played Minecraft (if you don't count my futile attempts at playing Minicraft on my Windows Phone), but now it seems a (free) version of it is coming to the Raspberry Pi. Not only can you play it like the ordinary Minecraft, you can also program it in real time! Check out the video below.
So, if I haven't gotten started with Minecraft yet, this is a golden opportunity for some procrastination! It's hopefully going to be released by the end of this year, and I can barely wait!
My Raspberry Pi has been collecting dust for far too long now. I've had some projects in mind, but I haven't found any motivation for realising them yet. Now however, the Raspberry Pi camera is on its way! Hopefully it will be available sometime in the first half of next year, and I want one. For a while I've been thinking that I want to try to implement some machine learning methods for e.g. facial recognition, just for fun. This seems like a golden opportunity! It shoots 1080p at 30 fps according to DesignSpark.
Since I have a webcam in my computer, that should work as well, but I'm having trouble installing the Python bindings for OpenCV. It feels like I've tried every configuration possible, but I still can't get it to work. If I get some time on my hands I will primarily try to fix that, but the Raspberry Pi camera is a tempting option. Even if I get it to work on my Mac, I will probably buy it anyway.
As of now, I’ve had Internet in my apartment for a few weeks, and as soon as I
got it I apparently stopped posting here. Constructive, yeah I know.
Anyhow, today I had my start seminar for my PhD where I introduced my project
to the rest of the group plus some other people. It went better than I
expected, and I’m really excited to get started for real. However, there’s a
lot of administrative stuff that has to be solved before I can really dig in.
I really like it at the University. The predecessor to the University of today
was founded in 1859, and in 2005 it gained university status and was renamed
the Norwegian University of Life Sciences. Some of the buildings are quite old,
but in a charming way. With narrow corridors and strange floor plans, it was a
mess finding your way around in the beginning. Just next to my office there’s
actually a small dairy. Sometimes they have cream left over, and you can just
go there and fill a bottle. Pretty nice!
I still don't have an Internet connection in my apartment, but now it should be on its way!
Meanwhile, I found that this year's Ig Nobel Prize in acoustics was awarded to Kazutaka Kurihara and Koji Tsukada for creating the SpeechJammer. It's a device that disturbs people's speech by simply playing back what they are saying, but with a tiny delay. I've experienced this myself, working in technical support at a major Swedish ISP. Sometimes I could hear my own voice in the headset, with a delay, and it was tremendously difficult to talk to the customer without stuttering.
If only I had thought of this application back then (2007), I could've been awarded an Ig Nobel Prize instead. Ah, well...