Mendeley wordcloud

I recently updated this website, and I then decided to include a wordcloud on my start page. This wordcloud represents the titles and abstracts of all papers I’ve saved in my Mendeley library. Here I will now describe how I made this.

First of all, you need to localize the Mendeley SQLite database. I use a Mac, so I can find it in $HOME/Library/Application\ Support/Mendeley\ Desktop/<email>@www.mendeley.com.sqlite. Just replace <email> with the email address you use for Mendeley. If you can’t find the database, take a look at this Mendeley support page. In R, we’ll then extract the relevant columns from the database.

library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "~/Library/Application Support/Mendeley Desktop/<email>@www.mendeley.com.sqlite")
res <- dbSendQuery(con, "SELECT title, abstract FROM Documents")
dbdata <- dbFetch(res)

Now we have a data frame with two columns containing the title and abstract for every document in the library. We could create a wordcloud directly from this data, but it would probably not look very good since it would contain variants of the same words, i.e. gene and genes or analyzed and analysis. What we want to do is to find the common denominator among these words, and this process is called stemming. I will do this in R using the tm (text mining) package. tm has a data structure called Corpus, but in this simple context, I find it easier to just work with character vectors where each element is a word.

words <- unlist(strsplit(as.character(dbdata), " "))

To make sure that all words look the same, we transform them to lower case and remove punctuation and numbers, and finally stem them.

library(tm)
words <- tolower(words)
words <- removePunctuation(words)
words <- removeNumbers(words)
words.stem <- stemDocument(words)

If we look at these words now, many of them does not make sense. For example, the words “promote”, “promoter”, “promotion” and “promoted” would be all stemmed to the form “promot”, which is not a word (as far as I know). To avoid this gibberish, we can use the stemCompletion function. What it does is that it looks through the stemmed words and transform them to the most common original variant of the stemmed word. This is the default behavior, and you can look at the documentation for stemCompletion for other options.

words.stem <- stemCompletion(words.stem, dictionary = words)

Now we have a character vector of words that should make sense. If we would want to create the wordcloud directly in R, it is possible. However, then you would want to remove common words such as “the”, “of” and “and”. This can be done with the removeWords function together with stopwords that contains a number of stop words for a number of languages. The wordcloud package can then be used to create the actual wordcloud.

In my case I didn’t create the wordcloud in R, but used the online service Wordle. Simply save the words to a text file and just paste it into their service. You can then choose among different color schemes and layouts to produce a wordcloud that you like.

Mendeley wordcloud

R Visualization

Comment


Interrupting an SSH session

I spend the majority of my days connected to remote servers via SSH. It’s a pleasant way of working in many aspects, but since I arrived in the US I’ve experienced my fair share of bad networks. At some points I’ve been thrown out from the network once a minute. In most cases, SSH is able to maintain the connection, but sometimes I have to reconnect. Unfortunately, SSH does not detect this right away. It just stands there, waiting for the connection to come back and doesn’t respond to signals while doing it.

Instead of closing the terminal window to speed up the restart of the connection, I have now found out that it is possible to close the connection using SSH escape sequences. In this case press Enter and then ~. to terminate the session. With ~? you can see other available escape sequences.

bash

Comment


(Last?) First Post

So, I’m doing it again. Creating a blog that I probably won’t write anything in, that is. Anyway, I find this a bit more interesting than my previous attempts. This site was built using Jekyll, a static site generator. When I first heard about it I realized this was something I had been trying to achieve for a while; something where I could easily have a common header and footer for a website and just alter the main content. This is of course possible using e.g. PHP, but every time I’ve started such a project, I’ve ended up creating all kinds of “nice to have”-features that just ended up messing up the whole thing. I’ve also tried using Wordpress, but it’s just too bulky for my needs.

And yes, it’s quite fascinating that I haven’t found out about this until now.

I even managed to migrate my old Wordpress posts into Jekyll very smoothly using the instructions from Hadi Harari.

Hopefully this will inspire me to write about my (not so) adventurous life as a PhD student in bioinformatics.

Comment


PLOS figures in R

I’m preparing a manuscript for PLOS ONE and saw this in the figure guidelines:

Figure text must be in Arial font, between 8 and 12 point.

Easy, I thought. Just a matter of specifying a font family in the device I print to.

Think again.

Apparently, R does not support using different font definitions. Of course I’m not the first person to encounter this problem. In an excellent post by Gavin Simpson he explains how to come around this, and even to embed fonts in PDFs printed by R. In short, take a look at the extrafont package. It enables you to use fonts on your system in your R figures.

To get this to work properly, I had to specify one single directory to look when importing the fonts in since I apparently had multiple copies of Arial on my system (Mac OSX 10.9). This can be done easily by using font_import(path=c('/Library/Fonts'), recursive=FALSE) to ensure that only one copy is used.

Comment


Authors and affiliations in LaTeX

Oh, that's right, I have a blog. Had almost forgotten about it.

I just faced the problem of setting multiple authors in a LaTeX document together with their affiliations. On top of that, one of the authors had multiple affiliations. Google returned this excellent answer on TeX StackExchange. In short; use the authblk package:

\documentclass[a4paper,11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{authblk}

\title{More than one Author with different Affiliations}
\author[1]{Author A\thanks{A.A@university.edu}}
\author[2]{Author D\thanks{D.D@university.edu}}
\author[1,2]{Author E\thanks{E.E@university.edu}}
\affil[1]{Department of Computer Science, \LaTeX\ University}
\affil[2]{Department of Mechanical Engineering, \LaTeX\ University}

\begin{document}
  \maketitle
\end{document}

\thanks is useful to add footnotes to the authors, in this case their email addresses.

Comment


Python and XML

Today I faced a task where I had to parse huge XML-files. And when I say huge, I mean 6-14 GB. My weapon of choice is Python, since I’m comfortable with it. However, I had never parsed XML with it before. Because of the size of the files, it was unfeasible to load the entire file into memory, and for me that was not necessary either.

After Googling for a while, I found that many people recommend the ElementTree module and it’s C-equivalent cElementTree. The function iterparse proved to be a real life saver. By iterating through the element tree and deleting elements as you go, you will only consume small amounts of memory. The following snippet is more or less taken from the documentation.

import xml.etree.cElementTree as ET

# Get an iterable tree
context = ET.iterparse('huge_xml_file.xml', events=('start', 'end'))
# Get an iterator
context = iter(etree)
# Get the root element to be able to clear it
event, root = context.next()

# Iterate through the tree and to what you have to do
for event, elem in context:
    if event == 'start' and elem.tag == '{some_namespace}some_tag_name':
        # The element was opened here
    if event == 'end' and elem.tag == '{some_namespace}some_tag_name':
        # Now the whole element is available! Process it here.
        # When done with it, call
        root.clear() # to free up some memory

I don’t like the way I had to specify the namespaces, but I guess there’s a better way of doing it. When running this on a 6.4 GB XML file (161 million rows, 81 million elements), the code above did not consume more than 15 MB of memory. I don’t remember how long it took, but it was reasonably fast.

File formats Python

Comment


Minecraft: Pi Edition

I've never played Minecraft (if you don't count my futile attempts at playing Minicraft on my Windows Phone), but now it seems a (free) version of it is coming to the Raspberry Pi. Not only can you play it like the ordinary Minecraft, you can also program it in real time! Check out the video below.

So, if I haven't gotten started with Minecraft yet, this is a golden opportunity for some procrastination! It's hopefully going to be released by the end of this year, and I can barely wait!

Minecraft Procrastination Python Raspberry Pi

Comment


Raspberry Pi Camera

My Raspberry Pi has been collecting dust for far too long now. I've had some projects in mind, but I haven't found any motivation for realising them yet. Now however, the Raspberry Pi camera is on its way! Hopefully it will be available sometime in the first half of next year, and I want one. For a while I've been thinking that I want to try to implement some machine learning methods for e.g. facial recognition, just for fun. This seems like a golden opportunity! It shoots 1080p at 30 fps according to DesignSpark.

Since I have a webcam in my computer, that should work as well, but I'm having trouble installing the Python bindings for OpenCV. It feels like I've tried every configuration possible, but I still can't get it to work. If I get some time on my hands I will primarily try to fix that, but the Raspberry Pi camera is a tempting option. Even if I get it to work on my Mac, I will probably buy it anyway.

Machine learning Raspberry Pi

Comment


Slow going

As of now, I’ve had Internet in my apartment for a few weeks, and as soon as I got it I apparently stopped posting here. Constructive, yeah I know.

Anyhow, today I had my start seminar for my PhD where I introduced my project to the rest of the group plus some other people. It went better than I expected, and I’m really excited to get started for real. However, there’s a lot of administrative stuff that has to be solved before I can really dig in.

I really like it at the University. The predecessor to the University of today was founded in 1859, and in 2005 it got University status and was renamed to the Norwegian University of Life Sciences. Some of the buildings are quite old, but in a charming way. With narrow corridors and strange floor plans, it was a mess finding your way around in the beginning. Just next to my office there’s actually a small dairy. Sometimes they get cream left over, and it’s just to go there and fill a bottle. Pretty nice!

Ås Internet

Comment


SpeechJammer

I still don't have an Internet connection in my apartment, but now it should be on it's way!

Meanwhile, I found that this years Ig Nobel Prize in acoustics was awarded to Kazutaka Kurihara and Koji Tsukada for creating the SpeechJammer. It's a device that disturbs people's speech by simply playing back what the person is saying, but with a tiny delay. I've experienced this myself working at the technical support of a major swedish ISP. Sometimes I could hear my own voice in the headset, with a delay, and it was tremendously difficult trying to talk to the customer without stuttering.

If I only thought of this application back then (2007), I could've been awarded an Ig Nobel Prize instead. Ah, well...

Research

Comment