How to do science with a computer: workflow tools and OpenSource philosophy


I have two excellent things on my desk: a Linux Journal article by Andy Wills, and a newly published book by Stefano Allesina and Madlen Wilmes.

They are:

Computing Skills for Biologists: A Toolbox by Stefano Allesina and Madlen Wilmes, Princeton University Press.

Open Science, Open Source, and R, by Andy Wills, Linux Journal

Why OpenSource?

OpenSource science means, among other things, using OpenSource software to do the science. For some aspects of software this is not important. It does not matter too much if a science lab uses Microsoft Word or if they use LibreOffice Writer.

However, it does matter whether you use LibreOffice Calc or a proprietary spreadsheet. And since you are eschewing proprietary spreadsheets anyway, you might as well use the whole OpenSource office package, LibreOffice or an equivalent, and get the OpenSource presentation software, word processor, and spreadsheet from the same place.

OpenSource programs like Calc and R (a stats package), and OpenSource software development tools like Python and the GPL-licensed C compilers, do matter. Why? Because your science involves calculating things, and software is a magic calculating box. You might be doing actual calculations, or production of graphics, or management of data, or whatever. All of the software that does this stuff is, on the surface, a black box, and just using it does not give you access to what is happening under the hood.

But, if you use OpenSource software, you have both direct and indirect access to the actual technologies that are key to your science project. You can see exactly how the numbers are calculated or the graphic created, if you want to. It might not be easy, but at least you don’t have to worry about the first hurdle in looking under the hood that you hit with commercial software: they won’t let you do it.

Direct access to the inner workings of the software you use comes in the form of actually getting involved in the software development and maintenance. For most people, this is not something you are going to do in your scientific endeavor, but you could get involved with some help from a friend or colleague. For example, if you are at a university, there is a good chance that somewhere in your university system there is a computer department that has an involvement in OpenSource software development. See what they are up to, and find out what they know about the software you are using. Who knows, maybe you can get a special feature included in your favorite graphics package by helping your newfound computer friends cop an internal university grant! You might be surprised as to what is out there, as well as what is in there.

In any event, it is generally easy to get involved in OpenSource software projects because they are designed that way. Or, at least, they usually are and always should be.

The indirect benefit comes from the simple fact that these projects are OpenSource. Let me give you an example from the non-scientific world. (It is a made-up example, but it could reflect reality and is highly instructive.)

Say there is an operating system or major piece of software competing in a field of other similar products. Say there is a widely used benchmark standard that compares the applications and ranks them. Some of the different products load up faster than others, and use less RAM. That leaves both time (for you) and RAM (for other applications) that you might value a great deal. All else being equal, pick the software that loads faster in less space, right?

Now imagine a group of trollish deviants meeting in a smoky back room of the evile corporation that makes one of these products. They have discovered that if they leave a dozen key features that all the competitors use out of the loading process, so they load later, they can get a better benchmark. Without those standard components running, the software will load fast and be relatively small. It happens to be the case, however, that once all the features are loaded, this particular product is the slowest of them all, and takes up the most RAM. Also, the process of holding back functionality until it is needed is annoying to the user and sometimes causes memory conflicts, causing crashes.

In one version of this scenario, the concept of selling more of the product by using this performance tilting trick is considered a good idea, and someone might even get a promotion for thinking of it. That would be something that could potentially happen in the world of proprietary software.

In a different version of this scenario the idea gets about as far as the water cooler before it is taken down by a heavy tape dispenser to the head and kicked to death. That would be what would certainly happen in the OpenSource world.

So, go OpenSource! And read the Linux Journal article on this topic; Linux Journal, by the way, has been producing some great articles lately.

The Scientist’s Workflow and Software

You collect and manage data. You write code to process or analyze data. You use statistical tools to turn data into analytically meaningful numbers. You make graphs and charts. You write stuff and integrate the writing with the pretty pictures, and produce a final product.

The first thing you need to understand, if you are developing or enhancing the computer side of your scientific endeavor, is that you need the basic GNU tools and command line access that come automatically if you use Linux. You can get the same stuff with a few extra steps if you use Windows. The Apple Mac system is in between, with the command line tools already built in but not quite as in-your-face available.

You may need to have an understanding of Regular Expressions, and how to use them on the command line (using sed or awk, perhaps) and in programming, perhaps in Python.
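
To make that concrete, here is a minimal sketch of the kind of regular-expression work that comes up in practice: pulling the fields you care about out of messy, semi-structured text. The pattern and the example line are invented for illustration, not taken from any real project.

    import re

    # Hypothetical field note: "specimen AB-1234 weighed 56.7 g on 2018-06-01".
    # We want the specimen ID and the mass, and nothing else.
    pattern = re.compile(
        r"specimen\s+(?P<id>[A-Z]{2}-\d+)\s+weighed\s+(?P<grams>\d+(?:\.\d+)?)\s*g"
    )

    line = "specimen AB-1234 weighed 56.7 g on 2018-06-01"
    match = pattern.search(line)
    if match:
        print(match.group("id"), float(match.group("grams")))  # AB-1234 56.7

Equivalent patterns can be written for sed, awk, or R, though the syntax details differ.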

You will likely want to master the R environment because a) it is cool and powerful and b) a lot of your colleagues use R, so you will want to have enough under your belt to share code and data now and then. You will likely want to master Python, which is becoming the default scientific programming language. It is probably true that anything you can do in R you can do in Python using the available tools, but it is also true that the most basic statistical stuff you might be doing is easier in R than in Python, since R is set up for it. The two systems are relatively easy to use and very powerful, so there is no reason not to have both in your toolbox. If you don’t choose the Python route, you may want to supplement R with the GNU plotting tools.
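
For what it is worth, the most basic statistical work is not hard in Python either; it just needs a library. Here is a minimal sketch assuming SciPy is installed, with invented numbers:

    from scipy import stats

    # Two invented samples, e.g. a control group and a treatment group.
    control = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8]
    treatment = [4.6, 4.9, 4.5, 4.8, 4.7, 4.4]

    # Welch's t-test (does not assume equal variances).
    t, p = stats.ttest_ind(treatment, control, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.4f}")

In R the equivalent is a one-liner, t.test(treatment, control), which is roughly what I mean by R being set up for this.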

You will need some sort of relational database setup in your lab, some kind of OpenSource SQL-language-based system.
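
That can be as heavy as PostgreSQL or MySQL, but for a single lab it can also be as light as SQLite, which ships with Python. A minimal sketch, with an invented table and invented data:

    import sqlite3

    # SQLite is a serverless, OpenSource SQL engine; the whole database is one file.
    conn = sqlite3.connect("lab.db")
    cur = conn.cursor()

    cur.execute(
        "CREATE TABLE IF NOT EXISTS samples (id TEXT PRIMARY KEY, site TEXT, mass_g REAL)"
    )
    cur.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?)", ("AB-1234", "north plot", 56.7)
    )
    conn.commit()

    for row in cur.execute("SELECT id, mass_g FROM samples WHERE site = ?", ("north plot",)):
        print(row)

    conn.close()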

You will have to decide on your own if you are into LaTeX. If you have no idea what I’m talking about, don’t worry, you don’t need to know. If you do know what I’m talking about, you probably have the need to typeset math inside your publications.
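
If you do need it, the payoff looks something like this: a small, hypothetical fragment of a LaTeX document that typesets an equation that would be painful to produce in a word processor.

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    The logistic growth model we fit is
    \begin{equation}
      \frac{dN}{dt} = r N \left( 1 - \frac{N}{K} \right),
    \end{equation}
    where $N$ is population size, $r$ is the intrinsic growth rate, and $K$ is the carrying capacity.
    \end{document}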

Finally, and of utmost importance, you should be willing to spend the upfront effort making your scientific work flow into scripts. Say you have a machine (or a place on the internet or an email stream if you are working collaboratively) where some raw data spits out. These data need some preliminary messing around with to discard what you don’t want, convert numbers to a proper form, etc. etc. Then, this fixed-up data goes through a series of analyses, possibly several parallel streams of analysis, to produce a set of statistical outputs, tables, graphics, or a new highly transformed data set you send on to someone else.
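
As a concrete (and entirely made-up) illustration of that preliminary messing around, here is a short Python sketch that drops malformed rows and converts fields to proper numbers before anything downstream sees the data. The file names and column layout are assumptions for the example.

    import csv

    # Hypothetical raw file with columns: sample_id, date, mass_g.
    with open("raw_readings.csv", newline="") as raw, \
         open("clean_readings.csv", "w", newline="") as clean:
        reader = csv.DictReader(raw)
        writer = csv.DictWriter(clean, fieldnames=["sample_id", "date", "mass_g"])
        writer.writeheader()
        for row in reader:
            try:
                mass = float(row["mass_g"])  # convert numbers to a proper form
            except (ValueError, KeyError):
                continue  # discard rows you don't want
            if mass <= 0:  # e.g. sentinel or physically impossible values
                continue
            writer.writerow(
                {"sample_id": row["sample_id"].strip(),
                 "date": row["date"].strip(),
                 "mass_g": f"{mass:.3f}"}
            )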

If this is something you do on a regular basis, and it likely is because your lab or field project is set up to get certain data certain ways and then do certain things to it, then ideally you would set up a script, likely in bash but calling GNU tools like sed or awk, or running Python or R programs, and producing various intermediate files and final products along the way. You will want to bother with making the first run of these operations take three times longer to set up, so that all the subsequent runs take one one-hundredth of the time to carry out, or can be run unattended.
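
Bash is the traditional glue for this, but since Python keeps coming up anyway, here is a minimal sketch of the same idea as a Python driver script: each step is one command, the intermediate files are explicit, and the whole analysis reruns with a single invocation. Every script and file name here is a placeholder.

    import subprocess
    import sys

    # Hypothetical pipeline: clean the raw data, fit the models, draw the figures.
    steps = [
        ["python", "clean_data.py", "raw_readings.csv", "clean_readings.csv"],
        ["Rscript", "fit_models.R", "clean_readings.csv", "model_fits.csv"],
        ["python", "make_figures.py", "model_fits.csv", "figures/"],
    ]

    for cmd in steps:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            sys.exit("step failed: " + " ".join(cmd))  # stop early so a bad run is obvious

    print("pipeline finished")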

Nothing, of course, is so simple as I just suggested … you will be changing the scripts and Python programs (and LaTeX specs) frequently, perhaps. Or you might have one big giant complex operation that you only need to run once, but you KNOW it is going to screw up somehow … a value that is entered incorrectly or whatever … so the entire thing you need to do once is actually something you have to do 18 times. So make the whole process a script.

Aside from convenience and efficiency, a script does something else that is vitally important. It documents the process, both for you and for others. This alone is probably more important than the convenience part of scripting your science, in many cases.

Being small in a world of largeness

Here is a piece of advice you won’t get from anyone else. As you develop your computer working environment, the set of software tools and stuff that you use to run R or Python and all that, you will run into opportunities to install some pretty fancy and sophisticated development systems that have many cool bells and whistles, but that are really designed for team development of large software projects, and for continual maintenance over time of versions of that software as it evolves as a distributed project.

Don’t do that unless you need to. Scientific computing is often not that complex or team-oriented. Sure, you are working with a team, but probably not a team of a dozen people working on the same set of Python programs. Chances are, much of the code you write is going to be tweaked to be what you need it to be and then never changed. There are no marketing gurus coming along and asking you to make a different menu system to attract millennials. You are not competing with other products in a market of any sort. You will change your software when your machine breaks and you get a new one, and the new one produces output in a more convenient style than the old one. Or whatever.

In other words, if you are running an enterprise level operation, look into systems like Anaconda. If you are a handful of scientists making and controlling your own workflow, stick with the simple scripts and avoid the snake. The setup and maintenance of an enterprise level system for using R and Python is probably more work before you get your first t-test or histogram than it is worth. This is especially true if you are more or less working on your own.

Culture

Another piece of advice. Some software decisions are based on deeply rooted cultural norms, or fetishes, that make no sense. I’m an emacs user. This is the most annoying, but also the most powerful, of all text editors. Here is an example of what is annoying about emacs. In the late 70s, computer keyboards had a “meta” key (it was actually called that), which is now the alt key. Emacs made use of the meta key. No one has seen or used a meta key since about 1979, but emacs refuses to change its documentation to use the word “alt” for this key. Rather, the documentation says something like “here, use the meta key, which on some keyboards is the alt key.” That is a cultural fetish.

Using LaTeX might be a fetish as well. Obviously. It is possible that for some people, using R is a fetish and they should rethink and switch to using Python for what they are doing. The most dangerous fetish, of course, is using proprietary scientific software because you think that only if you pay hundreds of dollars a year to use SPSS or BMD for stats, as opposed to zero dollars a year for R, will your numbers be acceptable. In fact, the reverse is true. Only with an OpenSource stats package can you really be sure how the stats or other values are calculated.

And finally…

And my final piece of advice is to get and use this book: Computing Skills for Biologists: A Toolbox by Allesina and Wilmes.

This book focuses on Python rather than R, and it covers LaTeX, which, frankly, will not be useful for many. This focus also means that the regular expression work in the book is not as broadly applicable as what you would find in a dedicated volume like Mastering Regular Expressions. But overall, this volume does a great job of mapping out the landscape of scripting-oriented scientific computing, using excellent examples from biology.

Computing Skills for Biologists can and should be used as a textbook for an advanced high school level course to prep young and upcoming investigators for when they go off and apprentice in labs at the start of their careers. It can be used as a textbook in a short seminar in any advanced program to get everyone in a lab on the same page. I suppose it would be great if Princeton came out with a version for math and physical sciences, or geosciences, but really, this volume can be generalized beyond biology.

Stefano Allesina is a professor in the Department of Ecology and Evolution at the University of Chicago and a deputy editor of PLoS Computational Biology. Madlen Wilmes is a data scientist and web developer.



One thought on “How to do science with a computer: workflow tools and OpenSource philosophy”

  1. All good points. I’ll add that if you’re using Excel for any data analysis you’re doing it wrong — most of Excel’s statistics and probability functions have issues. The ASA had a policy statement in place several years ago that no serious statistical work should be done with it, and classes shouldn’t be taught using it. I don’t know about the open source spreadsheets, but I’d be surprised if they were any better.

    Two more points about R: using R inside RStudio is a great help. You can still do analyses from the command line in the console, but you have better control (IMO) over saving plots and other items. It’s a little easier to develop your own R functions and more in this setting. RMarkdown lets you generate reports in PDF, HTML, and Word format (I never use the last one, but it is possible), as well as presentations. R Notebooks contain all output as well as code, so interested readers can download the source once your work is posted and not only examine the code but run it and tweak it as they might desire. Unless you need a full-fledged stand-alone LaTeX installation, you can install TinyTeX and include LaTeX code in your work as desired.

    Finally, a comment defending Microsoft’s R (MRO, Microsoft R Open, and MRS, Microsoft R Server). Everything you can do with open source R can be done with it, and Microsoft R can be used with RStudio. Going the other way there are a few exceptions, but there are also some packages available for Microsoft R that are not available for the open source version. The primary advantage of the Microsoft versions comes in high performance implementations of many procedures (glm, clustering, and others), the ability to work with larger data sets, tight integration with SQL, and good support for multithreading (the latter is something open source R isn’t so good at).
