
How to do science with a computer: workflow tools and OpenSource philosophy

I have two excellent things on my desk: a Linux Journal article by Andy Wills, and a newly published book by Stefano Allesina and Madlen Wilmes.

They are:

Computing Skills for Biologists: A Toolbox by Stefano Allesina and Madlen Wilmes, Princeton University Press.

Open Science, Open Source, and R, by Andy Wills, Linux Journal

Why OpenSource?

OpenSource science means, among other things, using OpenSource software to do the science. For some aspects of software this is not important. It does not matter too much if a science lab uses Microsoft Word or LibreOffice Writer.

However, it does matter whether you use LibreOffice Calc or a proprietary spreadsheet. And as long as you are eschewing proprietary spreadsheets anyway, you might as well install the OpenSource office package LibreOffice (or an equivalent) and get the OpenSource presentation software and word processor along with the spreadsheet.

OpenSource programs like Calc and R (a stats package), and OpenSource-friendly software development tools like Python and the GNU (GPL-licensed) compilers, do matter. Why? Because your science involves calculating things, and software is a magic calculating box. You might be doing actual calculations, or producing graphics, or managing data, or whatever. All of the software that does this stuff is, on the surface, a black box, and just using it does not give you access to what is happening under the hood.

But, if you use OpenSource software, you have both direct and indirect access to the actual technologies that are key to your science project. You can see exactly how the numbers are calculated or the graphic created, if you want to. It might not be easy, but at least you don’t have to worry about the first hurdle in looking under the hood that comes with commercial software: they won’t let you do it.

Direct access to the inner workings of the software you use comes in the form of actually getting involved in the software development and maintenance. For most people, this is not something you are going to do in your scientific endeavor, but you could get involved with some help from a friend or colleague. For example, if you are at a University, there is a good chance that somewhere in your university system there is a computer department that has an involvement in OpenSource software development. See what they are up to, and find out what they know about the software you are using. Who knows, maybe you can get a special feature included in your favorite graphics package by helping your newfound computer friends cop an internal University grant! You might be surprised as to what is out there, as well as what is in there.

In any event, it is intentionally easy to get involved in OpenSource software projects, because they are designed that way. Or, at least, they usually are and always should be.

The indirect benefit comes from the simple fact that these projects are OpenSource. Let me give you an example from the non-scientific world. (It is a made-up example, but it could reflect reality and is highly instructive.)

Say there is an operating system or major piece of software competing in a field of other similar products. Say there is a widely used benchmark standard that compares the applications and ranks them. Some of the different products load up faster than others, and use less RAM. That leaves both time (for you) and RAM (for other applications) that you might value a great deal. All else being equal, pick the software that loads faster in less space, right?

Now imagine a group of trollish deviants meeting in a smoky back room of the evil corporation that makes one of these products. They have discovered that if they leave a dozen key features that all the competitors use out of the loading process, so that those features load later, they can get a better benchmark score. Without those standard components running, the software will load fast and be relatively small. It happens to be the case, however, that once all the features are loaded, this particular product is the slowest of them all, and takes up the most RAM. Also, the process of holding back functionality until it is needed is annoying to the user and sometimes causes memory conflicts, causing crashes.

In one version of this scenario, the concept of selling more of the product by using this performance tilting trick is considered a good idea, and someone might even get a promotion for thinking of it. That would be something that could potentially happen in the world of proprietary software.

In a different version of this scenario the idea gets about as far as the water cooler before it is taken down by a heavy tape dispenser to the head and kicked to death. That would be what would certainly happen in the OpenSource world.

So, go OpenSource! And read the article on this topic from Linux Journal, which, by the way, has been producing some great articles lately.

The Scientist’s Workflow and Software

You collect and manage data. You write code to process or analyze data. You use statistical tools to turn data into analytically meaningful numbers. You make graphs and charts. You write stuff and integrate the writing with the pretty pictures, and produce a final product.

The first thing you need to understand, if you are developing or enhancing the computer side of your scientific endeavor, is that you need the basic GNU tools and command line access that come automatically if you use Linux. You can get the same stuff with a few extra steps if you use Windows. The Apple Mac system is in between, with the command line tools already built in, but not quite as in-your-face available.

You may need to have an understanding of Regular Expressions, and how to use them on the command line (using sed or awk, perhaps) and in programming, perhaps in Python.
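
To make that concrete, here is a minimal sketch of a regular expression at work in Python. The line format and the pattern are invented for the example; the same idea carries over to sed or awk on the command line.

    import re

    # Hypothetical example: pull a sample ID and a measurement out of messy
    # instrument output such as "sample: A-103  temp=23.7C".
    line = "sample: A-103  temp=23.7C"

    pattern = re.compile(r"sample:\s*(?P<sample_id>[A-Z]-\d+)\s+temp=(?P<temp>\d+\.\d+)C")

    match = pattern.search(line)
    if match:
        print(match.group("sample_id"))    # -> A-103
        print(float(match.group("temp")))  # -> 23.7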

You will likely want to master the R environment because a) it is cool and powerful and b) a lot of your colleagues use R, so you will want to have enough under your belt to share code and data now and then. You will likely want to master Python, which is becoming the default scientific programming language. It is probably true that anything you can do in R you can do in Python using the available tools, but it is also true that the most basic statistical stuff you might be doing is easier in R than in Python, since R is set up for it. The two systems are relatively easy to use and very powerful, so there is no reason not to have both in your toolbox. If you don’t choose the Python route, you may want to supplement R with the GNU plotting tools.
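
To make the R-versus-Python comparison concrete, here is a minimal sketch of the kind of basic statistics in question, written in Python with SciPy (an assumption: SciPy is installed, and the data are made up). In R, the equivalent is a one-liner, t.test(control, treatment).

    from scipy import stats

    # Two made-up samples, e.g. measurements from a control and a treatment group.
    control = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8]
    treatment = [4.6, 4.9, 4.4, 4.8, 5.0, 4.7]

    # Welch's t-test (does not assume equal variances).
    result = stats.ttest_ind(control, treatment, equal_var=False)
    print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")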

You will need some sort of relational database setup in your lab, some kind of OpenSource, SQL-language-based system.
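
As a minimal sketch of what that can look like, here is SQLite used through Python’s built-in sqlite3 module. The database file name and table layout are invented for the example; the same SQL mostly carries over if you later move to a multi-user system such as PostgreSQL or MariaDB.

    import sqlite3

    # SQLite is a file-based, OpenSource SQL engine; Python ships with a driver.
    conn = sqlite3.connect("lab_data.db")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS measurements (
            sample_id TEXT,
            collected_on TEXT,
            value REAL
        )
    """)

    cur.execute(
        "INSERT INTO measurements (sample_id, collected_on, value) VALUES (?, ?, ?)",
        ("A-103", "2018-06-01", 23.7),
    )
    conn.commit()

    for row in cur.execute("SELECT sample_id, value FROM measurements"):
        print(row)

    conn.close()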

You will have to decide on your own if you are into LaTeX. If you have no idea what I’m talking about, don’t worry, you don’t need to know. If you do know what I’m talking about, you probably have the need to typeset math inside your publications.
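
For those who do know what I’m talking about, here is a tiny, generic example of the kind of thing LaTeX is for. The equations are standard textbook formulas for the sample mean and variance, included only to show the typesetting.

    \documentclass{article}
    \begin{document}

    The sample mean is
    \begin{equation}
      \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
    \end{equation}
    and the sample variance is
    \begin{equation}
      s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 .
    \end{equation}

    \end{document}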

Finally, and of utmost importance, you should be willing to spend the upfront effort making your scientific work flow into scripts. Say you have a machine (or a place on the internet or an email stream if you are working collaboratively) where some raw data spits out. These data need some preliminary messing around with to discard what you don’t want, convert numbers to a proper form, etc. etc. Then, this fixed-up data goes through a series of analyses, possibly several parallel streams of analysis, to produce a set of statistical outputs, tables, graphics, or a new highly transformed data set you send on to someone else.

If this is something you do on a regular basis, and it likely is because your lab or field project is set up to get certain data certain ways and then do certain things to it, then ideally you would set up a script, likely in bash but calling GNU tools like sed or awk, or running Python or R programs, making various intermediate files and final products along the way. You will want to bother with making the first run of these operations take three times longer to set up, so that all the subsequent runs take one hundredth of the time to carry out, or can be run unattended.
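
As an illustration only, here is a minimal sketch of such a pipeline script, written in Python rather than bash for concreteness. The file names, column layout, and cleaning rules are all made up; a real version would also call out to your sed/awk, R, or other steps as needed.

    #!/usr/bin/env python3
    """Toy data pipeline: raw file in, cleaned file and a small summary out."""
    import csv
    import statistics
    from pathlib import Path

    RAW = Path("raw_data.csv")      # e.g. what the machine spits out
    CLEAN = Path("clean_data.csv")  # intermediate product
    SUMMARY = Path("summary.txt")   # final product

    def clean_rows(path):
        """Drop rows with a missing value and convert the measurement to float."""
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                if not row.get("value"):
                    continue                        # discard what you don't want
                row["value"] = float(row["value"])  # convert to a proper form
                yield row

    def main():
        rows = list(clean_rows(RAW))

        # Write the cleaned intermediate file.
        with CLEAN.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

        # Produce a small summary as the "final product".
        values = [r["value"] for r in rows]
        SUMMARY.write_text(
            f"n = {len(values)}\n"
            f"mean = {statistics.mean(values):.3f}\n"
            f"sd = {statistics.stdev(values):.3f}\n"
        )

    if __name__ == "__main__":
        main()

Once something like this exists, re-running the whole analysis after fixing a data entry error is one command, not an afternoon.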

Nothing, of course, is so simple as I just suggested … you will be changing the scripts and Python programs (and LaTeX specs) frequently, perhaps. Or you might have one big giant complex operation that you only need to run once, but you KNOW it is going to screw up somehow … a value that is entered incorrectly or whatever … so the entire thing you need to do once is actually something you have to do 18 times. So make the whole process a script.

Aside from convenience and efficiency, a script does something else that is vitally important. It documents the process, both for you and for others. This alone is probably more important than the convenience part of scripting your science, in many cases.

Being small in a world of largeness

Here is a piece of advice you won’t get from anyone else. As you develop your computer working environment, the set of software tools and stuff that you use to run R or Python and all that, you will run into opportunities to install some pretty fancy and sophisticated development systems that have many cool bells and whistles, but that are really designed for team development of large software projects, and for continual maintenance, over time, of versions of that software as it evolves as a distributed project.

Don’t do that unless you need to. Scientific computing is often not that complex or team-oriented. Sure, you are working with a team, but probably not a team of a dozen people working on the same set of Python programs. Chances are, much of the code you write is going to be tweaked to be what you need it to be and then never changed. There are no marketing gurus coming along and asking you to make a different menu system to attract millennials. You are not competing with other products in a market of any sort. You will change your software when your machine breaks and you get a new one, and the new one produces output in a more convenient style than the old one. Or whatever.

In other words, if you are running an enterprise level operation, look into systems like Anaconda. If you are a handful of scientists making and controlling your own workflow, stick with the simple scripts and avoid the snake. The setup and maintenance of an enterprise level system for using R and Python is probably more work before you get your first t-test or histogram than it is worth. This is especially true if you are more or less working on your own.

Culture

Another piece of advice. Some software decisions are based on deeply rooted cultural norms or fetishes that make no sense. I’m an emacs user. This is the most annoying, but also most powerful, of all text editors. Here is an example of what is annoying about emacs. In the late 70s, computer keyboards had a “meta” key (it was actually called that), which is now the alt key. Emacs made use of the meta key. No person has seen or used a meta key since about 1979, but emacs refuses to change its documentation to use the word “alt” for this key. Rather, the documentation says something like “here, use the meta key, which on some keyboards is the alt key.” That is a cultural fetish.

Using LaTeX might be a fetish as well. Obviously. It is possible that for some people, using R is a fetish, and they should rethink and switch to using Python for what they are doing. The most dangerous fetish, of course, is using proprietary scientific software because you think your numbers will only be acceptable if you pay hundreds of dollars a year for SPSS or BMD for stats, as opposed to zero dollars a year for R. In fact, the reverse is true. Only with an OpenSource stats package can you really be sure how the stats or other values are calculated.

And finally…

And my final piece of advice is to get and use this book: Computing Skills for Biologists: A Toolbox by Allesina and Wilmes.

This book focuses on Python rather than R, and covers LaTeX, which, frankly, will not be useful for many. It also means that the regular expression coverage in the book is not as broadly applicable as that in a dedicated volume like Mastering Regular Expressions. But overall, this volume does a great job of mapping out the landscape of scripting-oriented scientific computing, using excellent examples from biology.

Computing Skills for Biologists can and should be used as a textbook for an advanced high-school-level course to prep young and upcoming investigators for when they go off and apprentice in labs at the start of their careers. It can be used as a textbook in a short seminar in any advanced program to get everyone in a lab on the same page. I suppose it would be a treat if Princeton came out with a version for the math and physical sciences, or the geosciences, but really, this volume can be generalized beyond biology.

Stefano Allesina is a professor in the Department of Ecology and Evolution at the University of Chicago and a deputy editor of PLoS Computational Biology. Madlen Wilmes is a data scientist and web developer.

Writing Software for Writers

This is especially for writers of big things. If you write small things, like blog posts or short articles, your best tool is probably a text editor you like and a way to handle markdown language. Chances are you use a word processor like MS Word or LibreOffice, and that is both overkill and problematic for other reasons, but if it floats your boat, happy sailing. But really, the simpler the better for basic writing and composition and file management. If you have an editor or publisher that requires that you only exchange documents in Word format, you can shoot your text file with markdown into a Word document format easily, or just copy and paste into your word processor and fiddle.
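
One hedged, concrete way to do that conversion is with the pandoc document converter, driven here from Python through the pypandoc wrapper. Neither tool is named above; they are simply one common option, pandoc must be installed separately, and the file names are made up.

    import pypandoc  # thin Python wrapper around the pandoc document converter

    # Convert a Markdown draft into a .docx file an editor can open in Word.
    pypandoc.convert_file("draft.md", "docx", outputfile="draft.docx")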

(And yes, a “text editor” and a “word processor” are not the same thing.)

But if you have larger documents, such as a book, to work on, then you may have additional problems that require somewhat heroic solutions. For example, you will need to manage sections of text in a large setting, moving things around, and leaving large undone sections, and finally settling on a format for headings, chapters, parts, sections, etc. after trying out various alternative structures.

You will want to do this effectively, without the necessary fiddling taking too much time, or ruining your project if something goes wrong. Try moving a dozen different sections around in an 80,000 word document file. Not easy. Or, if you divide your document into many small files, how do you keep them in order? There are ways, but most of the ways are clunky and some may be unreliable.

If you use Windows (I don’t) or a Mac (I do sometimes) then you should check out Scrivener. You may have heard about it before, and we have discussed it here. But you may not know that there is a new version and it has some cool features added to all the other cool features it already had.

The most important feature of Scrivener is that it has a tree that holds, as its branches, what amount to individual text files (with formatting and all, don’t worry about that) which you can freely move around. The tree can have multiple hierarchical levels, in case you want a large scale structure that is complex, like multiple books each with several parts containing multiple chapters each with one or more than one scene. No problem.

Imagine the best outlining program you’ve ever used. Now, improve it so it is better than that. Then blend it with an excellent word processing system so you can do all your writing in it.

Then, add features. There are all sorts of features that allow you to track things, like how far along the various chapters or sections are, or which chapters hold which subplots, etc. Color coding. Tags. Places to take notes. Metadata, metadata, metadata. A recent addition is a “linguistic focus,” which allows you to choose a particular construct, such as “nouns” or “verbs” or dialog (stuff in quotation marks), and make it all highlighted in a particular subdocument.

People will tell you that the index card and cork board feature is the coolest. It is cool, but I like the other stuff better, and rarely use the index cards on the cork board feature myself. But it is cool.

The only thing negative about all these features is that there are so many of them that there will be a period of distraction as you figure out which way to have fun using them.

Unfortunately for me, I like to work in Linux, and my main computer is, these days, a home built Linux box that blows the nearby iMac out of the water on speed and such. I still use the iMac to write, and I’ve stripped most of the other functionality away from that computer, to make that work better. So, when I’m using Scrivener, I’m not getting notices from twitter or Facebook or other distractions. But I’d love to have Scrivener on Linux.

If you are a Linux user and like Scrivener, let them know that you’d buy Scrivener for Linux if it were available! There was a beta version of Scrivener for Linux for a while, but it stopped being developed, then stopped being maintained, and now it is dead.

In an effort to have something like Scrivener on my Linux machine, I searched around for alternatives. I did not find THE answer, but I found some things of interest.

I looked at Kit Scenarist, but it is freemium, which I will not go near. I like OpenSource projects the best, but if they don’t exist and there is a reasonable paid alternative, I’ll pay (like Scrivener, which has a modest price tag and is worth it). Bibisco is an entirely web-based thing. I don’t want my writing on somebody’s web cloud.

yWriter looks interesting and you should look into it (here). It isn’t really available for Linux, but is said to work on Mono, which I take to be like Wine. So, I didn’t bother, but I’m noting it here in case you want to.

oStorybook is Java-based and violates a key rule I maintain: when software is installed on my computer, there has to be a way to start it up, like telling me the name of the software, or putting it on the menu or something. I think Java-based software is often like this. Anyway, I didn’t like its old-fashioned menus, and I’m not sure how well maintained it is.

Writers Cafe is fun to look at and could be perfect for some writers. It is like yWriter in that it is a set of solutions someone thought would be good. I tried several of the tools and found that some did not work so well. It costs money (but it is free to try) and isn’t quite up to it, in my opinion, but it is worth a look just to see for yourself.

Plume Creator is apparently loved by many, and is actually in many Linux distros. I played around with it for a while. I didn’t like the menu system (disappearing menus are not my thing at all) and the interface is a bit quirky and not intuitive. But I think it does have some good features and I recommend looking at it closely.

The best of the lot seems to be Manuskript. It is in beta form but seems to work well. It is essentially a Scrivener clone, more or less, and works in a similar way with many features. In terms of overall slickness and oomph, Manuskript is maybe one tenth or one fifth of Scrivener (in my subjective opinion), but it is heading in that direction. And, if your main goal is simply to have a hierarchy of scenes and chapters and such that you can move around in a word processor, then you are there. I don’t like the way the inline spell checker works, but it does exist and it does work. This software is good enough that I will use it for a project (already started), and I do have hope for it.

Using Scrivener on Linux the Other Way

There is, of course, a way to use Scrivener on Linux, if you have a Mac lying around, and I do this for some projects. Scrivener has a mode that allows for storing the sub-documents in your project as text files that you can access directly and edit with a text editor. If you keep these in Dropbox, you can use emacs (or whatever) on Linux to do your writing and such, and Scrivener on the Mac to organize the larger document. Sounds clunky, is dangerous, but it actually works pretty well for certain projects.


OpenOffice May Close The Door

The history of what we call “OpenOffice” is complex and confusing. It started as a project at Sun, to develop an office suite, for internal use, that was not Microsoft Office. Later, a version became more generally available, known as StarOffice, but a version called “OpenOffice” soon became available as well. The current histories say that StarOffice was commercial, but my memory is that it never cost money for regular users. I think the idea was that large corporations would pay and individuals would not. This was all back around 2000, plus or minus a year or two.

In any event, the OpenOffice project built two things of great importance. First, it made a set of software applications roughly comparable to the key elements of Microsoft’s Office suite, including a word processor, a spreadsheet, a presentation app, and, depending on the version, something that draws and something that relates to databases.

The second thing it did was to create and develop an important open source document format.

But, believe it or not, in the world of software development and programming, even in the happy fuzzy world of OpenSource, there can be fights. And not just the fun, tongue-in-cheek fights over which religion you belong to (vi vs. emacs). These fights often involve differences in point of view between the megacorporations that get involved in OpenSource projects and the unwashed masses of programmers contributing to such things. The majority of the code is written and maintained by corporations, much of it in the hands of a very small number of them, but the contributions from individuals not linked to corporations are extremely important.

In the case of OpenOffice, the tension between the broader OpenOffice-interested development community and the big corporations shifted in 2010, when Sun, which had always been involved in OpenOffice development, was purchased by Oracle Corporation. Oracle has not been friendly to OpenSource in the past, so the wider community freaked. There is a side plot here involving Java, which we will ignore. Oracle didn’t end up doing anything clearly bad to the OpenOffice project. But they also didn’t end up doing anything good, either, which is essentially a death sentence for a project like this. Later in the same year, an organization called The Document Foundation was created and took on the job of forking OpenOffice.

Forking is where a given lineage of software is split to create an alternative. Sometimes this is to take some software in a different direction, perhaps for a more specialized use. Sometimes it is a way of resolving conflict, much as hunter-gatherers undergo fission and fusion in their settlement patterns, by separating antagonists or putting a distinct wall between antagonistic goals. In this case, while the latter is probably part of it, the main reason for the fork, and its main effect, was to get the project under the control of an active development community so work could be continued before the project stagnated.

That fork became known as LibreOffice. For some time now, it has been recommended that if you are going to install an OpenSource office suite on your Windows, Linux, or Apple Computer, it should be LibreOffice.

One could argue that the OpenOffice suite or its analog (earlier, StarOffice; later, the LibreOffice fork) is the most important single project in OpenSource, because an office suite is a key part of almost all desktop computer configurations. Of course, most servers don’t need or require an office suite, and there, web servers and database servers, and a few other things, are more important. But to the average end user (in business or private life), being able to open up a “Word document” (a term misapplied to the whole category of word processor documents), or to run a spreadsheet, or to make a presentation, etc., is essential, and that is what an office suite provides. OpenOffice was comparable to Microsoft Office, and now LibreOffice is comparable to Microsoft Office. By some accounts, better, though many Microsoft Office users have, well, a different religion.

Now, it is being reported that the mostly ignored, maligned by some, historically important yet now out of date OpenOffice project is about to byte the dust. As it were.

Dennis Hamilton, VP of the group that runs OpenOffice, “… proposed a shutdown of OpenOffice as one option if the project could not meet the goals it had set. ‘My concern is that the project could end with a bang or a whimper. My interest is in seeing any retirement happen gracefully. That means we need to consider it as a contingency. For contingency plans, no time is a good time, but earlier is always better than later.'” [Source]

Approximately 160 million copies of LibreOffice have been downloaded to date. The closing of the OpenOffice project, should that happen, will probably have little effect on LibreOffice, since most people have already walked away from the venerable old but flawed granddaddy of OO suites.