Tag Archives: Technology

How to do science with a computer: workflow tools and OpenSource philosophy

I have two excellent things on my desk, a Linux Journal article by Andy Wills, and a newly published book by Stefano Allesina and Madlen Wilmes.

They are:

Computing Skills for Biologists: A Toolbox by Stefano Allesina and Madlen Wilmes, Princeton University Press.

Open Science, Open Source, and R, by Andy Wills, Linux Journal

Why OpenSource?

OpenSource science means, among other things, using OpenSource software to do the science. For some aspects of software this is not important. It does not matter too much if a science lab uses Microsoft Word or LibreOffice Writer.

However, since it does matter if you use LibreOffice Calc as your spreadsheet, as long as you are eschewing proprietary spreadsheets, you might as well use the OpenSource office package LibreOffice or equivalent, and then use the OpenSource presentation software, word processor, and spreadsheet.

OpenSource programs like Calc, R (a stats package), and OpenSource friendly software development tools like Python and the GPL C Compilers, etc. do matter. Why? Because your science involves calculating things, and software is a magic calculating box. You might be doing actual calculations, or production of graphics, or management of data, or whatever. All of the software that does this stuff is on the surface a black box, and just using it does not give you access to what is happening under the hood.

But, if you use OpenSource software, you have both direct and indirect access to the actual technologies that are key to your science project. You can see exactly how the numbers are calculated or the graphic created, if you want to. It might not be easy, but at least you don’t have to worry about the first hurdle in looking under the hood that happens with commercial software: they won’t let you do it.

Direct access to the inner workings of the software you use comes in the form of actually getting involved in the software development and maintenance. For most people, this is not something you are going to do in your scientific endeavor, but you could get involved with some help from a friend or colleague. For example, if you are at a University, there is a good chance that somewhere in your university system there is a computer department that has an involvement in OpenSource software development. See what they are up to, find out what they know about the software you are using. Who knows, maybe you can get a special feature included in your favorite graphics package by helping your new found computer friends cop an internal University grant! You might be surprised as to what is out there, as well as what is in there.

In any event, it is explicitly easy to get involved in OpenSource software projects because they are designed that way. Or, usually are and always should be.

The indirect benefit comes from the simple fact that these projects are OpenSource. Let me give you an example from the non-scientific world. (It is a made up example, but it could reflect reality and is highly instructive.)

Say there is an operating system or major piece of software competing in a field of other similar products. Say there is a widely used benchmark standard that compares the applications and ranks them. Some of the different products load up faster than others, and use less RAM. That leaves both time (for you) and RAM (for other applications) that you might value a great deal. All else being equal, pick the software that loads faster in less space, right?

Now imagine a group of trollish deviants meeting in a smoky back room of the evile corporation that makes one of these products. They have discovered that if they leave a dozen key features that all the competitors use out of the loading process, so they load later, they can get a better benchmark. Without those standard components running, the software will load fast and be relatively small. It happens to be the case, however, that once all the features are loaded, this particular product is the slowest of them all, and takes up the most RAM. Also, the process of holding back functionality until it is needed is annoying to the user and sometimes causes memory conflicts, causing crashes.

In one version of this scenario, the concept of selling more of the product by using this performance tilting trick is considered a good idea, and someone might even get a promotion for thinking of it. That would be something that could potentially happen in the world of proprietary software.

In a different version of this scenario the idea gets about as far as the water cooler before it is taken down by a heavy tape dispenser to the head and kicked to death. That would be what would certainly happen in the OpenSource world.

So, go OpenSource! And read the Linux Journal paper on this topic. Linux Journal, by the way, has been producing some great articles lately.

The Scientist’s Workflow and Software

You collect and manage data. You write code to process or analyze data. You use statistical tools to turn data into analytically meaningful numbers. You make graphs and charts. You write stuff and integrate the writing with the pretty pictures, and produce a final product.

The first thing you need to understand if you are developing or enhancing the computer side of your scientific endeavor is that you need the basic GNU tools and command line access that come automatically if you use Linux. You can get the same stuff with a few extra steps if you use Windows. The Apple Mac system is in between, with the command line tools already built in, but not quite as in-your-face available.

You may need to have an understanding of Regular Expressions, and how to use them on the command line (using sed or awk, perhaps) and in programming, perhaps in Python.
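To make that concrete, here is a minimal sketch of a regular expression doing a typical lab chore in Python: pulling a sample ID and a number out of messy text lines. The line format, the field names, and the function name are all made up for illustration; your instrument will spit out something different.

```python
import re

# Hypothetical raw lines; this format is an assumption, not a real instrument's output.
RAW_LINES = [
    "sample=A12 temp=21.5C",
    "sample=B07 temp=19.0C",
    "garbage line with no reading",
]

# Named groups: a sample ID (one capital letter plus digits) and a decimal temperature.
READING = re.compile(
    r"sample=(?P<sample>[A-Z]\d+)\s+temp=(?P<temp>-?\d+(?:\.\d+)?)C"
)


def parse_readings(lines):
    """Return (sample, temperature) pairs for matching lines; skip everything else."""
    readings = []
    for line in lines:
        match = READING.search(line)
        if match:
            readings.append((match.group("sample"), float(match.group("temp"))))
    return readings


if __name__ == "__main__":
    for sample, temp in parse_readings(RAW_LINES):
        print(sample, temp)
```

The same pattern idea carries over to sed or awk on the command line, though the exact syntax for groups differs a bit between tools.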

You will likely want to master the R environment because a) it is cool and powerful and b) a lot of your colleagues use R so you will want to have enough under your belt to share code and data now and then. You will likely want to master Python, which is becoming the default scientific programming language. It is probably true that anything you can do in R you can do in Python using the available tools, but it is also true that the most basic statistical stuff you might be doing is easier in R than Python since R is set up for it. The two systems are relatively easy to use and very powerful, so there is no reason to not have both in your toolbox. If you don’t choose the Python route, you may want to supplement R with GNU plotting tools.
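As a rough illustration of what "basic statistical stuff" looks like on the Python side, here is a sketch that computes Welch's t statistic for two small samples using only the standard library. The sample values and the function name are invented for the example, and it stops at the statistic (no p-value); in real work you would more likely reach for a stats library such as SciPy.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    # Made-up measurements, purely for illustration.
    control = [1.0, 2.0, 3.0]
    treated = [2.0, 3.0, 4.0]
    print(welch_t(control, treated))
```

In R the comparable step is a one-liner, which is the point made above: both tools get you there, but R is set up for this out of the box.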

You will need some sort of relational database setup in your lab, some kind of OpenSource SQL-based system.
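If you have never touched SQL, a small sketch may help. This one uses SQLite through Python's built-in sqlite3 module, with an in-memory database so nothing is written to disk. SQLite is just one possible choice, assumed here because it needs no server; the table name, columns, and values are invented for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up specimens table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE specimens ("
        "id INTEGER PRIMARY KEY, species TEXT NOT NULL, mass_g REAL)"
    )
    # Parameterized inserts keep data and SQL separate.
    conn.executemany(
        "INSERT INTO specimens (species, mass_g) VALUES (?, ?)",
        [("finch", 14.0), ("finch", 16.0), ("wren", 10.0)],
    )
    conn.commit()
    return conn


def mean_mass_by_species(conn):
    """Return (species, mean mass) rows, sorted by species name."""
    cursor = conn.execute(
        "SELECT species, AVG(mass_g) FROM specimens "
        "GROUP BY species ORDER BY species"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for species, mean_mass in mean_mass_by_species(connection):
        print(species, mean_mass)
    connection.close()
```

A lab with several people hitting the same data at once would probably want a server-based system instead, but the SQL itself looks much the same.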

You will have to decide on your own if you are into LaTeX. If you have no idea what I’m talking about, don’t worry, you don’t need to know. If you do know what I’m talking about, you probably have the need to typeset math inside your publications.

Finally, and of utmost importance, you should be willing to spend the upfront effort making your scientific work flow into scripts. Say you have a machine (or a place on the internet or an email stream if you are working collaboratively) where some raw data spits out. These data need some preliminary messing around with to discard what you don’t want, convert numbers to a proper form, etc. etc. Then, this fixed-up data goes through a series of analyses, possibly several parallel streams of analysis, to produce a set of statistical outputs, tables, graphics, or a new highly transformed data set you send on to someone else.

If this is something you do on a regular basis, and it likely is because your lab or field project is set up to get certain data certain ways, then do certain things to it, then ideally you would set up a script, likely in bash but calling gnu tools like sed or awk, or running Python programs or R programs, and making various intermediate files and final products and stuff. You will want to bother with making the first run of these operations take three times longer to set up, so that all the subsequent runs take one one hundredth of the time to carry out, or can be run unattended.
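Here is a toy version of that idea, written in Python rather than bash so it stays in one language. It is only a sketch: the raw data, the column names, and the three step names (clean, summarize, report) are all assumptions made for illustration. The point is the shape, with each stage a small function and one entry point that runs the whole chain.

```python
import csv
import io
import statistics

# Made-up raw data standing in for whatever your machine spits out.
RAW = """site,count
north,12
south,not recorded
north,18
south,7
"""


def clean(raw_text):
    """Step 1: keep only rows whose count parses as an integer."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            rows.append((row["site"], int(row["count"])))
        except ValueError:
            continue  # discard readings such as "not recorded"
    return rows


def summarize(rows):
    """Step 2: mean count per site, in alphabetical order of site."""
    by_site = {}
    for site, count in rows:
        by_site.setdefault(site, []).append(count)
    return {site: statistics.mean(counts) for site, counts in sorted(by_site.items())}


def report(summary):
    """Step 3: format the summary as a tab-separated table."""
    lines = ["site\tmean_count"]
    lines += [f"{site}\t{mean:.1f}" for site, mean in summary.items()]
    return "\n".join(lines)


def run_pipeline(raw_text):
    """Run every step in order; this is the one thing you call each time."""
    return report(summarize(clean(raw_text)))


if __name__ == "__main__":
    print(run_pipeline(RAW))
```

In a real setup the raw text would come from a file and the report would be written to one, and a short bash wrapper could call this alongside sed, awk, or R steps.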

Nothing, of course, is so simple as I just suggested … you will be changing the scripts and Python programs (and LaTeX specs) frequently, perhaps. Or you might have one big giant complex operation that you only need to run once, but you KNOW it is going to screw up somehow … a value that is entered incorrectly or whatever … so the entire thing you need to do once is actually something you have to do 18 times. So make the whole process a script.

Aside from convenience and efficiency, a script does something else that is vitally important. It documents the process, both for you and others. This alone is probably more important than the convenience part of scripting your science, in many cases.

Being small in a world of largeness

Here is a piece of advice you won’t get from anyone else. As you develop your computer working environment, the set of software tools and stuff that you use to run R or Python and all that, you will run into opportunities to install some pretty fancy and sophisticated development systems that have many cool bells and whistles, but that are really designed for team development of large software projects, and continual maintenance over time of versions of that software as it evolves as a distributed project.

Don’t do that unless you need to. Scientific computing is often not that complex or team oriented. Sure, you are working with a team, but probably not a team of a dozen people working on the same set of Python programs. Chances are, much of the code you write is going to be tweaked to be what you need it to be and then never change. There are no marketing gurus coming along and asking you to make a different menu system to attract millennials. You are not competing with other products in a market of any sort. You will change your software when your machine breaks and you get a new one, and the new one produces output in a more convenient style than the old one. Or whatever.

In other words, if you are running an enterprise level operation, look into systems like Anaconda. If you are a handful of scientists making and controlling your own workflow, stick with the simple scripts and avoid the snake. The setup and maintenance of an enterprise level system for using R and Python is probably more work before you get your first t-test or histogram than it is worth. This is especially true if you are more or less working on your own.

Culture

Another piece of advice. Some software decisions are based on deeply rooted cultural norms or fetishes that make no sense. I’m an emacs user. This is the most annoying, but also most powerful, of all text editors. Here is an example of what is annoying about emacs. In the late 70s, computer keyboards had a “meta” key (it was actually called that) which is now the alt key. Emacs made use of the meta key. No person has seen or used a meta key since about 1979, but emacs refuses to change its documentation to use the word “alt” for this key. Rather, the documentation says something like “here, use the meta key, which on some keyboards is the alt key.” That is a cultural fetish.

Using LaTeX might be a fetish as well. Obviously. It is possible that for some people, using R is a fetish and they should rethink and switch to using Python for what they are doing. The most dangerous fetish, of course, is using proprietary scientific software because you think only if you pay hundreds of dollars a year to use SPSS or BMD for stats, as opposed to zero dollars a year for R, will your numbers be acceptable. In fact, the reverse is true. Only with an OpenSource stats package can you really be sure how the stats or other values are calculated.

And finally…

And my final piece of advice is to get and use this book: Computing Skills for Biologists: A Toolbox by Allesina and Wilmes.

This book focuses on Python and not R, and covers LaTeX which, frankly, will not be useful for many. This also means that the regular expression work in the book is not as useful for all applications as might be the case with a volume like Mastering Regular Expressions. But overall, this volume does a great job of mapping out the landscape of scripting-oriented scientific computing, using excellent examples from biology.

Computing Skills for Biologists can and should be used as a textbook for an advanced high school level course to prep young and upcoming investigators for when they go off and apprentice in labs at the start of their career. It can be used as a textbook in a short seminar in any advanced program to get everyone in a lab on the same page. I suppose it would be a treat if Princeton came out with a version for math and physical sciences, or geosciences, but really, this volume can be generalized beyond biology.

Stefano Allesina is a professor in the Department of Ecology and Evolution at the University of Chicago and a deputy editor of PLoS Computational Biology. Madlen Wilmes is a data scientist and web developer.

Practical Binary Analysis: Book Review

A computer program is like a memo. Often, a vague memo.

You are the boss. You want a pile of files to be put away. You could do it yourself, but instead you instruct someone else to do it. There are a lot of them and they are all mixed up. So you write a memo to an employee that says “put the files away” and sis-bam-boom you’re all set.

Or are you?

Do Not Upgrade To The New Chrome! Yet.

The new Chrome browser by Google, Chrome 69, is probably an important improvement in browser functionality, look and feel, and security. But, as you might expect, the first version available for general users is buggy, perhaps very buggy. I would wait a little while for the bugs to get all hunted down and exterminated. How long? A week or two should do it.

What is new in the new Google Chrome 69 Browser?


I knew it, I saw this coming! (Microsoft-Linux)

Some time ago it dawned on me that a future Microsoft operating system, a version of Windows, would be based on Linux. It only makes sense. There is no better operating system to base a desktop, server, or other specialized OS on, for normal hardware. Eventually, this would dawn on Microsoft. I thought it might have a few years ago when Microsoft went from being openly aggressive against Linux and OpenSource, to being neutral, to being nice, and eventually contributing.

And now…

Girls With Dreams and Women With Cards

Natasha Ravinand is the founder of “She Dreams in Code,” a nonprofit focused on increasing opportunities for middle school girls to engage in coding. She is also the author of Girls With Dreams: Inspiring Girls to Code and Create in the New Generation. In this book, Natasha interviews several women in engineering and technology in order to assemble a compendium of inspiration for others like her, who want to engage in technology without the usual and common obstacles.

Natasha Ravinand is a junior at Northwood High School (Irvine, CA). She is considered to be one of the top high schoolers in the coding world. Hello world. @natasharavinand

Here are two facts you need to know. 1) Only 25% of the adults engaged in science and technology (STEM) are women. 2) This is a HUGE percentage compared to what it was only a few years ago. So, we are in a bad place, but also, we are moving quickly out of that place.

A New Robot For Littler Kids

The typical robot these days (such as the Makeblock DIY mBot and the Tomo) hooks up to an Android or iOS device via Bluetooth, and allows for programming using a Scratch-like programming language.

The smaller of the two kits, normally about $60 but under $50 last time I looked.

For somewhat younger kids, and for kids who do not happen to have a tablet they are allowed to use because they drool on it and stuff, there is an alternative that is on one hand a little harder to code but on the other hand more intuitive and very creative. I speak of the Botley Coding Robot, which comes in two styles: 1) Learning Resources Botley the Coding Robot Activity Set, 77 Pieces and 2) Learning Resources Botley the Coding Robot, 45 Pieces. (I tested the latter, but they are the same in the parts that matter.)

How to keep your kids out of trouble in this modern age

Do you worry that your kid is going to be rejected from civilization, or, at least, college or the boy scouts or something, because of dumb stuff they do online? Do you see evidence that your children are copying the jerky characters that grace our TV screens and movies, and are becoming too annoying, compared to how we all were when we grew up? Do you want to just tell the upcoming generation to GET OFF THE LAWN!!!!

Here is a way to do that.

Wait, don’t buy an Echo yet!

I had mentioned before that we are enjoying our Amazon Echo, one of those robots that listens and then responds with a certain degree of intelligence.

We don’t use the Echo for very many things, but that is partly because we are not in the habit. For example, if I’m sitting in a certain chair in the library, reading, I have to stand up and turn around and kind of bend over in a certain direction to see the clock on the wall. Or, I can say, “Alexa, what time is it?” and the Echo Dot tells me. But, I almost never think of asking Alexa. But over time I’m sure I’ll get in the habit, and after that, stop moving around as much. Which will ultimately lead to atrophy about the time the robots take over, which I assume is their plan.

I use Alexa’s shopping list, and we ask it questions one might ask Google Assistant (but Google Assistant is much more likely so far to come up with the answer). Alexa has a large number of useful information and entertainment services, which we are using more and more, such as getting a news update, the weather, and so on.

In any event, I recommend giving Alexa a try, and if you happen to have any Internet of Things devices, then you simply have to pepper your home with Dots and stop moving entirely.

But, the reason you don’t want to just go out and buy an Echo or related device at this moment is because Amazon just came out with a new line of them. Here is some basic information to help you get oriented. Then, if you pick the second generation Echo as your first Alexa device, go for it, otherwise, I might wait until the other devices are out for a few weeks to see how people like them.

If you want to cut to the chase, CLICK HERE to see a page at Amazon.com with the details, including a product grid to help you pick out which robot you want to have as your new overlord.

The Echo Dot (2nd Generation) is your basic entry level device. It has an adequate speaker but not really good enough for music, but it also has an output you can hook to your own speakers. Your first device should be this inexpensive dot. Then, later, if you want to upgrade to a fancier device, you can still use this one as a second device say, in your garage or bathroom or somewhere.

The Second Generation Echo is essentially the Echo Dot sitting on top of a high quality speaker, and runs about twice the cost of the Echo Dot.

The new Echo Plus includes a hub from which to run your smart home devices, has a somewhat better sound system than the Echo 2nd gen, and is slightly larger. This will cost you another fifty bucks, so now we are up to $150, but since it includes the hub it is probably worth it.

The new Echo Spot is Echo Dot size but with a screen, small at 2.5″, but possibly useful. This is not cheap ($129). The sound quality is probably better than the traditional Dot. It does not have the hub.

The top of the line is the Echo Show. This has top speakers, a 7 inch screen, and Bluetooth-only audio output (all the others have plug-in audio output).

All these devices can control smart home items, and allow free audio calls between people with Echos across North America. They all stream music, etc. using the services that you may or may not have such as Spotify, Pandora, Amazon Music, etc.

I’m not sure that I personally grok the combination of devices. Maybe I want a hub that is separate and inexpensive. Maybe I want a screen that is 7 inches or so to wall mount but it is only an output screen, but it can sit near my front door and tell me the weather, something about traffic, and if the garage door is open. I’ll have to think about it.

For now I’ll stick with my Dot, and keep playing around with homemade devices and robots until I see how it goes.

The big question for YOU is which device to get if this is your first one. I would recommend the Echo Dot then see how it goes, just to be conservative.

However, make sure you get a second generation Echo Dot, or Echo.

Also, Amazon is currently running a promotion where you can get the Echo Dot plus a Fire TV stick (which is roughly like a Roku, I believe) for about 90 bucks, which is cheap. And you can browse around for certified refurbished devices, which will typically save you ten or twenty percent. Not a huge savings, but they are certified.