- UNIX Tutorial
- SSH Linux/Unix commands
- SCP Linux/Unix commands
- basic linux commands from Google Code
- "Bioinformatics one-liners" - common bash commands
- Useful/Basic PBS Commands
- 'perf' examples for performance statistics
- Various Notes:
- "ps -eHj" shows hierarchy of processes (from this lesson)
Perl:
- Downloading Perl
- For Windows: Install using ActivePerl
- Perl usually comes pre-installed on Macs
- References
- perldoc
- definitions for common perl functions
- also contains some tutorials
- BioPerl
- contains modules for common bioinformatic analysis
- Beginning Perl for Bioinformatics
- good book to start learning Perl (it's what I used!)
- Exercises available for free download
- Mastering Perl for Bioinformatics
- Parallel Computing
- Stack Overflow discussion on limiting threads in Perl
- PerlMonks discussion on threads in Perl
- Semaphore example
- Stack Overflow discussion on using for-loop to add threads
- In a nutshell, I would say the steps are as follows:
- Load Dependencies
- use threads;
- use Thread::Semaphore;
- Create semaphore (with the maximum number of threads)
- our $semaphore = Thread::Semaphore->new($max_threads);
- Use a subroutine to create multiple threads.
- That should look something like this (for each sample/process):
- push @threads, threads->create(\&function_name, $func_var1, $func_var2, ...);
- After all threads are created: foreach (@threads) {$_->join;}
- Within the subroutine (called "function_name" above), control the number of ongoing threads as follows:
- Start function with $semaphore->down(); (occupying thread)
- End function with $semaphore->up(); (opening up thread)
Python:
- General Python Tutorial
- Google's Python Class
- Biopython
- contains modules for common bioinformatic analysis
R:
- Installing R
- Download Source Code or Binaries
- Download Windows R GUI
- Install Bioconductor Packages
- Rtools - necessary for building R packages on a Windows computer
- If Rtools doesn't initially work try:
- Making sure that your version of R and Rtools are compatible
- Adding R and Rtools to your PATH, e.g. PATH: C:\Program Files\R\R-2.15.1\bin\x64;C:\Rtools\bin;C:\Rtools\gcc-4.6.3\bin
- Parallel Computing
- Recent versions of R already include the "parallel" library (but you still have to run library("parallel") to get access to those functions)
- You'll most likely be interested in using mclapply (parallelization doesn't work on Windows, but mclapply knows to fall back to 1 core on Windows computers)
- Note this is a parallel version of lapply, so it can easily be applied to columns of a data frame, but you'll need to transpose the table (with row names) to apply functions across rows. Please see this blog for an example.
- mclapply with one core is slower than apply, and it can take up a lot of memory if the data frame has a large number of columns (>50,000 columns, for example). So, using mclapply is not always beneficial.
- References
Docker:
- Understanding Docker (high-level introduction)
- Docker User Guide
- My Notes (read tutorials first):
- To mount Windows Documents folder, docker run -it -v /c/Users/[your username]/Documents:/mnt/[mounted name] [image]
- If you need to re-enter an exited session, you can use docker start -ia container_ID to re-open it (note use of container ID instead of image ID)
- docker ps -a to see exited interactive jobs
- If you host your images on Docker Hub, try to keep them under 3 GB
- To upload (after running an interactive session):
- docker commit -m "update message" container_ID [image]
- docker push [image]
- Using Docker Through Singularity:
- General Tutorial - for example, you can run interactive mode with singularity shell docker://user-name/repository
- Mapping Folders - for example, as an extension of the above example, you can run a Docker image in interactive mode with a mapped drive using singularity shell -B /source/path:/docker/path docker://user-name/repository
- NOTE: this may not work for all Docker images (with errors not apparent until you try to run programs within the container), but I think it should work for some of them.
- C and R:
- .C Interface Function
- scroll down for information more specific to C++
- Including C++ Code in a Bioconductor Package
- My Notes
- For g++ compiler, binary output is created with "-o"
- You can use "-g" option for debugging and "-Wall" for warning messages, but you'll still get error messages either way
- If mixing your code with open-source code, take the compiler into consideration. For example, some string functions work when compiling with gcc but not with g++.
- Ubuntu .iso
- Mounting shared folders
- When you first open Virtual Box, choose settings for your image and define folder (under "Shared Folders")
- To make that folder accessible, go to "Devices --> Insert Guest Additions CD image"
- Probably should restart machine
- Your folder should appear under /media/sf_[folder name]
- However, you may still not have access to the contents of that folder. To fix the permissions issue, run sudo mount -t vboxsf [folder name] /media/sf_[folder name]
- This might not be sufficient to have the folder load every time you start the VM. If you run into issues, try sudo usermod -G vboxsf -a [username] after re-mounting the folder
- If you find yourself in a situation where Ubuntu won't load from a locked installation file, you can fix this by pressing "Left-Shift" before Ubuntu starts to load (and then using the GRUB menu to repair the installation file). I don't think this is unique to the VM environment, but that is where I saw it could work.
Free Data / Code Sharing:
- GitHub (up to 1 GB per repository, 100 MB per file)
- SourceForge (honor system or 5 GB?)
- FigShare (up to 5 GB)
- Dryad (up to 20 GB)
- Zenodo (up to 50 GB)
- Has versioning (although it took me a little while to realize this)
- However, this doesn't seem quite as flexible as GitHub. For example, the upload comes with a warning that "File addition, removal or modification are not allowed after you have published your upload".
Setting up Ubuntu Server
- I am not sure why, but I had better installation success using the "alternative" installation files (as suggested as a solution in this forum)
- A little early to say how much of a "success" everything is. However, at least for the installation step, this is the "No OS" computer that I am trying to set up as a server: Dell Server
- Restart server via command line using `sudo reboot` or `sudo poweroff`
- Tutorials to set up SSH key: here and here
- I found the instructions a bit confusing. However, to require an SSH key in addition to the password (essentially two credentials), add "AuthenticationMethods publickey,password" to the /etc/ssh/sshd_config file and then restart the service with "sudo service ssh restart" (as described here)
- Reformat SSH keys to use with PuTTY
- Using SSH keys with WinSCP
- Even though I provided a password, I still needed to enter the ssh passphrase as well as the server password (the way that I set things up)
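The sshd_config change above can be sketched as follows. This operates on a scratch copy so it runs without root; on a real server, you would edit /etc/ssh/sshd_config itself (with sudo) and then restart the service.

```shell
# Work on a scratch copy so this can be run without root;
# on a real server, edit /etc/ssh/sshd_config directly (with sudo).
cp /etc/ssh/sshd_config sshd_config.test 2>/dev/null || touch sshd_config.test

# Require BOTH an SSH key and the account password for logins
echo "AuthenticationMethods publickey,password" >> sshd_config.test

grep "AuthenticationMethods" sshd_config.test
```

After editing the real file, apply the change with `sudo service ssh restart`.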
- Mounting an additional hard drive
- Ask Ubuntu discussion
- Using "lsblk", I was able to see my 2nd hard drive (even though it wasn't accessible for storage yet)
- For newer and larger drives, I think the answer whose first step is to run "sudo blkid" may be the most relevant.
- Ubuntu community help
- Even without being mounted, I could see information about my 2nd hard drive (which was /dev/sdb), using "sudo lshw -C disk"
- My 2nd hard drive was 3 TB, and both discussions mention that special steps need to be taken for more than 2 TB of space (specifically, fdisk should not be used to create an MBR partition >2 TB)
- For a new external hard drive, "parted" is recommended to reformat the drive.
- For each partition, I think "sudo mkfs.ext4 /dev/sd[x][n]" should work. However, that should be run on partitions like /dev/sdc1, not the full drive like /dev/sdc.
- There is some information about command line formatting options here.
- I also thought this YouTube video provided some general background, but it doesn't really provide as much Linux-specific information (if reformatting for primary use on Ubuntu, which I think would probably be ext4).
- You probably don't want to have to use "sudo" for all commands within the mounted drive.
- This discussion relates to that issue.
- This also relates to the configuration in the /etc/fstab file for loading mounted drives. There is a recommended set of settings in the Ubuntu guide for Systemwide Mounts (although I am using ext4 instead of vfat).
- Also, I might need to change things in the future, but I needed to use "defaults" instead of the provided options in order to get "sudo mount -a" to correctly load the drive (after editing the /etc/fstab file).
- I think checking for the presence of the "lost+found" subfolder is another way to see if the mounting was successful.
- Most of this is also discussed in the first Ubuntu community help link that started this section.
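The /etc/fstab step above can be sketched as below. The UUID and mount point are hypothetical placeholders (on a real system, get the UUID from `sudo blkid`), and the entry is written to a sample file rather than /etc/fstab itself.

```shell
# Hypothetical values; on a real system, get the UUID from `sudo blkid`
# and create the mount point with `sudo mkdir /media/data`.
UUID="0a1b2c3d-1111-2222-3333-444455556666"
MOUNT_POINT="/media/data"

# As noted above, plain "defaults" worked for an ext4 drive
# (the vfat-style options from the Ubuntu guide did not).
echo "UUID=${UUID} ${MOUNT_POINT} ext4 defaults 0 2" >> fstab.sample

cat fstab.sample
```

On the real system, this line would go at the end of /etc/fstab, followed by `sudo mount -a` to check that the drive loads.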
- Information about RAID drives (which is what I had at one point)
- Ubuntu community documentation
- Advanced community documentation
- Ask Ubuntu documentation
- General RAID storage overview from MUO
- This tutorial on RAID arrays describes how to check the status of the RAID array using "/proc/mdstat" (for example, I could tell that I needed to do something with my second drive because it was listed as "inactive"). Likewise, I could use this file to tell my 1st drive was set up as "raid1" (for mirroring).
- Ubuntu community RAID1+LVM (for mirroring) documentation
- I also created this discussion
- Setting up a static IP
- With the newest version of Ubuntu server, I think this probably uses "netplan" to create a static IP
- I forget exactly what I used at first, but I used this to help me access external servers (name servers map names to IP addresses, and you must list name servers to do things like update programs, clone git repositories, etc.)
- The subnet mask also confused me, but I think you probably want "24": the 255.0.0.0, 255.255.0.0, and 255.255.255.0 masks (described on this page) correspond to 8, 16, and 24, respectively.
- There is also a more formal website for netplan here.
- On Windows, you can list IP addresses on your network using arp -a.
- In general, there is some free information on Linux Journey.
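As a sketch, a netplan static-IP configuration might look like the following. The file name, interface name, and all addresses are assumptions to adjust for your own network:

```yaml
# /etc/netplan/01-static.yaml (hypothetical file name)
network:
  version: 2
  ethernets:
    eth0:                     # replace with your interface name (see `ip link`)
      addresses:
        - 192.168.1.10/24     # "/24" corresponds to the 255.255.255.0 mask
      routes:
        - to: default
          via: 192.168.1.1    # your router/gateway
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]   # needed for name resolution
```

Apply the configuration with `sudo netplan apply`.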
- Vi Text Editor
- Notepad++ Editor
- With default settings, if you write code in Notepad++ and then run it on a Linux system, it can be helpful to run `dos2unix` on your code (to convert Windows CRLF line endings to Unix LF)
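To illustrate the line-ending issue (using `sed` as a fallback in case `dos2unix` is not installed):

```shell
# Create a small script with Windows (CRLF) line endings,
# as Notepad++ would produce with default settings.
printf 'echo hello\r\n' > script_crlf.sh

# `dos2unix script_crlf.sh` would fix this in place;
# stripping the trailing carriage return with sed also works:
sed -e 's/\r$//' script_crlf.sh > script_unix.sh

# The converted file should contain no carriage returns
od -c script_unix.sh
```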
- Ubuntu Notepad++ Alternatives (I recommend gedit)
- Basic MS-DOS tutorial
- LaTeX tutorial
- MiKTeX - Windows software for processing TeX/LaTeX files; also useful for compiling R packages
- MacTeX - Mac software for processing TeX/LaTeX files
- Subversion high-speed tutorial
- Using subversion for Bioconductor packages
- Google Code University
- Git Bioconductor Tricks
- For managing GitHub repository and Bioconductor Repository: http://bioconductor.org/developers/how-to/git/sync-existing-repositories/
- You can confirm that the upstream repository has been added with "git remote -v"
- You may need an SSH key passphrase to run "git clone git@git.bioconductor.org:packages/[PACKAGE].git", but other users can clone the repository with "git clone https://git.bioconductor.org/packages/[PACKAGE]"
- If you prefer working with the GitHub interface ("origin" in the instructions above), you can indirectly update the Bioconductor repository as follows (except when Bioconductor changes a file, such as the DESCRIPTION file in new releases):
- git clone https://github.com/[username]/[package]
- cd [package folder]
- git remote add upstream git@git.bioconductor.org:packages/[package].git
- If needing to update release branch, please see Tutorial for fixing bugs
- If already synced (and you have checked out the appropriate release), you can also update the branch with "git push" and "git push upstream"
- git add [updated files]
- git commit -m "update message"
- git push upstream master
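The origin/upstream workflow above can be sketched end-to-end with local bare repositories standing in for GitHub ("origin") and git.bioconductor.org ("upstream"); all paths and names below are illustrative placeholders for the real URLs.

```shell
# Local bare repos stand in for GitHub ("origin") and Bioconductor ("upstream")
git init --bare github_stub.git
git init --bare bioc_stub.git

git clone github_stub.git pkg
cd pkg
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git remote add upstream ../bioc_stub.git

# Confirm both remotes are set up, as with `git remote -v` in the notes
git remote -v

# Commit a change and push it to both remotes
echo "Package: example" > DESCRIPTION
git add DESCRIPTION
git commit -m "update message"
git push origin HEAD:master
git push upstream HEAD:master
cd ..
```

In practice, "origin" would be https://github.com/[username]/[package] and "upstream" would be git@git.bioconductor.org:packages/[package].git, as in the notes above.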
- Amazon AWS (cloud computing)
- Even though I still have some free Google Cloud credits, I encountered an issue with a newer gcsfuse interface, such that I thought it might be easier to go back to AWS (or purchase a Linux server for my apartment)
- So, here are some general notes:
- I would recommend using PuTTY to connect to your EC2 instances
- S3 storage and EFS storage are different (I would use S3 for sharing large datasets, and EFS for mounting internally shared data between EC2 instances)
- Amazon provides a way to make EFS mounting easier using amazon-efs-utils, with the two commands:
- sudo yum install -y amazon-efs-utils (installation)
- sudo mount -t efs [file system ID]:/ /path/to/efs (mounting the EFS storage)
- You can also see similar instructions when you view the full information about the file system that you created.
- aws Command Line Interface (CLI) - includes commands to work with S3 storage, and it is already installed on EC2 instances (but I noticed a command to transfer data from S3 to EFS/EC2 didn't work exactly as planned)
- Instead, if you are on Windows, I would recommend WinSCP to transfer data from your local computer to an EC2 instance (and, in turn, the EFS mounted storage)