- UNIX Tutorial
- SSH Linux/Unix commands
- SCP Linux/Unix commands
- basic linux commands from Google Code
- "Bioinformatics one-liners" - common bash commands
- Useful/Basic PBS Commands
- 'perf' examples for performance statistics
- Various Notes:
- "ps -eHj" shows hierarchy of processes (from this lesson)
Perl:
- Downloading Perl
- For Windows: Install using ActivePerl
- Perl usually comes pre-installed on Macs
- References
- perldoc
- definitions for common perl functions
- also contains some tutorials
- BioPerl
- contains modules for common bioinformatic analysis
- Beginning Perl for Bioinformatics
- good book to start learning Perl (it's what I used!)
- Exercises available for free download
- Mastering Perl for Bioinformatics
- Parallel Computing
- Stack Overflow discussion on limiting threads in Perl
- PerlMonks discussion on threads in Perl
- Semaphore example
- Stack Overflow discussion on using for-loop to add threads
- In a nutshell, I would say the steps are as follows:
- Load Dependencies
- use threads;
- use Thread::Semaphore;
- Create semaphore (with the maximum number of threads)
- our $semaphore = Thread::Semaphore->new($max_threads);
- Use a subroutine to create multiple threads.
- That should look something like this (for each sample/process):
- push @threads, threads->create(\&function_name, $func_var1, $func_var2, ...);
- After all threads are created: foreach (@threads) {$_->join;}
- Within the subroutine (called "function_name" above), control the number of ongoing threads as follows:
- Start function with $semaphore->down(); (occupying thread)
- End function with $semaphore->up(); (opening up thread)
Python:
- General Python Tutorial
- Google's Python Class
- Biopython
- contains modules for common bioinformatic analysis
R:
- Installing R
- Download Source Code or Binaries
- Download Windows R GUI
- Install Bioconductor Packages
- Rtools - necessary for building R packages on a Windows computer
- If Rtools doesn't initially work try:
- Making sure that your version of R and Rtools are compatible
- Adding R and Rtools to your PATH, e.g. PATH: C:\Program Files\R\R-2.15.1\bin\x64;C:\Rtools\bin;C:\Rtools\gcc-4.6.3\bin
- Parallel Computing
- Recent versions of R already include the "parallel" library (but you still have to run library("parallel") to get access to those functions)
- You'll most likely be interested in using mclapply (parallelization doesn't work on Windows, but mclapply knows to fall back to 1 core on Windows computers)
- Note this is a parallel version of lapply, so it can easily be applied to columns of a data frame, but you'll need to transpose the table (with row names) to apply functions across rows. Please see this blog for an example.
- mclapply with one core is slower than apply, and it can take up a lot of memory if the data frame has a large number of columns (>50,000 columns, for example). So, using mclapply is not always beneficial.
- References
Docker:
- Understanding Docker (high-level introduction)
- Docker User Guide
- My Notes (read tutorials first):
- To mount Windows Documents folder, docker run -it -v /c/Users/[your username]/Documents:/mnt/[mounted name] [image]
- If you need to re-enter an exited session, you can use docker start -ia container_ID to re-open it (note use of container ID instead of image ID)
- docker ps -a to see exited interactive jobs
- If you host your images on Docker Hub, try to keep them under 3 GB
- To upload (after running an interactive session):
- docker commit -m "update message" container_ID [image]
- docker push [image]
- Using Docker Through Singularity:
- General Tutorial - for example, you can run interactive mode with singularity shell docker://user-name/repository
- Mapping Folders - for example, as an extension of the above example, you can run a Docker image in interactive mode with a mapped drive using singularity shell -B /source/path:/docker/path docker://user-name/repository
- NOTE: this may not work for all Docker images (with errors not apparent until you try to run programs within the container), but I think it should work for some of them.
- C and R:
- .C Interface Function
- scroll down for information more specific to C++
- Including C++ Code in a Bioconductor Package
- My Notes
- For g++ compiler, binary output is created with "-o"
- You can use "-g" option for debugging and "-Wall" for warning messages, but you'll still get error messages either way
- If mixing your code with open-source code, take the compiler into consideration. For example, some string functions work when compiling with gcc but not with g++.
- Ubuntu .iso
- Mounting shared folders
- When you first open Virtual Box, choose settings for your image and define folder (under "Shared Folders")
- To make that folder accessible, go to "Devices --> Insert Guest Additions CD image"
- Probably should restart machine
- Your folder should appear under /media/sf_[folder name]
- However, you may still not have access to the contents of that folder. To fix the permissions issue, run sudo mount -t vboxsf [folder name] /media/sf_[folder name]
- This might not be sufficient to have the folder load every time you start the VM. If you run into issues, try sudo usermod -G vboxsf -a [username] after re-mounting the folder
- If you find yourself in a situation where Ubuntu won't load from a locked installation file, you can fix this by pressing "Left-Shift" before Ubuntu starts to load (and then using the GRUB menu to repair the installation file). I don't think this is unique to the VM environment, but that is where I saw it could work.
Free Data / Code Sharing:
- GitHub (up to 1 GB per repository, 100 MB per file)
- SourceForge (honor system or 5 GB?)
- FigShare (up to 5 GB)
- Dryad (up to 20 GB)
- Zenodo (up to 50 GB)
- Has versioning (although it took me a little while to realize this)
- However, this doesn't seem quite as flexible as GitHub. For example, the upload comes with a warning that "File addition, removal or modification are not allowed after you have published your upload".
Setting up Ubuntu Server
- I am not sure why, but I had better installation success using the "alternative" installation files (as suggested as a solution in this forum)
- A little early to say how much of a "success" everything is. However, at least for the installation step, this is the "No OS" computer that I am trying to set up as a server: Dell Server
- Restart server via command line using `sudo reboot` or `sudo poweroff`
- Tutorials to set up SSH key: here and here
- I found the instructions a bit confusing. However, to require an SSH key in addition to the password (essentially two credentials), add "AuthenticationMethods publickey,password" to the /etc/ssh/sshd_config file and then restart the service with "sudo service ssh restart" (as described here)
- Reformat SSH keys to use with PuTTY
- Using SSH keys with WinSCP
- Even though I provided a password, I still needed to enter the ssh passphrase as well as the server password (the way that I set things up)
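The sshd_config change above can be sketched as follows. This operates on a scratch copy so it runs without root; on a real server, you would edit /etc/ssh/sshd_config itself (with sudo) and then restart the service.

```shell
# Work on a scratch copy so this can be run without root;
# on a real server, edit /etc/ssh/sshd_config directly (with sudo).
cp /etc/ssh/sshd_config sshd_config.test 2>/dev/null || touch sshd_config.test

# Require BOTH an SSH key and the account password for logins
echo "AuthenticationMethods publickey,password" >> sshd_config.test

grep "AuthenticationMethods" sshd_config.test
```

After editing the real file, apply the change with `sudo service ssh restart`.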
- Mounting an additional hard drive
- Ask Ubuntu discussion
- Using "lsblk", I was able to see my 2nd hard drive (even though it wasn't accessible for storage yet)
- For newer and larger drives, I think the answer whose first step is to run "sudo blkid" may be the most relevant.
- Ubuntu community help
- Even without being mounted, I could see information about my 2nd hard drive (which was /dev/sdb), using "sudo lshw -C disk"
- My 2nd hard drive was 3 TB, and both discussions mention that special steps need to be taken for more than 2 TB of space (specifically, fdisk should not be used to create an MBR partition >2 TB)
- For a new external hard drive, "parted" is recommended to reformat the drive.
- For each partition, I think "sudo mkfs.ext4 /dev/sd[x][n]" should work. However, that should be run on partitions like /dev/sdc1, not the full drive like /dev/sdc.
- There is some information about command line formatting options here.
- I also thought this YouTube video provided some general background, but it doesn't really provide as much Linux-specific information (if reformatting for primary use on Ubuntu, which I think would probably be ext4).
- You probably don't want to have to use "sudo" for all commands within the mounted drive.
- This discussion relates to that issue.
- This also relates to the configuration in the /etc/fstab file for loading mounted drives. There is a recommended set of settings in the Ubuntu guide for Systemwide Mounts (although I am using ext4 instead of vfat).
- Also, I might need to change things in the future, but I needed to use "defaults" instead of the provided options in order to get "sudo mount -a" to correctly load the drive (after editing the /etc/fstab file).
- I think checking for the presence of the "lost+found" subfolder is another way to see if the mounting was successful.
- Most of this is also discussed in the first Ubuntu community help link that started this section.
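The /etc/fstab step above can be sketched as below. The UUID and mount point are hypothetical placeholders (on a real system, get the UUID from `sudo blkid`), and the entry is written to a sample file rather than /etc/fstab itself.

```shell
# Hypothetical values; on a real system, get the UUID from `sudo blkid`
# and create the mount point with `sudo mkdir /media/data`.
UUID="0a1b2c3d-1111-2222-3333-444455556666"
MOUNT_POINT="/media/data"

# As noted above, plain "defaults" worked for an ext4 drive
# (the vfat-style options from the Ubuntu guide did not).
echo "UUID=${UUID} ${MOUNT_POINT} ext4 defaults 0 2" >> fstab.sample

cat fstab.sample
```

On the real system, this line would go at the end of /etc/fstab, followed by `sudo mount -a` to check that the drive loads.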
- Information about RAID drives (which is what I had at one point)
- Ubuntu community documentation
- Advanced community documentation
- Ask Ubuntu documentation
- General RAID storage overview from MUO
- This tutorial on RAID arrays describes how to check the status of the RAID array using "/proc/mdstat" (for example, I could tell that I needed to do something with my second drive because it was listed as "inactive"). Likewise, I could use this file to tell my 1st drive was set up as "raid1" (for mirroring).
- Ubuntu community RAID1+LVM (for mirroring) documentation
- I also created this discussion
- Setting up a static IP
- With the newest version of Ubuntu server, I think this probably uses "netplan" to create a static IP
- I forget exactly what I used at first, but I used this to help me access external servers (name servers map names to IP addresses, and you must list name servers to do things like update programs, clone git repositories, etc.)
- The subnet mask also confused me, but I think you probably want "24": the 255.0.0.0, 255.255.0.0, and 255.255.255.0 masks (described on this page) correspond to 8, 16, and 24, respectively.
- There is also a more formal website for netplan here.
- On Windows, you can list IP addresses on your network using arp -a.
- In general, there is some free information on Linux Journey.
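As a sketch, a netplan static-IP configuration might look like the following. The file name, interface name, and all addresses are assumptions to adjust for your own network:

```yaml
# /etc/netplan/01-static.yaml (hypothetical file name)
network:
  version: 2
  ethernets:
    eth0:                     # replace with your interface name (see `ip link`)
      addresses:
        - 192.168.1.10/24     # "/24" corresponds to the 255.255.255.0 mask
      routes:
        - to: default
          via: 192.168.1.1    # your router/gateway
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]   # needed for name resolution
```

Apply the configuration with `sudo netplan apply`.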
- Vi Text Editor
- Notepad++ Editor
- With default settings, if you write code in Notepad++ and then run it on a Linux system, it can be helpful to run `dos2unix` on your code (to convert Windows CRLF line endings to Unix LF)
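To illustrate the line-ending issue (using `sed` as a fallback in case `dos2unix` is not installed):

```shell
# Create a small script with Windows (CRLF) line endings,
# as Notepad++ would produce with default settings.
printf 'echo hello\r\n' > script_crlf.sh

# `dos2unix script_crlf.sh` would fix this in place;
# stripping the trailing carriage return with sed also works:
sed -e 's/\r$//' script_crlf.sh > script_unix.sh

# The converted file should contain no carriage returns
od -c script_unix.sh
```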
- Ubuntu Notepad++ Alternatives (I recommend gedit)
- Basic MS-DOS tutorial
- LaTeX tutorial
- MiKTeX - Windows software for processing TeX/LaTeX files; also useful for compiling R packages
- MacTeX - Mac software for processing TeX/LaTeX files
- Subversion high-speed tutorial
- Using subversion for Bioconductor packages
- Google Code University
- Git Bioconductor Tricks
- For managing GitHub repository and Bioconductor Repository: http://bioconductor.org/developers/how-to/git/sync-existing-repositories/
- You can confirm that the upstream repository has been added with "git remote -v"
- You may need an SSH key passphrase to run "git clone git@git.bioconductor.org:packages/[PACKAGE].git", but other users can clone the repository with "git clone https://git.bioconductor.org/packages/[PACKAGE]"
- If you prefer working with the GitHub interface ("origin" in the instructions above), you can indirectly update the Bioconductor repository as follows (except when Bioconductor changes a file, such as the DESCRIPTION file in new releases):
- git clone https://github.com/[username]/[package]
- cd [package folder]
- git remote add upstream git@git.bioconductor.org:packages/[package].git
- If needing to update release branch, please see Tutorial for fixing bugs
- If already synced (and you have checked out the appropriate release), you can also update the branch with "git push" and "git push upstream"
- git add [updated files]
- git commit -m "update message"
- git push upstream master
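The origin/upstream workflow above can be sketched end-to-end with local bare repositories standing in for GitHub ("origin") and git.bioconductor.org ("upstream"); all paths and names below are illustrative placeholders for the real URLs.

```shell
# Local bare repos stand in for GitHub ("origin") and Bioconductor ("upstream")
git init --bare github_stub.git
git init --bare bioc_stub.git

git clone github_stub.git pkg
cd pkg
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git remote add upstream ../bioc_stub.git

# Confirm both remotes are set up, as with `git remote -v` in the notes
git remote -v

# Commit a change and push it to both remotes
echo "Package: example" > DESCRIPTION
git add DESCRIPTION
git commit -m "update message"
git push origin HEAD:master
git push upstream HEAD:master
cd ..
```

In practice, "origin" would be https://github.com/[username]/[package] and "upstream" would be git@git.bioconductor.org:packages/[package].git, as in the notes above.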
- Amazon AWS (cloud computing)
- Even though I still have some free Google Cloud credits, I encountered an issue with a newer gcsfuse interface, such that I thought it might be easier to go back to AWS (or purchase a Linux server for my apartment)
- So, here are some general notes:
- I would recommend using PuTTY to connect to your EC2 instances
- S3 storage and EFS storage are different (I would use S3 for sharing large datasets, and EFS for mounting internally shared data between EC2 instances)
- Amazon provides a way to make EFS mounting easier using amazon-efs-utils, with the two commands:
- sudo yum install -y amazon-efs-utils (installation)
- sudo mount -t efs [file system ID]:/ /path/to/efs (mounting the EFS storage)
- You can also see similar instructions when you view the full information about the file system that you created.
- aws Command Line Interface (CLI) - includes commands to work with S3 storage, and it is already installed on EC2 instances (but I noticed a command to transfer data from S3 to EFS/EC2 didn't work exactly as planned)
- Instead, if you are on Windows, I would recommend WinSCP to transfer data from your local computer to an EC2 instance (and, in turn, the EFS mounted storage)