
Friday, August 30, 2013

Linux Command Line Aliases

As a data guy, I do an awful lot of work from the command line.  I get so much more done that way.  And if your experience is anything like mine, you may have found that there are a number of moderately complicated Linux shell commands that are extremely useful, but you don't use them often enough to type them from memory when you need them.  So instead you rummage through that pile of papers on your desk (or the files on your hard drive, or the posts on your blog, or the search results from Stack Overflow, etc., etc.) looking for "that one command" that was "just what you needed" when you were in a similar situation seven and a half months ago.  What a waste of time!

Or maybe you can remember the command, but it's long and tedious, and you just hate typing it every time you need to use it.

Enter command line aliases!

Aliases are super handy.  They allow you to take a long, arcane, tedious command like (IFS=$'\n'; du -s -BM `find -type d | egrep '^./[^/]+$'` | sort -nr | less) and replace it with a short, easy-to-remember label (an alias) like "dud" (disk usage directories).

Next I'll show you how to set up aliases and give a few examples of my favorites.  I'm using the bash interpreter on Ubuntu Linux, so I'll assume that you're using it, too.  It shouldn't be hard to adapt these instructions for other interpreters or distributions.

Configuring Aliases


There is a file called .bashrc in your home directory.  You can edit it with the following command line:
{editor} ~/.bashrc

replacing {editor} with the name of your editor of choice (emacs, nano, vim, etc.).  After that, it's simply a matter of adding alias lines to the end of the file.  Each alias line has the following format:
alias {label}='{command}'

where {label} is the short label that you'd like to type instead of {command}.
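
For example, a simple (hypothetical) alias to shorten a common directory listing would look like this:

alias ll='ls -alF'

With that line in .bashrc, typing ll runs ls -alF.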

(NOTE: Single quotes ensure that characters that are normally special to the shell ($, !, #, etc.) are not interpreted in any special way.  You often need to put {command} in single quotes for everything to work properly.  But beware: single quotes can create headaches when your command relies on single quotes as well, because you can't simply escape a single quote (') within single quotes.  See the top answer to this question on Stack Overflow for help with embedded single quotes.)
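
Here's a minimal sketch of the standard workaround (the alias name is hypothetical): close the single-quoted string, insert an escaped quote, and reopen it, so each '\'' sequence contributes one literal single quote.

alias findtxt='find . -name '\''*.txt'\'''

After expansion, the alias runs find . -name '*.txt', with the quotes intact to protect the wildcard from the shell.  The '\'' idiom and the '"'"' idiom used in the examples below are interchangeable.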

That's it!  Just close your terminal window, open a new one, and you should now be able to use your new aliases to run those commands that are hard to remember or tedious to type!
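
Alternatively, you can reload the file in your current session without opening a new terminal:

source ~/.bashrc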

But wait.  It's entirely possible that six months from now (or next week, for that matter) I won't remember the original command or the alias I just created.  Fortunately, typing "alias" on the command line prints a list of all the aliases that are currently active.
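
For example, with the hypothetical ll alias from above defined, the output will include a line like this:

aaron@ajvb:~$ alias
alias ll='ls -alF'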

Alias Examples


Now for a few examples of aliases in action.  The following are some of my favorites.

dud -- disk usage directories


(NOTE: This alias is a good example of using embedded single quotes.  It's not pretty, but it was the only way I found to get it to work.  Anyone know of a better way to do this?)
alias dud='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type d | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'

When I'm cleaning up my hard drive, doing backups, reorganizing directories, etc., I often want to know how much space a particular directory takes up.  dud shows a list of the directories that are children of the current directory and how much space each one takes up, sorted in descending order by the amount of storage space each directory uses.
aaron@ajvb:~$ dud
6101M ./Dropbox
2827M ./devel
296M ./tika
161M ./.cache
154M ./.m2
124M ./.cpan
117M ./.dropbox
88M ./.mozilla
50M ./.dropbox-dist
44M ./.config
10M ./R
10M ./.local
9M ./Downloads
...

duf -- disk usage files


alias duf='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type f | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'

Similar to dud, this alias shows the space taken up by each file in the current directory.
aaron@ajvb:~$ duf
14M ./Product_Report.xls
2M ./Spending Vis.ai
1M ./Ward Activity August 8th.docx
1M ./.dropbox
1M ./content.xml
1M ./Combined-Data.xlsx
1M ./AJ Statement.pgp
1M ./Activity Sigh Up.docx
1M ./Activity Flyer Large.pdf
1M ./Activity Flyer Large.docx
...

Traversing Upwards through the Directory Hierarchy


Sometimes in my projects I find myself really deeply entrenched in the folder hierarchy.  There is nothing worse (OK, there are probably worse things) than having to cd ../../../../.. your way towards some other folder higher up the tree.  Here are a couple of aliases that help with that.
alias .='cd ..'
alias ..='cd ../..'
alias ...='cd ../../..'
alias ....='cd ../../../..'
alias .....='cd ../../../../..'

Or, if you prefer even less typing, you could use the following instead.  (One caveat worth adding: in both of these schemes, aliasing . shadows the shell's . (source) builtin at the interactive prompt, so you may want to drop that particular alias if you source scripts that way.)
alias .='cd ..'
alias .2='cd ../..'
alias .3='cd ../../..'
alias .4='cd ../../../..'
alias .5='cd ../../../../..'
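
Usage looks like this (the paths are hypothetical):

aaron@ajvb:~/devel/project/src/main$ .3
aaron@ajvb:~/devel$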

Conclusion


Aliases are awesome, and customizing the command line interface gives you some serious satisfaction and some geek cred to boot.  I'll probably add to this list over time. In the meantime, what are some of your favorite command line aliases? Feel free to share them in the comments below.

Wednesday, December 15, 2010

Continuously monitoring open files in real-time

Recently, I wanted to be able to get a list of files that were being opened by a running process.  Searching all over the web, I found a number of solutions, but they all involved using the lsof command.

The lsof command has many, many options, and it allows you to see which files have been opened by a given process.  Going the other direction, it also allows you to see which process has opened a given file.  I can see how it would be extremely useful in many different circumstances.

However, my problem was that my process was opening files and closing them almost immediately. In other words, I had no hope of using the lsof command to view open files, because lsof only shows files that are currently open, and by the time lsof would run, the files were already closed again!

I discovered a different way to continuously monitor, in real-time, all of the files that were being opened by a process, regardless of how quickly they were closed:

strace -tt myprog 2> system_calls.txt
grep 'open(' system_calls.txt > opened_files.txt


strace is a command that logs all of the system calls made by myprog.  The -tt option includes a timestamp (with microseconds) at the beginning of each line.  Each file is opened with a call to open(), so grepping for "open(" should give you a list of all files that were opened.
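
As a side note (my addition, assuming a reasonably modern strace and kernel): strace can filter system calls itself, which saves the separate grep step, and -f follows child processes too.  On current systems most files are opened via openat() rather than open(), so it's worth tracing both:

strace -f -tt -e trace=open,openat myprog 2> opened_files.txt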

Tuesday, August 31, 2010

Installing Ubuntu 10.04 (Lynx) x86_64 Server on Dell XPS 630i

Recently, I needed to install Ubuntu on a Dell XPS 630i.  There was one irritating problem: the installation CD would consistently freeze just after selecting "Install Ubuntu" from the main menu, leaving me with a blinking white cursor and the inner turmoil that can only be experienced while wondering whether your computer is actually doing anything.

I've never had these kinds of problems installing Ubuntu before, and I wasn't really sure where to start troubleshooting.  A number of forum posts describing situations similar to mine recommended changing some of the boot parameters, such as noapic, nolapic, noacpi, etc.

None of this worked.

I finally found this post on a Dell community forum, which ingeniously suggested the following:

  1. Install Ubuntu 8.04 (Heron) x86_64 Server.

  2. Check for updates in the package manager.  Install all UPDATES (NOT the distribution upgrade).

  3. Restart.

  4. In a terminal, run: sudo update-manager --devel-release

  5. Check for updates one more time.  THEN click the button at the top of the package manager window to install the distribution upgrade, arriving at 10.04 (Lynx).

I never would have thought of that.  I followed the post instructions exactly.  Success!  Everything appears to be working just fine.  Here's to you, jakeman66.

Tuesday, October 13, 2009

Keeping a remote process running after terminal disconnect

Quoting TheOneKEA at http://www.linuxquestions.org/questions/linux-general-1/keeping-a-process-running-after-disconnect-150235/:

nohup is what you want - it's a wrapper that blocks the SIGHUP signal sent to all applications connected to a terminal when that terminal is closed by the shell.

Just ssh into the box and start the command using this syntax:

[user@remoteboxen user]$ nohup /path/to/command arguments &

The man page explains it better.
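
A related trick (my addition, not from the quoted post): if the command is already running under bash, you can detach it after the fact with the disown builtin:

[user@remoteboxen user]$ /path/to/command arguments &
[user@remoteboxen user]$ disown -h %1

The -h flag tells bash not to forward SIGHUP to that job (job %1 here) when the terminal goes away.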

Friday, September 11, 2009

Rebuilding the VirtualBox Kernel Modules (Ubuntu 9.04)

Any time there is a kernel update, you would do well to rebuild the VirtualBox kernel module to ensure compatibility with your new kernel version. This can be done by executing the following command from the terminal:

sudo /etc/init.d/vboxdrv setup

Thursday, September 10, 2009

Installing Fonts in Linux (Ubuntu 9.04)

First, you can find some good free font downloads at http://www.sostars.com.  I downloaded a stencil font called "Ver Army." I unzipped the file, and found a .ttf font file.

I learned how to install it from this page.  Here's a summary:

To install Microsoft Windows fonts: sudo apt-get install ttf-mscorefonts-installer
To install Red Hat Liberation fonts: sudo apt-get install ttf-liberation

To install any other kind of font (including the one I downloaded from sostars.com):

  1. mkdir ~/.fonts (make a font directory in your home directory if one doesn't exist already)

  2. mv ver-army.ttf ~/.fonts (move your .ttf file into the .fonts folder)

  3. Restart the computer (or see the note below)
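
A restart works, but a quicker alternative (my addition, not from the original page) is to rebuild the font cache directly with fontconfig's fc-cache tool:

fc-cache -f -v ~/.fonts

Here -f forces a rebuild of the cache and -v prints what it's doing; newly installed fonts should then show up without a reboot.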


Monday, August 17, 2009

GNU sed (Stream EDitor)

sed -r 's/\t+/,/g'

  sed    invokes the stream editor
  -r     use extended regular expressions (similar to the -E argument for grep); this gives meaning to the '+' character in the regex
  s      tells sed that we are doing a replacement ("substitution") operation
  \t+    find occurrences of one or more tab characters
  ,      replace them with a comma
  g      do this substitution for all occurrences of \t+, not just the first on each line


So, today I had a problem.  A friend needed me to convert a 10 MB data file from tab-separated format to comma-separated format.

"This should take about 2 seconds."

I wasn't on my trusty little laptop (running Ubuntu 9.04 Jaunty Jackalope since March) and was stuck using a lab computer on campus, which was, of course, running Windows XP with no useful utilities whatsoever.  To try to save some time, I tried to do this conversion right on my friend's computer.  We opened the document in MS Word, and tried to do a Find and Replace for tabs, converting them to commas.

Slow.  Killed the program several minutes into the operation.

Next, over to my trusty laptop.  Loaded up jEdit, a handy programming editor that has done well for me in the past.  Tried to do the find and replace.

Also slow.  I killed this one about 10 minutes into the operation.  "It really shouldn't be taking this long."  What went wrong?  jEdit had run out of memory.  I found that out from the command-line terminal where I launched it.  Hmmm... maybe some kind of error box would have been nice so I didn't just sit there for 10 minutes wondering. ;)

No more of this garbage.  We're going to the command line.

Always go to the command line.

I already knew about sed, but my memory was a little rusty on the command-line arguments.  After about 10 minutes, I finally found what I was looking for.

Converted the file in about 2 seconds.
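
For the record, the full invocation looks something like this (file names hypothetical):

sed -r 's/\t+/,/g' friend-data.txt > friend-data.csv

(One caveat: this simple substitution doesn't add any quoting, so it would mangle fields that already contain commas.)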

Why is it that something that should take 2 seconds always takes 30 minutes?

Monday, April 13, 2009

Shell script for Google search result parsing

This is the shell script I wrote to help me perform the analysis I did for Quest 5.

1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each page (Google will only give you 10 pages at most) into a folder named yoursite.edu
3. Download the shell script to the directory that contains the yoursite.edu directory.
4. At the command prompt, type:
./google-results-parse yoursite.edu


5. OR, if you named the yoursite.edu directory something different, run this:
./google-results-parse yoursite.edu savedresultsdirectory


6. It will create a "savedresultsdirectory-parsed" directory, which will contain a "domainlist" file and a "pagelinks" directory. The "domainlist" gives the subdomain breakdown of the search results.  The "pagelinks" folder contains files for each subdomain that include all of the search result URLs for that subdomain.

Download the file here.


#!/bin/sh

site_name=''
results_path=''
parsed_path=''

### validate arguments
if [ $# -lt 1 ]; then
  printf "usage: google-results-parse exampledomain.edu [/googleresults/directory/path]\n"
  exit 1
fi

if [ $# -eq 1 ] && [ -d "$1" ]; then
  site_name=$1
  results_path=$1
elif [ $# -eq 2 ] && [ -d "$2" ]; then
  site_name=$1
  results_path=$2
else
  printf "The first parameter must be the domain name; the optional second parameter is the directory containing the saved Google search results\n"
  exit 1
fi

### create "-parsed" directory
parsed_path=${results_path}-parsed
mkdir -p "$parsed_path"

### create "pagelinks" directory
pagelinks_path=${parsed_path}/pagelinks
mkdir -p "$pagelinks_path"

### count up the total number of CC page instances per domain
grep -ohr "http://[^/]*$site_name/" "${results_path}"/* | sort | uniq -c | sort -gr > "${parsed_path}/domainlist"

### get all of the individual links within these pages that remain in the initial domain
grep -Eho "http://[^/]+" "${parsed_path}/domainlist" > /tmp/clean_domains_$$
grep -ohr "http://[^/]*$site_name/[^\"']*" "${results_path}"/* | sort | uniq > /tmp/pagelinks_$$

### put links for each domain in its own file
for line in $(cat /tmp/clean_domains_$$)
do
  grep "$line" /tmp/pagelinks_$$ | sort > ${pagelinks_path}/pagelinks-${line#"http://"}
done

### send wget to go get these page links!
#for file in $(ls ${parsed_path}/pagelinks)
#do
#  wget --input-file=${parsed_path}/pagelinks/${file} --wait=1 --random-wait --force-directories --directory-prefix=${parsed_path}/downloads --no-clobber
#done

### scan for media links
### jpg, gif, png, mp3, zip, doc, docx, xls, xlsx
### grep -Erho 'http://.*byu.edu/[^"]+.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)' * | uniq

### remove all temporary files for this script
rm /tmp/*_$$