Tuesday, October 13, 2009

Keeping a remote process running after terminal disconnect

Quoting TheOneKEA at http://www.linuxquestions.org/questions/linux-general-1/keeping-a-process-running-after-disconnect-150235/:

nohup is what you want - it's a wrapper that blocks the SIGHUP signal sent to all applications connected to a terminal when that terminal is closed by the shell.

Just ssh into the box and start the command using this syntax:

[user@remoteboxen user]$ nohup /path/to/command arguments &

The man page explains it better.
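
One thing worth remembering: if you don't redirect the output yourself, nohup appends whatever the command prints to a file called nohup.out in the current directory.  If you want the output to land somewhere specific, something like this works (the log path here is just a placeholder):

[user@remoteboxen user]$ nohup /path/to/command arguments > ~/command.log 2>&1 &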

Friday, September 11, 2009

Rebuilding the VirtualBox Kernel Modules (Ubuntu 9.04)

Any time there is a kernel update, you would do well to rebuild the VirtualBox kernel module to ensure compatibility with your new kernel version. This can be done by executing the following command from the terminal:

sudo /etc/init.d/vboxdrv setup
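
If you want to make sure the rebuilt module actually loaded, a quick sanity check from the terminal is:

lsmod | grep vboxdrv

If that prints a vboxdrv line, the module is in place for the new kernel.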

Thursday, September 10, 2009

Installing Fonts in Linux (Ubuntu 9.04)

First, you can find some good free font downloads at http://www.sostars.com.  I downloaded a stencil font called "Ver Army."  I unzipped the file and found a .ttf font file.

I learned how to install it from this page. Here's a summary:

To install Microsoft Windows fonts: sudo apt-get install ttf-mscorefonts-installer
To install Red Hat Liberation fonts: sudo apt-get install ttf-liberation

To install any other kind of font (including the one I downloaded from sostars.com):

  1. mkdir ~/.fonts (make a font directory in your home directory if one doesn't exist already)
  2. mv ver-army.ttf ~/.fonts (move your ttf file into the .fonts folder)
  3. Restart the computer
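
Put together, it's only a couple of commands before the restart.  This is just a sketch; substitute your own .ttf filename for ver-army.ttf.  The fc-cache line is an optional shortcut that usually refreshes the font cache without a full reboot:

mkdir -p ~/.fonts
mv ver-army.ttf ~/.fonts/
fc-cache -f ~/.fonts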


Monday, August 17, 2009

GNU sed (Stream EDitor)

sed -r 's/\t+/,/g'

sed    invoke the stream editor
-r     use extended regular expressions (similar to using the -E argument for grep); this gives meaning to the '+' character in my regex
s      tells sed that we are doing a replacement ("substitution") operation
\t+    find occurrences of one or more tab characters
,      replace them with a comma
g      do this substitution for every occurrence on each line, not just the first
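
In practice you just point it at the input file and redirect the output; the filenames below are only stand-ins for whatever you happen to be converting:

sed -r 's/\t+/,/g' data.tsv > data.csv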


So, today I had a problem.  A friend needed me to convert a 10 MB data file from tab-separated format to comma-separated format.

"This should take about 2 seconds."

I wasn't on my trusty little laptop (running Ubuntu 9.04 Jaunty Jackalope since March) and was stuck using a lab computer on campus, which was, of course, running Windows XP with no useful utilities whatsoever.  Hoping to save some time, I tried to do the conversion right on my friend's computer.  We opened the document in MS Word and tried to do a Find and Replace for tabs, converting them to commas.

Slow.  Killed the program several minutes into the operation.

Next, over to my trusty laptop.  Loaded up jEdit, a handy programming editor that has done well for me in the past.  Tried to do the find and replace.

Also slow.  Killed this about 10 minutes into the operation.  "It really shouldn't be taking this long."  What went wrong?  JEdit was out of memory.  I found that out from the command-line terminal where I launched jEdit.  Hmmm... Maybe some kind of error box would have been nice so I didn't just sit there for 10 minutes wondering. ;)

No more of this garbage.  We're going to the command line.

Always go to the command line.

I already knew about sed, but my memory was a little rusty on the command-line arguments.  After about 10 minutes, I finally found what I was looking for.

Converted the file in about 2 seconds.

Why is it that something that should take 2 seconds always takes 30 minutes?

Monday, April 13, 2009

Shell script for Google search result parsing

This is the shell script I wrote to help me perform the analysis I did for Quest 5.

1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each page (Google will only give you 10 at most) into a folder named yoursite.edu
3. Download the shell script to the directory that contains the yoursite.edu directory.
4. At the command prompt, type:
./google-results-parse yoursite.edu

5. OR, if you named the yoursite.edu directory something different, run this:
./google-results-parse yoursite.edu savedresultsdirectory

6. It will create a "savedresultsdirectory-parsed" directory, which will contain a "domainlist" file and a "pagelinks" directory. The "domainlist" gives the subdomain breakdown of the search results.  The "pagelinks" folder contains files for each subdomain that include all of the search result URLs for that subdomain.

Download the file here.
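
For reference, the output of a run against yoursite.edu ends up laid out roughly like this (the two subdomains are just made-up examples):

yoursite.edu-parsed/
    domainlist
    pagelinks/
        pagelinks-www.yoursite.edu
        pagelinks-library.yoursite.edu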


#!/bin/sh

site_name=''
results_path=''
parsed_path=''

### validate arguments
if [ $# -lt 1 ]; then
  printf "usage: google-results-parse exampledomain.edu [/googleresults/directory/path]\n"
  exit 1
fi

if [ $# -eq 1 ] && [ -d "$1" ]; then
  site_name=$1
  results_path=$1
elif [ $# -eq 2 ] && [ -d "$2" ]; then
  site_name=$1
  results_path=$2
else
  printf "The first argument is the domain name; the saved search-results directory must exist (pass its path as a second argument if it is not named after the domain)\n"
  exit 1
fi

### create "-parsed" directory
parsed_path=${results_path}-parsed
if [ ! -d $parsed_path ]; then
  mkdir $parsed_path
fi

### create "pagelinks" directory
pagelinks_path=${parsed_path}/pagelinks
if [ ! -d $pagelinks_path ]; then
  mkdir $pagelinks_path
fi

### count up the total number of CC page instances per domain
grep -ohr "http://[^/]*$site_name/" ${results_path}/* | sort | uniq -c | sort -gr > ${parsed_path}/domainlist

### get all of the individual links within these pages that remain in the initial domain
grep -Eho "http://[^/]+" ${parsed_path}/domainlist > /tmp/clean_domains_$$
grep -ohr "http://[^/]*$site_name/[^"']*" ${results_path}/* | sort | uniq > /tmp/pagelinks_$$

### put links for each domain in its own file
for line in $(cat /tmp/clean_domains_$$)
do
  grep "$line" /tmp/pagelinks_$$ | sort > ${pagelinks_path}/pagelinks-${line#"http://"}
done

### send wget to go get these page links!
#for file in $(ls ${parsed_path}/pagelinks)
#do
#  wget --input-file=${parsed_path}/pagelinks/${file} --wait=1 --random-wait --force-directories --directory-prefix=${parsed_path}/downloads --no-clobber
#done

### scan for media links
### jpg, gif, png, mp3, zip, doc, docx, xls, xlsx
### grep -Erho 'http://.*byu.edu/[^"]+.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)' * | uniq

### remove all temporary files for this script
rm /tmp/*_$$