Monday, April 13, 2009

Shell script for Google search result parsing

This is the shell script I wrote to help with the analysis for Quest 5.

1. Perform a site:yoursite.edu search in Google, displaying 100 results per page.
2. Save each results page (Google will give you at most 10 pages) into a folder named yoursite.edu
3. Download the shell script to the directory that contains the yoursite.edu directory.
4. At the command prompt, type:
./google-results-parse yoursite.edu


5. OR, if you saved the results into a directory with a different name, run this:
./google-results-parse yoursite.edu savedresultsdirectory


6. It will create a "savedresultsdirectory-parsed" directory, which will contain a "domainlist" file and a "pagelinks" directory. The "domainlist" file gives the subdomain breakdown of the search results, and the "pagelinks" folder contains one file per subdomain listing all of the search result URLs for that subdomain. A sample run is sketched below.
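For example, a run and its output might look something like this (the subdomains and counts here are made up, just to show the shape of the output):

./google-results-parse yoursite.edu
cat yoursite.edu-parsed/domainlist
    412 http://www.yoursite.edu/
    130 http://news.yoursite.edu/
     57 http://library.yoursite.edu/
ls yoursite.edu-parsed/pagelinks
pagelinks-library.yoursite.edu
pagelinks-news.yoursite.edu
pagelinks-www.yoursite.edu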

Download the file here.


#!/bin/sh

site_name=''
results_path=''
parsed_path=''

### validate arguments
if [ $# -lt 1 ]; then
  printf "usage: google-results-parse exampledomain.edu [/googleresults/directory/path]\n"
  exit 1
fi

if [ $# -eq 1 ] && [ -d "$1" ]; then
  site_name=$1
  results_path=$1
elif [ $# -eq 2 ] && [ -d "$2" ]; then
  site_name=$1
  results_path=$2
else
  printf "The first argument is the domain name; the results directory (the domain name by default, or the second argument) must exist\n"
  exit 1
fi

### create "-parsed" directory
parsed_path=${results_path}-parsed
if [ ! -d "$parsed_path" ]; then
  mkdir "$parsed_path"
fi

### create "pagelinks" directory
pagelinks_path=${parsed_path}/pagelinks
if [ ! -d "$pagelinks_path" ]; then
  mkdir "$pagelinks_path"
fi

### count up the total number of CC page instances per domain
grep -ohr "http://[^/]*$site_name/" ${results_path}/* | sort | uniq -c | sort -gr > ${parsed_path}/domainlist
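### each domainlist line looks roughly like "    130 http://news.yoursite.edu/"
### (count first, then the subdomain; the name and number here are only illustrative)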

### get all of the individual links within these pages that remain in the initial domain
grep -Eho "http://[^/]+" ${parsed_path}/domainlist > /tmp/clean_domains_$$
grep -ohr "http://[^/]*$site_name/[^\"']*" ${results_path}/* | sort | uniq > /tmp/pagelinks_$$
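### clean_domains_$$ ends up with one bare subdomain per line (e.g. a hypothetical
### http://news.yoursite.edu) and pagelinks_$$ with every unique result URL under the site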

### put links for each domain in its own file
for line in $(cat /tmp/clean_domains_$$)
do
  grep "$line" /tmp/pagelinks_$$ | sort > ${pagelinks_path}/pagelinks-${line#"http://"}
done
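### e.g. links under a hypothetical news.yoursite.edu land in pagelinks/pagelinks-news.yoursite.edu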

### send wget to go get these page links!
#for file in $(ls ${parsed_path}/pagelinks)
#do
#  wget --input-file=${parsed_path}/pagelinks/${file} --wait=1 --random-wait --force-directories --directory-prefix=${parsed_path}/downloads --no-clobber
#done

### scan for media links
### jpg, gif, png, mp3, zip, doc, docx, xls, xlsx
### grep -Erho 'http://.*byu.edu/[^"]+.(pdf|doc|jpg|gif|png|docx|xls|xlsx|zip|wmv|mp3|wma|wav|m4p|mpeg)' * | uniq

### remove all temporary files for this script
rm /tmp/*_$$