Thursday, September 12, 2013

Unit Testing in R with RUnit

Brief Overview



  • The RUnit package can be found at the CRAN repository.

  • RUnit was written by Matthias Burger, Klaus Juenemann, and Thomas Koenig.

  • It is intended to be analogous to JUnit (for Java) and uses many of the same conventions.

  • RUnit depends on specific naming conventions to identify test files and test functions.

  • When running tests, RUnit distinguishes between failure and error.  A failure occurs when a check... function fails (e.g., checkTrue(FALSE)).  An error is reported when R itself encounters an error (usually as a result of a stop() statement); see the sketch after this list.

  • Test results can be either printed to the console or saved to a file, in either text or HTML format.

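To make the distinction concrete, here is a minimal sketch of two test functions (the names are just for illustration).  The first is reported as a FAILURE, the second as an ERROR.
# Reported as a FAILURE: a check... function did not pass.
test_fails <- function() {
  checkTrue(FALSE)
}

# Reported as an ERROR: R hit a stop() while the test was running.
test_errors <- function() {
  stop("something went wrong")
}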

Installing and Using RUnit


To install the RUnit package, type the following at the R prompt:
install.packages("RUnit")

If installation was successful, include library(RUnit) in your code to make use of the RUnit package in a script.

RUnit Naming Conventions


RUnit depends on specific naming conventions for test files and test functions.  These conventions are described by regular expressions.  Any file matching the test file regular expression will be interpreted by RUnit as a test file.  Any function within a test file that matches the test function regular expression will be interpreted by RUnit as a test function.  The default regular expressions are:

  • Test File: ^runit.+\\.r

  • Test Function: ^test.+


which simply means that any file starting with runit and ending in .r will be interpreted as a test file, and any function within that file starting with test will be interpreted as a test function.  For example, a test file with the name runitnewmath.r might contain a function called testlogarithms.

If you want to use a different naming convention for your test files and test functions, you can customize the regular expressions used to identify them.  (More on that in the Running Tests section.)

Content of Test Files


A test file consists of one or more test functions.  An example file called runitfile1.r might contain functions like the following:
testfunc1 <- function(){...}
testfunc2 <- function(){...}
testfunc3 <- function(){...}

When RUnit runs through runitfile1.r, it will run each test function (testfunc1, testfunc2, and testfunc3) once.

The test functions you write must not require parameters; they will not receive any when called by the RUnit framework.

.setUp and .tearDown


In a test file, you may optionally define the special function .setUp().  If present, RUnit will call .setUp immediately before each test function defined in the file.  This allows you to make any preparations required by all of the tests and to ensure that each test function starts its execution in an identical context.

You may also optionally define the special function .tearDown(), which is analogous to .setUp.  If present, RUnit will call .tearDown immediately after each test function defined in the file.  You would use this function to take care of any clean-up required after each test.
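For example, a test file might use these to create and remove a scratch file that every test in the file relies on (the file name scratch.txt is just for illustration):
.setUp <- function() {
  # Called immediately before each test function: create a fresh fixture file.
  writeLines("some fixture data", "scratch.txt")
}

.tearDown <- function() {
  # Called immediately after each test function: remove the fixture file.
  if (file.exists("scratch.txt")) {
    file.remove("scratch.txt")
  }
}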

Writing Tests


RUnit uses a variety of “check…” functions to designate pass/fail conditions for the test.  They are counterparts to the “assert…” functions provided in JUnit.  These functions constitute the core testing functionality of the RUnit package.  You'll want to include at least one of them in each test that you write.  If the check fails, it will produce a FAILURE in the test results.  From the RUnit documentation:
[The check... functions] check the results of some test calculation. If these functions are called within the RUnit framework, the results of the checks are stored and reported in the test protocol.

checkEquals compares two R objects by invoking all.equal on the two objects. If the objects are not equal an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

checkEqualsNumeric works just like checkEquals except that it invokes all.equal.numeric instead of all.equal.

checkIdentical is a convenience wrapper around identical using the error logging mechanism of RUnit.

checkTrue uses the function identical to check if the expression provided as first argument evaluates to TRUE. If not, an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

checkException evaluates the passed expression and uses the try mechanism to check if the evaluation generates an error. If it does the test is OK. Otherwise an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

DEACTIVATED interrupts the test function and reports the test case as deactivated. In the test protocol deactivated test functions are listed separately. Test case deactivation can be useful in the case of major refactoring. Alternatively, test cases can be commented out completely but then it is easy to forget the test case altogether.
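To give a feel for these, here is a minimal sketch of a test function exercising the most common checks (all of these pass):
test_check_examples <- function() {
  checkEquals(c(1, 2, 3), 1:3)       # passes: all.equal tolerates double vs. integer
  checkIdentical(1:3, 1:3)           # passes: the two objects are identical
  checkTrue(2 + 2 == 4)              # passes: the expression evaluates to TRUE
  checkException(stop("boom"))       # passes: evaluating the expression raises an error
}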


Running Tests


Once you've created one or more test files with one or more test functions, you can run them in one of two ways:

  • runTestFile

  • runTestSuite


Run One Test File


To simply run the tests in one file, use runTestFile.
runTestFile("path/to/runitmyfile.r")

Define and Run One or More Test Suites


To run a suite of tests that could include hundreds of test functions and span numerous test files, you use the defineTestSuite and runTestSuite commands.  For example:
testsuite.example <- defineTestSuite(
  name="example",
  dirs=c("."))

runTestSuite(testsuite.example)

Here, the name argument is the name of the suite that will be reported in the test results.  The dirs argument specifies the directories that you want to search for test files to be included in the suite.  Possible values could be:

  • Relative Paths

    • "." : Look in the current working directory.

    • ".." : Look in the parent directory

    • "./test" : Look in the test directory that is a child to the current working directory

    • etc.



  • Absolute Paths

    • "/home/aaron/projects/R/mathstuff"

    • etc.




The RUnit package documentation suggests that these directory names should all be absolute paths.  In my own experience, relative paths appear to work just as well.

You can specify multiple directories by wrapping them in the c() function.  For example:
testsuite.example <- defineTestSuite(
  name="example",
  dirs=c(".", "..", "/home/aaron/projects/mathstuff"))

Running Multiple Test Suites

You can also define multiple test suites and run them with one invocation of runTestSuite() by wrapping them in the list() function as follows:
testsuite.math <- defineTestSuite("NewMath Package", dirs="/home/aaron/projects/R/bigproject/newmath")
testsuite.io <- defineTestSuite("IO Package", dirs="/home/aaron/projects/R/bigproject/io")
runTestSuite(testSuites=list(testsuite.math, testsuite.io))


Viewing the Results


If you would like to capture the results from the tests and view them in text or HTML format, you first need to capture the return value of runTestFile or runTestSuite.  Then you can use printTextProtocol and printHTMLProtocol to print the results to the terminal or to a file.  For example, let's say we run an example test suite and capture the result in a variable called results.
testsuite.example <- defineTestSuite(
  name="math",
  dirs="./math")

results <- runTestSuite(testsuite.example)

We then have a number of options for viewing the test results.

  • Print text to the terminal: printTextProtocol(results)

  • Print HTML to the terminal: printHTMLProtocol(results)

  • Print text to a file: printTextProtocol(results, "results.txt")

  • Print HTML to a file: printHTMLProtocol(results, "results.html")


A Working Example


newmath.r


# pow raises x to the yth power.
# It expects a numeric x and a non-negative integer y.
pow <- function(x, y) {
  # Need numeric arguments.
  if (mode(x) != "numeric" || mode(y) != "numeric") {
    stop("x and y must both be numeric.")
  }
  # y must be a finite integer.
  if (!is.finite(y) || y %% 1 != 0) {
    stop("y must be an integer.")
  }
  # y must be non-negative.
  if (y < 0) {
    stop("y must be greater than or equal to 0.")
  }

  # x to the 0th power always equals 1.
  if (y == 0) {
    return(1)
  }

  # Do the math in a really inefficient way.
  result <- 1
  for (i in 1:y) {
    result <- result * x
  }

  return(result)
}

runitnewmath.r


# Load the pow function from newmath.r into memory
source("newmath.r")

# x and y should both be numeric, and y finite.  If a non-numeric
# or non-finite argument is provided, we should get an exception.
test_pow_non_numeric_exception <- function() {
  checkException(pow(2, "a"))
  checkException(pow(2, "2"))
  checkException(pow(2, Inf))
}

# y must be non-negative.  If a negative number
# is provided, we should get an exception.
test_pow_y_neg_exception <- function() {
  checkException(pow(2, -4))
}

# y must be an integer.  If a floating point number
# is provided, we should get an exception.
test_pow_y_float_exception <- function() {
  checkException(pow(2, 4.1))
}

test_pow_pos_ints_correct <- function() {
  checkEquals(16, pow(2, 4))
}

test_pow_zero_correct <- function() {
  checkEquals(1, pow(2, 0))
}

Running the Tests


Here I'm assuming that the newmath.r and runitnewmath.r files are in the current working directory.  Copy and paste the following into the R command prompt:
# Define the test suite.
testsuite.newmath <- defineTestSuite(
  name="newmath",
  dirs=".")

# Run the suite.
results <- runTestSuite(testsuite.newmath)

# Print the results to a pretty HTML file.
printHTMLProtocol(results, "results.html")

If everything happened correctly, you should be able to open up the results.html file in a web browser and see something like the following:

[Screenshot: the results.html test protocol for the newmath suite]

Monday, September 2, 2013

Publicly Available Data

Sometimes I'm looking for interesting data sets to play with.  Here's an (incomplete) attempt at making a list of sites where one can find publicly available data sets.  It's definitely a work in progress.  I'll update this as I find more of them and try to organize it a little better.

TODO

  • Keep finding more useful data sources.

  • Dig into each of these sources a bit, provide some annotations.

  • Think of better ways to organize these links.  Things to consider: sponsoring organizations, geographic locations of sponsoring organizations, accessibility of data, data quality, data freshness, disciplines.  Consider using a table that lists each source and indicates the presence/absence of various attributes.

  • Rate each resource on ease of use (e.g., a site producing CSVs is more readily accessible to more people than a site that requires SPARQL queries, etc.)


Sites Offering Free Data Sets


Government Websites


United States of America



Labor

Business/Finance/Economics

Education

Health

United Kingdom



Media Websites



Wikipedia-based



Other


Friday, August 30, 2013

Linux Command Line Aliases

As a data guy, I do an awful lot of work from the command line.  I get so much more done that way.  And if your experience is anything like mine, you may have found that there are a number of moderately complicated Linux shell commands that are extremely useful, but you don't use them often enough to type them from memory when you need them.  So instead you rummage through that pile of papers on your desk (or the files on your hard drive, or the posts on your blog, or the search results from Stack Overflow, etc., etc.) looking for "that one command" that was "just what you needed" when you were in a similar situation seven and a half months ago.  What a waste of time!

Or maybe you can remember the command, but it's long and tedious, and you just hate typing it every time you need to use it.

Enter command line aliases!

Aliases are super handy.  They allow you to take a long/arcane/obscure/tedious command (like (IFS=$'\n'; du -s -BM `find -type d | egrep '^./[^/]+$'` | sort -nr | less)) and replace it with a shorter label that is easier to remember (an alias), like dud (disk usage directories).

Next I'll show you how to set up aliases and give a few examples of my favorites.  I'm using the bash interpreter on Ubuntu Linux, so I'll assume that you're using it, too.  It shouldn't be hard to adapt these instructions for other interpreters or distributions.

Configuring Aliases


There is a file called .bashrc in your home directory.  You can edit it with the following command line:
{editor} ~/.bashrc

replacing {editor} with the name of your editor of choice (emacs, nano, vim, etc.)  After that, it's simply a matter of adding alias lines to the end of the file.  Each alias line has the following format:
alias {label}='{command}'

where {label} is the short label that you'd like to type instead of {command}.

(NOTE: Single quotes ensure that characters that are normally special to the shell---$, !, #, etc.---are not interpreted in any special way.  You often need to put {command} in single quotes for everything to work properly.  But beware that using single quotes can create some headaches when your command relies on single quotes as well.  You can't just escape a single quote (') within single quotes.  See the top answer for this question at Stack Overflow for help on embedded single quotes.)

That's it!  Just close your terminal window and open a new one (or run source ~/.bashrc in the current window), and you should now be able to use your new aliases to run those commands that are hard to remember or tedious to type!

But wait.  It's entirely possible that six months from now (or next week, for that matter) I won't remember the original command or the alias I just created.  In that case, typing alias by itself at the command line will list all of the currently active aliases.

Alias Examples


Now for a few examples of aliases in action.  The following are some of my favorites.

dud -- disk usage directories


(NOTE: This alias is a good example of using embedded single quotes.  It's not pretty, but it was the only way I found to get it to work.  Anyone know of a better way to do this?)
alias dud='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type d | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'
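
One possible answer to my own question: define a shell function in ~/.bashrc instead of an alias.  A function body isn't wrapped in quotes, so the contortions disappear.  A sketch of the same command as a function:
dud() {
  # Split words only on newlines so directory names containing spaces survive.
  (IFS=$'\n'; du -s -BM `find -type d | egrep '^./[^/]+$'` | sort -nr | less)
}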

When I'm cleaning up my hard drive, doing backups, reorganizing directories, etc., I often want to know how much space a particular directory takes up.  dud will show a list of the directories that are children of the current directory and how much space each one takes up.  It sorts the list in descending numerical order by the amount of storage space each directory uses.
aaron@ajvb:~$ dud
6101M ./Dropbox
2827M ./devel
296M ./tika
161M ./.cache
154M ./.m2
124M ./.cpan
117M ./.dropbox
88M ./.mozilla
50M ./.dropbox-dist
44M ./.config
10M ./R
10M ./.local
9M ./Downloads
...

duf -- disk usage files


alias duf='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type f | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'

Similar to dud, this alias shows the space taken up by each file in the current directory.
aaron@ajvb:~$ duf
14M ./Product_Report.xls
2M ./Spending Vis.ai
1M ./Ward Activity August 8th.docx
1M ./.dropbox
1M ./content.xml
1M ./Combined-Data.xlsx
1M ./AJ Statement.pgp
1M ./Activity Sigh Up.docx
1M ./Activity Flyer Large.pdf
1M ./Activity Flyer Large.docx
...

Traversing Upwards through the Directory Hierarchy


Sometimes in my projects I find myself really deeply entrenched in the folder hierarchy.  There is nothing worse (OK, there are probably worse things) than having to cd ../../../../.. your way towards some other folder higher up the tree.  Here are a couple of sets of aliases that help with that.  (One caveat: aliasing . shadows the shell's built-in . (source) command, so skip that first alias if you ever source scripts with . somescript.sh.)
alias .='cd ..'
alias ..='cd ../..'
alias ...='cd ../../..'
alias ....='cd ../../../..'
alias .....='cd ../../../../..'

Or, if you prefer even less typing, you could use the following instead.
alias .='cd ..'
alias .2='cd ../..'
alias .3='cd ../../..'
alias .4='cd ../../../..'
alias .5='cd ../../../../..'

Conclusion


Aliases are awesome, and customizing the command line interface gives you some serious satisfaction and some geek cred to boot.  I'll probably add to this list over time. In the meantime, what are some of your favorite command line aliases? Feel free to share them in the comments below.

Thursday, December 20, 2012

Transferring Data Between R and Excel

For a long time, I assumed that the only way to transfer data between R and Excel was to do something like the following:


write.csv(x=some.data.frame, file="some_file.csv")


and then to open the resulting CSV file in Excel.  I didn't like doing this, because by the end of an analysis I would have a bunch of temporary "deleteme.csv" files cluttering my Desktop.

It turns out there is a better way to move data between R and Excel:

Move Data from R to Excel

First, from R:
write.table(x=some.data.frame, file="clipboard", sep="\t")

This command copies the specified data frame (not sure how well it works with other object types) to the system clipboard.

Then all one needs to do is open a new workbook in Excel and Ctrl-V (paste) the data into the workbook!  Easy as pie!
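
One wrinkle worth noting: because write.table also writes row names, the column headers land one cell to the left of their data when pasted.  The convention suggested in R's write.table documentation for data headed to a spreadsheet is col.names=NA, which adds a blank header over the row-name column:
write.table(x=some.data.frame, file="clipboard", sep="\t", col.names=NA)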

Move Data from Excel to R

First, in Excel, select the data you would like to copy, then copy it (Ctrl-C).

Then, in R:
data.from.excel <- read.table(file="clipboard", sep="\t")

That's all there is to it.  This will allow you to quickly and easily transfer data between R and Excel, or more generally, between R and any program that can read data from the clipboard.
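
A caveat: as far as I can tell, the "clipboard" pseudo-file works only on Windows.  On macOS, a similar round trip can be done through pipe() and the pbcopy/pbpaste utilities.  A minimal sketch, assuming those utilities are on your PATH:
# Copy a data frame to the macOS clipboard via pbcopy.
con <- pipe("pbcopy", "w")
write.table(x=some.data.frame, file=con, sep="\t")
close(con)

# Read the macOS clipboard back into R via pbpaste.
data.from.excel <- read.table(file=pipe("pbpaste"), sep="\t")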

Wednesday, November 30, 2011

Excel Changes Values and Formatting when Importing CSV

When opening a text-based data format with Excel (CSV, tab separated, etc.) you will find that Excel has a nasty habit of automatically interpreting the data types of the cells.  Once it (thinks it) has determined the proper data type, it will also reformat the data to its default formats for numbers, currency, dates, times, etc.

For example, I work with data that involves a lot of timestamps.  I frequently have timestamp data in a MySQL database that looks like this:

   2011-11-30 13:56:02

This is a typical timestamp format that is recognized by many different programs and programming languages.

Usually I will export data from the database into CSV format and load it into Excel to do some tweaking before I turn it into a visualization.  However, when reading the data from the CSV file, Excel, detecting it as a date/time value, will reformat it according to its default formatting rules, like this:

   11/30/2011 13:56

If I then save the CSV file, Excel will save it out using its default format (MM/dd/yyyy hh:mm) instead of my original format (yyyy-MM-dd hh:mm:ss).

Yes, ladies and gentlemen, Excel just changed my values and corrupted my data.

Here is how to prevent this from happening:


  1. In Windows Explorer rename the file, changing the extension from ".csv" to ".txt".  (If you don't see file extensions, do the following:

    1. Press the Alt key in the Explorer window.  (This will make the menu bar visible.)

    2. In the newly visible menu bar, select Tools -> Folder Options ...

    3. Click on the "View" tab.

    4. UNCHECK the box that says "Hide extensions for known file types."

    5. Click OK.

    6. You should now be able to see the ".csv", ".txt", ".whatever" extensions on the file names.)



  2. In Excel, open the newly renamed file.  (You will probably need to change the file type selector to "All Files (*.*)" in order to find your .txt file.)  After clicking "Open", Excel will present you with a window titled "Text Import Wizard - Step 1 of 3".

  3. In Step 1, choose the "Delimited" radio button.  Click "Next".

  4. In Step 2, in the "Delimiters" section, check the "Comma" box (or whatever delimiter your file is using) and make sure that all other delimiter boxes are UNCHECKED.  You should now see your data divided into the appropriate columns at the bottom of this window.  Click "Next".

  5. In Step 3, click on the first column.  Use the scrollbar to scroll over to your rightmost column.  Hold down the Shift key and click on the final column.  All columns should now be selected.

  6. Click on the "Text" radio button in the "column data format" section.  Now click "Finish."


Following these steps, Excel will NOT try to automatically determine a data type, nor will it reformat or change your data in any way.

Wednesday, December 15, 2010

Continuously monitoring open files in real-time

Recently, I wanted to be able to get a list of files that were being opened by a running process.  Searching all over the web, I found a number of solutions, but they all involved using the lsof command.

The lsof command has many, many options, and it allows you to see which files have been opened by a given process.  Coming from the other direction, it also allows you to see which process has opened a given file.  I can see how it would be extremely useful in many different circumstances.

However, my problem was that my process was opening files and closing them almost immediately. In other words, I had no hope of using the lsof command to view open files, because lsof only shows files that are currently open, and by the time lsof would run, the files were already closed again!

I discovered a different way to continuously monitor, in real-time, all of the files that were being opened by a process, regardless of how quickly they were closed:

strace -tt myprog 2> system_calls.txt
grep 'open(' system_calls.txt > opened_files.txt


strace is a command that logs all of the system calls made by myprog.  The -tt option includes a timestamp (with microseconds) at the beginning of each line.  Each file is opened with a call to open(), so grepping for this string should give you a list of all files that were opened.
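
If the process of interest is already running, strace can also attach to it by PID.  A sketch, with 12345 standing in for the real PID; the -f flag additionally follows any child processes, and -e trace=open restricts the log to open() calls so the grep step becomes unnecessary:
strace -tt -f -e trace=open -p 12345 2> opened_files.txt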

Tuesday, August 31, 2010

Installing Ubuntu 10.04 (Lynx) x86_64 Server on Dell XPS 630i

Recently, I needed to install Ubuntu on a Dell XPS 630i.  There was one irritating problem: the installation CD would consistently freeze just after selecting "Install Ubuntu" from the main menu, leaving me with a blinking white cursor and the inner turmoil that can only be experienced while wondering whether your computer is actually doing anything.

I've never had these kinds of problems installing Ubuntu before, and I wasn't really sure where to start troubleshooting.  A number of forum websites with postings similar to my own situation recommended changing some of the install parameters, such as noapic, nolapic, noacpi, etc.

None of this worked.

I finally found this post on a Dell community forum, which ingeniously suggested to:


  1. Install Ubuntu 8.04 (Heron) x86_64 Server


  2. Check for updates in the package manager.  Install all UPDATES (NOT distribution upgrade)


  3. Restart


  4. In a terminal, sudo update-manager --devel-release


  5. Check for updates one more time.  THEN click on the button at the top of the package manager window to install the distribution upgrade to arrive at 10.04 (Lynx).




I never would have thought of that.  I followed the post instructions exactly.  Success!  Everything appears to be working just fine.  Here's to you, jakeman66.