Thursday, September 12, 2013

Unit Testing in R with RUnit

Brief Overview



  • The RUnit package is available from the CRAN repository.

  • RUnit was written by Matthias Burger, Klaus Juenemann, and Thomas Koenig.

  • It is intended to be analogous to JUnit (for Java) and uses many of the same conventions.

  • RUnit depends on specific naming conventions to identify test files and test functions.

  • When running tests, RUnit distinguishes between failure and error.  A failure occurs when a check... function fails (e.g., checkTrue(FALSE)).  An error is reported when R itself raises an error (usually as a result of a stop() call).

  • Test results can be either printed to the console or saved to a file, in either text or HTML format.


Installing and Using RUnit


To install the RUnit package, type the following at the R prompt:
install.packages("RUnit")

If installation was successful, include library(RUnit) in your code to make use of the RUnit package in a script.

RUnit Naming Conventions


RUnit depends on specific naming conventions for test files and test functions.  These conventions are described by regular expressions.  Any file matching the test file regular expression will be interpreted by RUnit as a test file.  Any function within a test file that matches the test function regular expression will be interpreted by RUnit as a test function.  The default regular expressions are:

  • Test File: ^runit.+\\.r

  • Test Function: ^test.+


which simply means that any file starting with runit and ending in .r will be interpreted as a test file, and any function within that file starting with test will be interpreted as a test function.  For example, a test file with the name runitnewmath.r might contain a function called testlogarithms.
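
To get a feel for which names the default patterns accept, the same regular expressions can be tried against a few candidate names with grep from the shell.  This is only a sketch: it assumes grep -E treats these simple patterns the same way R does, and the candidate names other than runitnewmath.r and testlogarithms are made up for illustration.

```shell
# Default RUnit patterns, written without R's doubled backslash.
file_pattern='^runit.+\.r'
func_pattern='^test.+'

# Hypothetical candidate names; only those matching the pattern are kept.
matching_files=$(printf '%s\n' runitnewmath.r newmath.r runit.r | grep -E "$file_pattern")
matching_funcs=$(printf '%s\n' testlogarithms test helper_logarithms | grep -E "$func_pattern")

echo "test files:     $matching_files"
echo "test functions: $matching_funcs"
```

Names such as runit.r and test are rejected because .+ requires at least one character after the prefix.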

If you want to use a different naming convention for your test files and test functions, you can customize the regular expressions used to identify them by passing the testFileRegexp and testFuncRegexp arguments to defineTestSuite, which is introduced in the Running Tests section.

Content of Test Files


A test file consists of one or more test functions.  An example file called runitfile1.r might contain functions like the following:
testfunc1 <- function(){...}
testfunc2 <- function(){...}
testfunc3 <- function(){...}

When RUnit runs through runitfile1.r, it will run each test function (testfunc1, testfunc2, and testfunc3) once.

The test functions you write must not require parameters; they will not receive any when called by the RUnit framework.

.setUp and .tearDown


In a test file, you may optionally define the special function .setUp().  If present, RUnit will call .setUp immediately before each test function that is defined in the file.  This allows you to make any preparations that are required by all of the tests in the file and to make sure that each test function in the file starts its execution in a context identical to the other tests in the file.

You may also optionally define the special function .tearDown(), which is analogous to the .setUp function.  If present, RUnit will call .tearDown immediately after calling each test function that is defined in the file.  You would use this function to take care of any clean-up that is required after running each of the tests.

Writing Tests


RUnit uses a variety of “check…” functions to designate pass/fail conditions for the test.  They are counterparts to the “assert…” functions provided in JUnit.  These functions constitute the core testing functionality of the RUnit package.  You'll want to include at least one of them in each test that you write.  If the check fails, it will produce a FAILURE in the test results.  From the RUnit documentation:
[The check... functions] check the results of some test calculation. If these functions are called within the RUnit framework, the results of the checks are stored and reported in the test protocol.

checkEquals compares two R objects by invoking all.equal on the two objects. If the objects are not equal an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

checkEqualsNumeric works just like checkEquals except that it invokes all.equal.numeric instead of all.equal.

checkIdentical is a convenience wrapper around identical using the error logging mechanism of RUnit.

checkTrue uses the function identical to check if the expression provided as first argument evaluates to TRUE. If not, an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

checkException evaluates the passed expression and uses the try mechanism to check if the evaluation generates an error. If it does the test is OK. Otherwise an error is generated and the failure is reported to the test logger such that it appears in the test protocol.

DEACTIVATED interrupts the test function and reports the test case as deactivated. In the test protocol deactivated test functions are listed separately. Test case deactivation can be useful in the case of major refactoring. Alternatively, test cases can be commented out completely but then it is easy to forget the test case altogether.


Running Tests


Once you've created one or more test files with one or more test functions, you can run them in one of two ways:

  • runTestFile

  • runTestSuite


Run One Test File


To simply run the tests in one file, use runTestFile.
runTestFile("path/to/runitmyfile.r")

Define and Run One or More Test Suites


To run a suite of tests that could include hundreds of test functions and span numerous test files, you use the defineTestSuite and runTestSuite commands.  For example:
testsuite.example <- defineTestSuite(
name="example",
dirs=c("."))

runTestSuite(testsuite.example)

Here, the name argument is the name of the suite that will be reported in the test results.  The dirs argument specifies the directories that you want to search for test files to be included in the suite.  Possible values could be:

  • Relative Paths

    • "." : Look in the current working directory.

    • ".." : Look in the parent directory

    • "./test" : Look in the test directory that is a child to the current working directory

    • etc.



  • Absolute Paths

    • "/home/aaron/projects/R/mathstuff"

    • etc.




The RUnit package documentation suggests that these directory names should all be absolute.  From my own experience, relative names appear to work just as well.

You can specify multiple directories by wrapping them in the c()  function.  For example:
testsuite.example <- defineTestSuite(
name="example",
dirs=c(".", "..", "/home/aaron/projects/mathstuff"))

Running Multiple Test Suites

You can also define multiple test suites and run them with one invocation of runTestSuite() by wrapping them in the list() function as follows:
testsuite.math <- defineTestSuite("NewMath Package", dirs="/home/aaron/projects/R/bigproject/newmath")
testsuite.io <- defineTestSuite("IO Package", dirs="/home/aaron/projects/R/bigproject/io")
runTestSuite(testSuites=list(testsuite.math, testsuite.io))


Viewing  the Results


If you would like to view the results in text or HTML format, first capture the return value of runTestFile or runTestSuite.  Then use printTextProtocol or printHTMLProtocol to print the results to the terminal or to a file.  For example, let's say we run an example test suite and capture the result in a variable called results.
testsuite.example <- defineTestSuite(
name="math",
dirs="./math")

results <- runTestSuite(testsuite.example)

We then have a number of options for viewing the test results.

  • Print text to the terminal: printTextProtocol(results)

  • Print HTML to the terminal: printHTMLProtocol(results)

  • Print text to a file: printTextProtocol(results, "results.txt")

  • Print HTML to a file: printHTMLProtocol(results, "results.html")


A Working Example


newmath.r


# pow raises x to the yth power
# It expects a real number x and an integer y >= 0.
pow <- function(x, y) {
  # Need numeric arguments.
  if (mode(x) != "numeric" || mode(y) != "numeric") {
    stop("x and y must both be numeric.")
  }
  # Y must not be negative.
  if (y < 0) {
    stop("y must be greater than or equal to 0")
  }
  # Y must be a whole number.
  if (y != floor(y)) {
    stop("y must be an integer")
  }

  # X to 0 power always equals 1.
  if (y == 0) {
    return(1)
  }

  # Do the math in a really inefficient way.
  result <- 1
  for (i in 1:y) {
    result <- result * x
  }

  return(result)
}

runitnewmath.r


# Load the pow function from newmath.r into memory
source("newmath.r")

# X and Y should both be numeric. If a non-numeric argument
# is provided, we should get an exception.
test_pow_non_numeric_exception <- function() {
  checkException(pow(2, "a"))
  checkException(pow(2, "2"))
  checkException(pow(2, Inf))
}

# Y must not be negative.  If a negative number
# is provided, we should get an exception.
test_pow_y_neg_exception <- function() {
  checkException(pow(2, -4))
}

# Y must be an integer.  If a floating point number
# is provided, we should get an exception.
test_pow_y_float_exception <- function() {
  checkException(pow(2, 4.1))
}

test_pow_pos_ints_correct <- function() {
  checkEquals(16, pow(2, 4))
}

test_pow_zero_correct <- function() {
  checkEquals(1, pow(2, 0))
}

Running the Tests


Here I'm assuming that the newmath.r and runitnewmath.r files are in the current working directory.  Copy and paste the following into the R command prompt:
# Define the test suite.
testsuite.newmath <- defineTestSuite(
  name="newmath",
  dirs=".")

# Run the suite.
results <- runTestSuite(testsuite.newmath)

# Print the results to a pretty HTML file.
printHTMLProtocol(results, "results.html")

If everything happened correctly, you should be able to open up the results.html file in a web browser and see the test protocol, including the number of test functions run and any errors or failures.

Monday, September 2, 2013

Publicly Available Data

Sometimes I'm looking for interesting data sets to play with.  Here's an (incomplete) attempt at making a list of sites where one can find publicly available data sets.  It's definitely a work in progress.  I'll update this as I find more of them and try to organize it a little better.

TODO

  • Keep finding more useful data sources.

  • Dig into each of these sources a bit, provide some annotations.

  • Think of better ways to organize these links.  Things to consider: sponsoring organizations, geographic locations of sponsoring organizations, accessibility of data, data quality, data freshness, disciplines.  Consider using a table that lists each source and indicates the presence/absence of various attributes.

  • Rate each resource on ease of use (e.g., a site producing CSVs is more readily accessible to more people than a site that requires SPARQL queries, etc.)


Sites Offering Free Data Sets


Government Websites


United States of America



Labor

Business/Finance/Economics

Education

Health

United Kingdom



Media Websites



Wikipedia-based



Other


Friday, August 30, 2013

Linux Command Line Aliases

As a data guy, I do an awful lot of work from the command line.  I get so much more done that way.  And if your experience is anything like mine, you may have found that there are a number of moderately complicated Linux shell commands that are extremely useful, but you don't use them often enough to type them from memory when you need them.  So instead you rummage through that pile of papers on your desk (or the files on your hard drive, or the posts on your blog, or the search results from Stack Overflow, etc., etc.) looking for "that one command" that was "just what you needed" when you were in a similar situation seven and a half months ago.  What a waste of time!

Or maybe you can remember the command, but it's long and tedious, and you just hate typing it every time you need to use it.

Enter command line aliases!

Aliases are super handy.  They allow you to take a long/arcane/obscure/tedious command (like (IFS=$'\n'; du -s -BM `find -type d | egrep '^./[^/]+$'` | sort -nr | less)) and replace it with a shorter label that is easier to remember (an alias), like "dud" (disk usage directories).

Next I'll show you how to set up aliases and give a few examples of my favorites.  I'm using the bash interpreter on Ubuntu Linux, so I'll assume that you're using it, too.  It shouldn't be hard to adapt these instructions for other interpreters or distributions.

Configuring Aliases


There is a file called .bashrc in your home directory.  You can edit it with the following command line:
{editor} ~/.bashrc

replacing {editor} with the name of your editor of choice (emacs, nano, vim, etc.).  After that, it's simply a matter of adding alias lines to the end of the file.  Each alias line has the following format:
alias {label}='{command}'

where {label} is the short label that you'd like to type instead of {command}.

(NOTE: Single quotes ensure that characters that are normally special to the shell, such as $, !, and #, are not interpreted when the alias is defined.  You often need to put {command} in single quotes for everything to work properly.  But beware that single quotes can create some headaches when your command relies on single quotes as well.  You can't just escape a single quote (') within single quotes.  See the top answer for this question at Stack Overflow for help on embedded single quotes.)
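
The embedded-quote workaround is easier to see on a short string.  The following sketch builds the same text, it's, three different ways; the variable names are arbitrary.

```shell
# 1. Close the single quotes, add a double-quoted ', then reopen.
a='it'"'"'s'
# 2. Close the single quotes, add a backslash-escaped ', then reopen.
b='it'\''s'
# 3. Use double quotes when nothing inside needs protecting from the shell.
c="it's"

printf '%s\n' "$a" "$b" "$c"
```

All three variables hold identical text.  The first form is the one used in the dud alias further down.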

That's it!  Just close your terminal window and open a new one (or run source ~/.bashrc in the current one), and you should now be able to use your new aliases to run those commands that are hard to remember or tedious to type!

But wait.  It's entirely possible that six months from now (or next week, for that matter) I won't remember the original command or the alias I just created.  No problem: typing "alias" on the command line prints a list of the currently active aliases.
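
A short session might look like the sketch below.  The ll alias is a hypothetical example; alias and unalias are shell built-ins.

```shell
# Define a throwaway alias (hypothetical example).
alias ll='ls -l'

# Show the definition of a single alias.
alias ll

# Show every alias that is currently defined.
alias

# Remove the alias from the current session.
unalias ll
```

Removing an alias with unalias only lasts for the current session; an alias defined in .bashrc comes back in the next terminal.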

Alias Examples


Now for a few examples of aliases in action.  The following are some of my favorites.

dud -- disk usage directories


(NOTE: This alias is a good example of using embedded single quotes.  It's not pretty, but it was the only way I found to get it to work.  Anyone know of a better way to do this?)
alias dud='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type d | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'

When I'm cleaning up my hard drive, doing backups, reorganizing directories, etc., I often want to know how much space a particular directory takes up.  dud will show a list of directories that are children of the current directory and how much space each directory takes up.  It sorts the list in descending numerical order by the amount of storage space each directory uses.
aaron@ajvb:~$ dud
6101M ./Dropbox
2827M ./devel
296M ./tika
161M ./.cache
154M ./.m2
124M ./.cpan
117M ./.dropbox
88M ./.mozilla
50M ./.dropbox-dist
44M ./.config
10M ./R
10M ./.local
9M ./Downloads
...
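
As for a better way to handle the embedded quotes: one option that avoids the nested quoting altogether is to define a shell function in .bashrc instead of an alias.  The sketch below is an approximate equivalent rather than a drop-in replacement.  It assumes GNU find and du, lets find select the immediate child directories instead of egrep, and leaves paging to the caller.

```shell
# dud: disk usage of each directory directly under the current one.
# Function version of the alias; GNU find and du are assumed.
dud() {
  # -mindepth 1 -maxdepth 1 restricts find to immediate children, and
  # -exec ... {} + passes names to du intact, even with spaces in them.
  find . -mindepth 1 -maxdepth 1 -type d -exec du -s -BM {} + | sort -nr
}
```

Call it as dud | less to page through a long listing.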

duf -- disk usage files


alias duf='(IFS=$'"'"'\n'"'"'; du -s -BM `find -type f | egrep '"'"'^./[^/]+$'"'"'` | sort -nr | less)'

Similar to dud, this alias shows the space taken up by each file in the current directory.
aaron@ajvb:~$ duf
14M ./Product_Report.xls
2M ./Spending Vis.ai
1M ./Ward Activity August 8th.docx
1M ./.dropbox
1M ./content.xml
1M ./Combined-Data.xlsx
1M ./AJ Statement.pgp
1M ./Activity Sigh Up.docx
1M ./Activity Flyer Large.pdf
1M ./Activity Flyer Large.docx
...

Traversing Upwards through the Directory Hierarchy


Sometimes in my projects I find myself really deeply entrenched in the folder hierarchy.  There is nothing worse (OK, there are probably worse things) than having to cd ../../../../.. your way towards some other folder higher up the tree.  Here are a couple of aliases that help with that.
alias .='cd ..'
alias ..='cd ../..'
alias ...='cd ../../..'
alias ....='cd ../../../..'
alias .....='cd ../../../../..'

Or, if you prefer even less typing, you could use the following instead.
alias .='cd ..'
alias .2='cd ../..'
alias .3='cd ../../..'
alias .4='cd ../../../..'
alias .5='cd ../../../../..'
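
Another option is a single shell function that takes the number of levels as an argument.  One thing to be aware of with the aliases above is that . is also bash's built-in shorthand for source, so aliasing it can hide that shorthand at the interactive prompt.  The up function below is a hypothetical helper, not a standard command, and is only a sketch.

```shell
# up N: change to the directory N levels above the current one (default 1).
# Hypothetical helper; add it to ~/.bashrc alongside your aliases.
up() {
  local levels=${1:-1} path="" i=0
  while [ "$i" -lt "$levels" ]; do
    path="../$path"
    i=$((i + 1))
  done
  # With N = 0 the path is empty, so fall back to the current directory.
  cd "${path:-.}" || return
}
```

With this in place, up 3 has the same effect as cd ../../.. and a bare up moves up one level.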

Conclusion


Aliases are awesome, and customizing the command line interface gives you some serious satisfaction and some geek cred to boot.  I'll probably add to this list over time. In the meantime, what are some of your favorite command line aliases? Feel free to share them in the comments below.