Recently, as part of an R modelling project, I put together a group of functions to perform several different jobs in a Quantitative Microbial Risk Assessment (QMRA). Traditional QMRA methodology uses population-based statistics and distributions to estimate the mean risk of an individual within a population contracting a specific disease. Currently we are working with water-borne pathogens such as Cryptosporidium, Norovirus and Campylobacter, all of which cause gastro-intestinal diseases, particularly in children in under-developed countries. The traditional methods are widely used and endorsed by many leading research and public health organizations, including the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC), among many others. What I wanted to add to the traditional analyses was the ability to accommodate a period of “acquired immunity” in the models. To make a long story very short, for most of these pathogenic infections, once an individual has had the disease, their body develops a level of natural immunity to that specific pathogen. The caveat is that this immunity is generally not complete and only lasts for a period of time, usually months, though this remains a huge area of scientific research, and currently debate.
Estimating the effects of varying immunity periods required a “new” approach called “microsimulation”. I say new; the approach has been around a long time, but to the best of my knowledge it has not been used in QMRA analysis. Briefly, microsimulation models a large population of individuals, tracking each individual independently of all others in the population. Obviously, as the population gets large and the ratio of total simulation length to individual calculation step gets high, the computational load increases dramatically. By way of example, simulating twenty individual people on a daily basis for thirty days is a relatively small simulation (20 × 30 = 600 individual-days). Increase the population to 10,000 people running daily for five years and the work grows to 10,000 × 1,825 ≈ 18.25 million individual-days, roughly a 30,000-fold increase, probably to the point where attempting to run the simulation on a single, multi-core personal computer is impractical. However, this is a perfect application for parallel simulation on a computer cluster.
By way of background, for several years now I’ve written simulation models in the R language using parallelization methods. I really like this approach because I can develop and test parallel models using well-tested R packages such as “doMPI” and “foreach”, which simplify the process greatly. I write and debug small “toy” models on my PC; when they run and give reasonable results, I upload them to a cluster, change the parameter values to do meaningful-sized simulations, and then run them using somewhere between 300 and 400 cores.
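For readers unfamiliar with that workflow, here is a minimal sketch of the pattern I mean. The simulation body is a stand-in, not my QMRA code, and the probabilities and sizes are illustrative only:

```r
library(doMPI)
library(foreach)

# Start an MPI cluster and register it as the backend for %dopar%
cl <- startMPIcluster()
registerDoMPI(cl)

# One small "simulation" per person, run in parallel across the workers;
# .combine = rbind stacks the per-person results into one matrix
results <- foreach(person = 1:20, .combine = rbind) %dopar% {
  days <- 30
  # placeholder for a real daily infection model
  infected <- rbinom(days, size = 1, prob = 0.05)
  c(person = person, totalInfections = sum(infected))
}

closeCluster(cl)
mpi.quit()
```

The appeal is that the same script runs on a laptop and on a cluster; typically only the launch command (e.g. via mpirun) and the parameter values change between the two.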
In the most recent work I’d written and collected about fourteen different functions to perform this analysis. Over the years I’ve read about collating R functions into an R package for two reasons: first, to make them simpler for me to use myself, and second, to make them easier to distribute to anyone else who might want to use them. So this was my first attempt at writing an R package, using RStudio and packages such as “devtools” and “roxygen2”, both of which are excellent, I might add. It all started so well, and got so complicated (at least for me!), so quickly.
Complications of parallelization within an R package
So I pulled the functions together into an R package, and the problems began. As I read and dug around, I learned a couple of critical things about R packages; it was also almost impossible to find an example of a package containing a foreach loop. The two main problems I faced were:
- when building an R package, package code is evaluated at build time, not when you subsequently call the package’s functions at a later time
- the foreach package creates a new “environment” each time it executes a foreach loop, and this new environment does not see any of the variables in higher-level environments
The main problem, therefore, was how to pass variables to functions that are within, or called by, the foreach loop. I could hard-code any required arguments to functions, but I could not change the value of an argument prior to running the top-level function in the package. By way of example, the QMRA analysis requires a parameter giving the length of the immune period, i.e. once you have had the disease, are you immune to reinfection for 30, 90, 120, 180 … days? I couldn’t change that value without going inside the R package itself.
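To make the problem concrete, here is a hypothetical sketch of the failing pattern. The function and variable names are illustrative, not taken from my actual package:

```r
# A package helper that needs the immune period. Because it is defined
# in the package, its enclosing environment is the package namespace,
# which is fixed when the package is built.
simulatePerson <- function(id, nDays) {
  # immunePeriod is not an argument here, so R looks it up in the
  # package namespace -- whatever value was baked in at build time
  infections <- rbinom(nDays, size = 1, prob = 1 / immunePeriod)
  sum(infections)
}

# The top-level package function the user actually calls
runSim <- function(nPeople, nDays, immunePeriod) {
  foreach(person = seq_len(nPeople), .combine = c) %dopar% {
    # On a worker, simulatePerson() cannot see the immunePeriod
    # argument passed to runSim(): lexical scoping sends the lookup
    # to the package namespace instead, so the new value is ignored
    # (or the lookup fails outright if it was never defined there)
    simulatePerson(person, nDays)
  }
}
```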
After much digging around, reading books, and getting help on StackOverflow, I still couldn’t get it to work. Very fortunately Steve Weston, now at Yale University and previously at Revolution Analytics, answered an email and helped me work out a solution. Steve is a parallel computing specialist and author/co-author of both the foreach and doMPI packages! Thanks a bunch, Steve. The solution we came up with does some tricky stuff with environments to pass argument values on to whatever function within the foreach loop requires them. EUREKA, a solution!!
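The full details are in the example package linked below, but the core of the environment trick looks roughly like the following. This is a sketch under my own assumptions, reusing the illustrative names from above, not the package code itself:

```r
runSim <- function(nPeople, nDays, immunePeriod) {
  # Build a new environment holding the run-time argument value, with
  # the helper's original namespace environment as its parent so the
  # helper can still find the package's other functions
  e <- new.env(parent = environment(simulatePerson))
  e$immunePeriod <- immunePeriod

  # Make a local copy of the helper whose enclosing environment is e.
  # When foreach ships this copy to the workers, e (and therefore the
  # freshly supplied immunePeriod value) is serialized along with it.
  sim <- simulatePerson
  environment(sim) <- e

  foreach(person = seq_len(nPeople), .combine = c) %dopar% {
    sim(person, nDays)
  }
}
```

Because the new environment’s parent is the package namespace, the copied helper keeps access to everything else in the package while also seeing the argument value supplied at call time.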
Where can you find the solution?
I wrote an R package of a small toy model that contains all the features of my complex QMRA model. Rather than trying to explain it all here, if you have some familiarity with R you will probably understand the solution better from the code. I’ve attempted to be explicit with comments, but will undoubtedly have overlooked something, so feel free to use this example and/or ask for clarification if necessary.
To repeat: this is a solution for building an R package that contains a parallel-computing foreach loop, while retaining the ability to change the values of arguments at the time of calling the functions within the package.
The example package, along with an R script to execute it, can be found here. Make sure you have everything you need to run it, such as a parallel backend like Open MPI and the R packages doMPI and foreach.