Whether you’re learning basic coding in R or mastering the intricacies of mixed-model ANOVAs in SAS, it helps to have some data to practice with. I frequently make up small data sets for this blog based on random number sequences or subsets of other data I’ve collected, but when I need larger data sets these are the ones I like.
I frequently use the Meuse data set, originally published by Burrough & McDonnell in 1998 and available in the R package sp. It includes topsoil heavy metal concentrations in the floodplain of the Meuse river in the Netherlands. I like this data set because it can be used for spatial or non-spatial analysis examples, and it’s about the right size for my uses (small enough to run quickly, but large enough to be interesting).
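If you want to try the Meuse data yourself, here is a minimal sketch of loading it in R and using it both ways. It assumes the sp package is already installed; the particular model and the object names (fit, meuse_sp) are only illustrations I picked for this example, not a recommended analysis.

```r
# Assumes the sp package is installed: install.packages("sp")
library(sp)

# Load the Meuse data as an ordinary data frame
data(meuse)
str(meuse)

# Non-spatial example: zinc concentration against distance to the river
# (this model is an illustrative choice, not a recommended analysis)
fit <- lm(log(zinc) ~ sqrt(dist), data = meuse)
summary(fit)

# Spatial example: promote a copy to a spatial object using the x/y columns
meuse_sp <- meuse
coordinates(meuse_sp) <- ~ x + y
spplot(meuse_sp, "zinc")
```

Working on a copy (meuse_sp) keeps the original data frame around for non-spatial examples.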
Web Soil Survey is also a great resource for spatial data sets. You can download data from an area of interest to you or your students and easily have access to soil-related data sets of different sizes. I have also blogged about accessing WSS in other posts in the past (here, and probably elsewhere).
While deciding if Tableau was a tool that made sense for my data visualization needs, I played around with some of their downloadable data sets. I am not investing time in using Tableau right now, but I was very impressed with the diversity of public data sets they have available for download. I’m sure you can find many options related to your students’ interests, and you also don’t need to have a Tableau account to access the data downloads.
If you’re looking for some larger data sets or even genomic data for more advanced classes, definitely check out this Awesome Public Datasets list on GitHub. I like that it’s sorted by topic (I’ve clicked through things primarily in the agriculture, biology and economics sections) and that it has enough variety that you could teach everything from freshman stats 101 through a grad level genetics or machine learning class without necessarily having to dig for other data sources.
You may notice that Fisher’s Iris data set is not on my list. While this data set was commonly included in the statistics courses I took in undergrad, and it is still used by many online statistics tutorials, I have not used the data since learning about its origins as part of the eugenics movement. You can learn more about the Iris data and its ubiquity in data science on its Wikipedia page, but the short version is that this is a multivariate data set of sepal and petal lengths and widths for three species of iris. For a more detailed breakdown of the ethical and practical reasons to use alternative data sets, read this post by Armchair Ecology.
Whether you’re learning new analysis skills or teaching a class, it’s really helpful to have some data sets to work with. The above resources are where I usually turn for data sets, and I’d love to learn about your favorites in the comments section!