Before I learned R, I had only programmed in MATLAB. MATLAB and R have comparable options for data types and treat objects very similarly, but R, and specifically the Tidyverse, has a whole other vocabulary for data structures. Understanding the differences between matrices, data frames, and other data structures made learning R easier for me and, most importantly, taught me the vocabulary I needed to write effective Google searches to answer my R questions.
Before getting into data structures, let’s get on the same page about three other terms: objects, data types, and the Tidyverse. Pretty much every beginner R class will include the phrase “everything in R is an object” at some point or another, but what does “everything” mean? Objects include all the things listed in your global environment. They are any of the stored variables, plots, data, etc. that you could run a function on in R. To put it another way, if you think of functions in R as verbs, an object in R is a noun you could do that verb to. To continue this metaphor into a full ridiculous sentence: you, the programmer, are the subject, the function is the verb, and, well, the objects are the objects.
Data types are pretty much exactly what they sound like. Is this data an integer (3, which R writes as 3L), a number (2.01), a character string (“C” or “corn”), a logical value (TRUE), or something else? R does not distinguish a single character from a longer string; both are character data. There are many data types in R, and different data types will behave differently in many functions. For instance, summary() will provide the mean, maximum, minimum, and a few other summary statistics for data that are numbers. If your data are logical, summary() will provide a tally of how many observations are TRUE, FALSE, or NA. Also, you can’t make a histogram from string data, so check your data type if you get error messages while you’re plotting!
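Here is a quick sketch of checking data types and watching summary() change its behavior. The vectors heights and is_tall are made-up examples, not data from anywhere in particular.

```r
# Check the type of a few values with class()
class(3L)      # "integer" (the L suffix asks for an integer)
class(2.01)    # "numeric"
class("corn")  # "character"
class(TRUE)    # "logical"

# summary() adapts to the data type it is given
heights <- c(1.5, 2.0, 2.5, NA)      # hypothetical numeric data
is_tall <- c(FALSE, TRUE, TRUE, NA)  # hypothetical logical data

num_summary <- summary(heights)  # min, quartiles, median, mean, max, NA count
lgl_summary <- summary(is_tall)  # tally of FALSE, TRUE, and NA

# hist() needs numbers, so character data triggers an error.
# tryCatch() captures the error message instead of stopping the script.
hist_result <- tryCatch(
  hist(c("corn", "wheat")),
  error = function(e) conditionMessage(e)
)
```

If a plot fails, running class() on the column you are plotting is usually the fastest first check.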
Finally, the Tidyverse is a suite of R packages developed by Hadley Wickham and collaborators. Packages in R have differing levels of support and upkeep, but since the Tidyverse packages are supported by RStudio (now Posit), they are some of the most stable and widely used packages in data science. You don’t need to use packages to use R, but packages are really useful and contain specialized functions that can make your programming faster. If you’re interested in learning more about the Tidyverse, I recommend Charlie Hadley’s Learning the R Tidyverse course on LinkedIn Learning.
So what are data structures? Data structures describe how different data are stored and organized. Common data structures are vectors, matrices, data frames, and lists. If you are using the Tidyverse, you may also see tibbles.
Vectors are a one-dimensional sequence of data points. You will often see the dimensions of a vector described in the form 1xY, where Y is the number of elements in the vector; in R itself, length() reports Y. Vectors can hold numeric, integer, character, or logical data, but all elements in a single vector have to be the same data type. If you mix data types in a vector, R quietly coerces every element to one common type (often character), which can prevent functions from running as they should on the vector.
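To make the coercion point concrete, here is a small sketch with made-up values. The names yields, crops, and mixed are just for illustration.

```r
# Every element of a vector shares one data type
yields <- c(4.2, 3.8, 5.1)           # numeric vector
crops  <- c("corn", "wheat", "soy")  # character vector

length(yields)  # number of elements in the vector
mean(yields)    # works because the vector is numeric

# Mixing types does not fail; R quietly coerces everything to one type
mixed <- c(4.2, "corn", TRUE)
class(mixed)    # "character"

# mean() can no longer treat the values as numbers, so it returns NA
# (with a warning, which is silenced here to keep the output clean)
mixed_mean <- suppressWarnings(mean(mixed))
```

Notice that R never complained when mixed was created; the problem only shows up later, when a function expects numbers.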
Matrices are similar to vectors in that all of their elements have the same data type. But matrices are two-dimensional: YxZ, where Y is the number of rows and Z is the number of columns. An even more general data structure is an array, which can have any number of dimensions (XxYxZ and so forth).
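A minimal sketch of a matrix and an array, using arbitrary numbers just to show the dimensions:

```r
# A 2x3 matrix; R fills it column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)    # 2 3 (rows, then columns)
m[2, 3]   # element in row 2, column 3, which is 6

# An array generalizes this to more dimensions (here 2x3x4)
a <- array(1:24, dim = c(2, 3, 4))
dim(a)    # 2 3 4

# A plain vector has a length but no dim attribute
v <- 1:6
dim(v)    # NULL
```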
Data frames are a collection of vectors, one per column. Within each vector, every element has to have the same data type, but different vectors in the same data frame can have different data types. You can run functions on whole data frames or on a single vector within a data frame. If you’re used to working in SAS, data frames are the most similar R data structure to SAS data sets.
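Here is a small example data frame with three columns of different types. The data are invented, and stringsAsFactors = FALSE is set explicitly so the example behaves the same on older and newer versions of R.

```r
# Three vectors of the same length, each with its own data type
crops <- data.frame(
  name    = c("corn", "wheat", "soy"),
  yield   = c(4.2, 3.8, 5.1),
  organic = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep names as character on any R version
)

col_types <- sapply(crops, class)  # one data type per column
avg_yield <- mean(crops$yield)     # run a function on a single column
summary(crops)                     # run a function on the whole data frame

nrow(crops)  # number of rows (observations)
ncol(crops)  # number of columns (variables)
```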
Like data frames, lists can also contain different types of data. But in a data frame, all vectors within the frame have to have the same length. In a list, you can mix vectors, matrices, and other data structures, all of different sizes. The output of many functions in R is a list.
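A rough sketch of a list holding pieces of different shapes, followed by one example of a function whose output is a list. I picked lm() and the built-in cars data set for the second part; any modeling function would make the same point.

```r
# A list can hold structures of different types and sizes
stuff <- list(
  crops = c("corn", "wheat"),      # character vector of length 2
  grid  = matrix(1:4, nrow = 2),   # 2x2 integer matrix
  flag  = TRUE                     # single logical value
)

length(stuff)        # number of items in the list
stuff$grid           # pull out one item by name
stuff[["crops"]][2]  # second element of the crops vector, "wheat"

# Many functions return lists; a fitted linear model is one example
fit <- lm(dist ~ speed, data = cars)
is.list(fit)  # TRUE
names(fit)    # the pieces stored inside the model object
```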
Tibbles are the final type of data structure I want to talk about today. Tibbles are basically the Tidyverse’s take on a data frame. They look different when you print them, and they are stricter than data frames: they never rename your variables or convert your data types automatically. The second point is really helpful if you work with character data, since versions of R before 4.0.0 converted strings to factors by default when building a data frame. My favorite part about tibbles is that they let you use variable names that have spaces, special characters, or start with numbers. I work with many collaborators who send me Excel spreadsheets of data, and frequently their column names are not compatible with the default rules for variable names in data frames. Tibbles have saved me a ton of time and hassle.
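Here is a small comparison of how column names are handled, assuming the tibble package is installed. The column names are made up to look like something a collaborator might type into Excel.

```r
library(tibble)  # assumes the tibble package is installed

# A tibble keeps unusual column names exactly as typed
tb <- tibble(
  `2023 yield` = c(4.2, 3.8),
  `crop name!` = c("corn", "wheat")
)
names(tb)         # "2023 yield" "crop name!"
tb$`2023 yield`   # backticks let you refer to the column

# data.frame() with its default settings rewrites the names
df <- data.frame(
  `2023 yield` = c(4.2, 3.8),
  `crop name!` = c("corn", "wheat")
)
names(df)         # "X2023.yield" "crop.name."

# A tibble is still a data frame underneath
is.data.frame(tb) # TRUE
```

Base R can keep the original names too if you pass check.names = FALSE to data.frame(); tibbles just make that the default.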
I hope this post added some new words to your R vocabulary! If you would like to learn more about these data structures, I highly recommend this article from Software Carpentry. It has many examples and great descriptions of how to create and manipulate objects in R.