--- title: "Getting Started With The treedata.table Package" author: "Josef Uyeda, Cristian Roman-Palacios, April Wright" date: "08/08/2020" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started With The treedata.table Package} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Getting Started With The `treedata.table` Package The aim of the `treedata.table` R package is to allow researchers to access and manipulate phylogenetic data using tools from the `data.table` package. `data.table` has many functions for rapidly manipulating data in a memory efficient way. Using the `treedata.table` package begins with creating a `treedata.table` object. The `treedata.table` matches the tip.labels of the phylogeny to a column of names in your `data.frame`. This allows you to manipulate the data, and the corresponding tree together. Importantly, the character matrix must must include a column with the taxa names and should be of class `data.frame`. The tree must be of class `phylo` or `multiPhylo`. A `treedata.table` is created using the `as.treedata.table` function. Here we use the *Anolis* dataset from `treeplyr`. Traits in this dataset were randomly generated for a set of 100 species. ```{r} library(ape) library(treedata.table) # Load example data data(anolis) #Create treedata.table object with as.treedata.table td <- as.treedata.table(tree = anolis$phy, data = anolis$dat) ``` We may inspect our object by calling it by name. You will notice that your `data.frame` is now a `data.table`. A `data.table` is simply an advanced version of a `data.frame` that, among other, increase speed in data manipulation steps while simplifying syntax. ```{r} td ``` Furthermore, the new `data.table` has been reordered into the same order as the tip.labels of your tree. ```{r} td$phy$tip.label == td$dat$tip.label ``` ## Manipulating Data ### Coindexing Your data table can be indexed in the same way any other `data.table` object would be. For example, if we wanted to look at our snout-vent length column, we can do that like so. ```{r pressure, echo=TRUE} td$dat[,'SVL'] ``` You can also use double bracket syntax to directly return column data as a named list. ```{r} td[["SVL"]] ``` The same functionality can also be accomplished through the `extractVector` function. Both the double bracket syntax and the `extractVector` function will return a named vector. ```{r} extractVector(td, 'SVL') ``` Multiple traits can also be extracted using `extractVector`. ```{r} extractVector(td, 'SVL','ecomorph') ``` However, there's a couple aspects that are unique to `[[.treedata.table()` and `extractVector()`. First, `[[.treedata.table()` has an extra exact argument to enable partial match (i.e. when target strings and those in the `treedata.table` object match partially). Second, `extractVector()` can extract multiple columns and accepts non-standard evaluation (i.e. names are treated as string literals). The real power in treedata.table is in co-indexing the tree and table. For example, in the below command, we use `data.table` syntax to take the first representative from each ecomorph. We retain all data columns. If you examine the tree object, you will see that it has had all the tips absent from the resultant `data.table`. ```{r} td[, head(.SD, 1), by = "ecomorph"] ``` We could also do the same operation with multiple columns: ```{r} td[, head(.SD, 1), by = .(ecomorph, island)] ``` Tail is also implemented ```{r} td[, tail(.SD, 1), by = "ecomorph"] ``` Columns in the `treedata.table` object can also be operated on using general `data.table` syntax. In the below example, the tree is pruned to those tips that occur in Cuba. This is the `data.table` equivalent of `dplyr`'s filter. Then, a new column is created in the `data.table`, assigned the name "Index", and assigned the value of the SVL + the hostility index. This enables concurrent manipulation of the phylogeny, and the calculation of a new index for only those tips we would actually like to use. ```{r} td[island == "Cuba",.(Index=SVL+hostility)] ``` ### Running functions on `treedata.table` objects In the below command, we extract one vector from our data.table and use `geiger`'s continuous model fitting to estimate a Brownian motion model for the data using the `tdt` function. ```{r} tdt(td, geiger::fitContinuous(phy, extractVector(td, 'SVL'), model="BM", ncores=1)) ``` ### Dropping and extracting taxa from `treedata.table` objects We can also drop tips directly from the tree, and have those tips drop concurrently from the data.table. In the example below, we remove two taxa by name. ```{r} dt <- droptreedata.table(tdObject=td, taxa=c("chamaeleonides" ,"eugenegrahami" )) ``` We can check if *A. chamaeleonides* and *A. eugenegrahami* are still in the tree ```{r} c("chamaeleonides" ,"eugenegrahami" ) %in% dt$phy$tip.label ``` And we can do the same with the data in the `treedata.table` object ```{r} c("chamaeleonides" ,"eugenegrahami" ) %in% dt$dat$X ``` When you're done, the data.table and tree can both be extracted from the object: ```{r} df <- pulltreedata.table(td, "dat") tree <- pulltreedata.table(td, "phy") ``` The table ```{r} df ``` and the corresponding tree ```{r} tree ```