class: center, middle, inverse, title-slide # Reading Files ### Sam Tyner, Haley Jeppson ### 2018/06/06 --- class: primary # Data in Excel - Formats xls and csv - what's the difference? - File extensions xls and xlsx are proprietary Excel formats, binary files - csv is an extension for Comma Separated Value files. They are text files - directly readable. - Example: Glass element analysis data --- class: primary # Glass data Visit [this link](https://raw.githubusercontent.com/CSAFE-ISU/REU-materials-2018/master/slides/rstats/data/GK_clean_results.csv) to get the data. ```r library(readr) glass <- read_csv("https://raw.githubusercontent.com/CSAFE-ISU/REU-materials-2018/master/slides/rstats/data/GK_clean_results.csv") head(glass) ``` ``` ## # A tibble: 6 x 20 ## Piece Rep Li7 Na23 Mg25 Al27 K39 Ca42 Ti49 Mn55 Fe57 Rb85 ## <int> <int> <dbl> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int> <dbl> ## 1 1 1 2.77 107280 24610 1622 743. 67570 115. 128. 545 1.04 ## 2 1 2 2.93 106760 24290 1620 732. 67220 118. 129. 532 1.07 ## 3 1 3 2.73 106500 24410 1639 727. 68120 117. 130. 538 1.08 ## 4 1 4 2.69 104740 24520 1632 716. 67410 118. 129. 541 1.16 ## 5 1 5 2.71 105670 24400 1624 724. 67610 119. 127. 542 1.14 ## 6 2 1 2.44 104790 24380 1639 716. 68040 118. 128. 532 1.05 ## # ... with 8 more variables: Sr88 <dbl>, Zr90 <dbl>, Ba137 <dbl>, ## # La139 <dbl>, Ce140 <dbl>, Nd146 <dbl>, Hf178 <dbl>, Pb208 <dbl> ``` --- class: primary # `read_csv` vs. `read_*` `read_csv` is just one way to read a file using the `readr` package: - `read_delim`: the most generic function. Use the `delim` argument to read a file with any type of delimiter - `read_tsv`: read tab separated files - `read_lines`: read a file into a vector that has one element per line of the file - `read_file`: read a file into a single character element - `read_table`: read a file separated by space --- class: primary # Now we can make a plot! ```r library(ggplot2) ggplot(glass, aes(x = Li7, y = Na23)) + geom_point() ``` ![](03-r-files_files/figure-html/unnamed-chunk-2-1.png)<!-- --> --- class: primary # Read Excel Data Download the boxoffice.xlsx file from [this link](https://github.com/CSAFE-ISU/REU-materials-2018/blob/master/slides/rstats/data/boxoffice.xlsx) and put it in your "LearningR" folder. Read it in and save it as `boxofc` ```r library(readxl) boxofc <- read_excel("boxoffice.xlsx") head(boxofc) ``` ``` ## # A tibble: 6 x 11 ## Rank Rank.Last.Week Movie Distributor Gross Change Thtrs. `Per Thtr.` ## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> ## 1 1. -1 Split Universal 2.57e7 -0.36 3199. 8020. ## 2 2. new A Dog… Universal 1.82e7 <NA> 3059. 5957. ## 3 3. -3 Hidde… 20th Centu… 1.40e7 -0.11 3351. 4179. ## 4 4. new Resid… Sony Pictu… 1.36e7 <NA> 3104. 4382. ## 5 5. -5 La La… Lionsgate 1.22e7 0.45 3136. 3901. ## 6 6. -2 xXx: … Paramount … 8.60e6 -0.56… 3651. 2356. ## # ... with 3 more variables: `Total Gross` <dbl>, Days <dbl>, Date <dttm> ``` --- class: primary # `foreign` Package - Other file formats can be read using the functions from package `foreign` - SPSS: `read.spss` - SAS: `read.xport` - Minitab: `read.mtp` - Systat: `read.systat` --- class: primary # Your Turn 3.1 The NHANES (National Health and Nutrition Survey) publishes data in the SAS xport format: https://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2013 1. Scroll to the bottom, choose one of the datasets (Demographics, Examination, etc.). Download the Data file (XPT) 2. Use `read.xport()` to load the file into R 3. Briefly examine the dataset you've imported (use `head` or `tail`, etc) --- class: primary # Other Data Types What can you think of? --- class: primary # Google Sheets ```r install.packages("googlesheets") library(googlesheets) gap <- gs_key("1D5FK6z7aGukdGRAJQoyuPc2QV5zSJ4oN7oslp_u7YEQ") gs_browse(gap) africa <- gs_read(gap, "Africa") americas <- gs_read(gap, "Americas") ``` ``` ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Algeria Africa 1952. 43.1 9279525. 2449. ## 2 Algeria Africa 1957. 45.7 10270856. 3014. ## 3 Algeria Africa 1962. 48.3 11000948. 2551. ## 4 Algeria Africa 1967. 51.4 12760499. 3247. ## 5 Algeria Africa 1972. 54.5 14760787. 4183. ## 6 Algeria Africa 1977. 58.0 17152804. 4910. ``` --- class: primary # Text files Data is [here](https://raw.githubusercontent.com/CSAFE-ISU/REU-materials-2018/master/slides/rstats/data/poem.txt) ```r poem <- read_lines("https://raw.githubusercontent.com/CSAFE-ISU/REU-materials-2018/master/slides/rstats/data/poem.txt") poem ``` ``` ## [1] "Because I could not stop for Death -" ## [2] "He kindly stopped for me -" ## [3] "The Carriage held but just Ourselves -" ## [4] "and Immortality" ``` --- class: primary # Your Turn 3.2 Read in the rest of the Google Sheets from the Gapminder data set and combine them into one data set with the `bind_rows()` function in the `dplyr` package. Save it as `gapdata`. --- class: primary # Plot `gapdata` ```r library(ggplot2) ggplot(data = gapdata) + geom_point(aes(x = log(gdpPercap), y = lifeExp, color = continent, size = pop)) + facet_wrap(~continent) + theme(legend.position = c(.75,.25)) ``` --- class: secondary <img src="03-r-files_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />