Part 1: Introduction to Data

class: center, middle, inverse, title-slide

# Part 1: Introduction to Data
### Sam Tyner
### 2018/06/05

---

class: primary
# Textbook

These slides are based on the book *OpenIntro Statistics* by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel

The book can be downloaded from [https://www.openintro.org/stat/textbook.php](https://www.openintro.org/stat/textbook.php)

Part 1 corresponds to Chapter 1 of the text, and Sections 1.1-1.7 of these slides match Sections 1.1-1.7 of that chapter.

---
class: primary
# Outline

- Introductory example (1.1)
- Data basics (1.2)
- Data collection principles (1.3)
- Observations and samples (1.4)
- Experiments (1.5)
- Looking at numerical data (1.6)
- Looking at categorical data (1.7)

---
class: inverse, center 
# Section 1.1: Introductory example

---
class: primary
# Why Statistics?

- Good scientists use rigorous methods and make careful observations

--
  
- Observations aka **data**: the backbone of a statistical investigation

- **Statistics**: "the study of how best to collect, analyze, and draw conclusions from data" (from the textbook)

- **Statistics**: "a branch of mathematics dealing with the collection, classification, analysis, interpretation of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability) of data." (from [Wikipedia](https://en.wikipedia.org/wiki/Statistics))

- **Statistics**: "the study of variation" (my definition)

---
class: primary
# Statistics in context

General investigative process:

1. Identify a question or problem.
2. Collect relevant data on the topic. 
3. Analyze the data.
4. Form a conclusion.

Where do you think "statistics" fits in?

---
class: primary
# Statistics in context

General investigative process:

1. Identify a question or problem.
2. .red[Collect relevant data on the topic.] 
3. .red[Analyze the data.]
4. .red[Form a conclusion.]

.red[Statistics focuses on making stages 2-4 objective, rigorous, and efficient]

---
class: primary
# The three pieces of statistics

1. How *best* can we collect data? 
2. How *should* it be analyzed? 
3. And what can we *infer* from the analysis?

---
class: primary
# Classical application

Medical trials - e.g. testing a new drug

1. Identify a question or problem.
2. Collect relevant data on the topic. 
3. Analyze the data.
4. Form a conclusion.

---
class: primary
# Step 1

Step 1: Identify a question or problem.

*Does the use of stents reduce the risk of stroke?*

Will patients with stents inserted have better outcomes (fewer and/or more minor strokes) than patients without stents inserted?

Stents - "a metal or plastic tube inserted into the lumen of an anatomic vessel or duct to keep the passageway open" ([Wikipedia](https://en.wikipedia.org/wiki/Stent))

---
class: primary
# Step 2

Step 2: Collect relevant data on the topic.

451 at-risk patients volunteered to be studied. Split into 2 groups:

- **Treatment** group (224 patients): Received a stent and medical management (medications, management of risk factors, and help in lifestyle modification)
- **Control** group (227 patients): Received same medical management as treatment group, but did not receive stents.

What is the purpose of the control group?

---
class: primary
# Step 2

Step 2: Collect relevant data on the topic.

451 at-risk patients volunteered to be studied. They were split into 2 groups: a treatment group (224 patients) and a control group (227 patients).

What is the purpose of the control group?

.red[The control group provides a reference point against which we can measure the medical impact of stents in the treatment group.]

---
class: primary
# Step 2 (cont.)

- Researchers looked at 2 time points: 30 days after enrollment and 365 days after enrollment 
- Results of 5 patients:

Patient | group | after 30 days | after 365 days
:------:|:------|:-------------:|:-------------:
1       | treatment | no event  | no event
2       | treatment | stroke    | stroke 
3       | treatment | no event  | no event 
...     | ...       |   ...     |  ...
450     | control   | no event  | no event
451     | control   | no event  | no event

---
class: primary
# Step 2 (cont.)

All outcomes in all patients in all groups:

group | timepoint | outcome | count 
:-----|:---------:|:-------:|:----:
treatment | 0-30 days | stroke | 33
treatment | 0-30 days | no event | 191  
treatment | 0-365 days | stroke | 45
treatment | 0-365 days | no event | 179
control | 0-30 days | stroke | 13
control | 0-30 days | no event | 214
control | 0-365 days | stroke | 28
control | 0-365 days | no event | 199
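
A quick consistency check on the table above, sketched in Python (the variable names are illustrative and not from the textbook): within each group and time window, the stroke and no-event counts should add up to the group size given in Step 2 (224 treatment, 227 control).

```python
# Counts copied from the table above: (group, time window, outcome, count).
counts = [
    ("treatment", "0-30 days", "stroke", 33),
    ("treatment", "0-30 days", "no event", 191),
    ("treatment", "0-365 days", "stroke", 45),
    ("treatment", "0-365 days", "no event", 179),
    ("control", "0-30 days", "stroke", 13),
    ("control", "0-30 days", "no event", 214),
    ("control", "0-365 days", "stroke", 28),
    ("control", "0-365 days", "no event", 199),
]

# Group sizes reported in Step 2.
group_sizes = {"treatment": 224, "control": 227}

# Sum the counts within each (group, time window) pair.
totals = {}
for group, window, _outcome, count in counts:
    totals[(group, window)] = totals.get((group, window), 0) + count

# Every total should equal the size of its group.
for (group, window), total in totals.items():
    matches = total == group_sizes[group]
    print(f"{group:9s} {window:10s} total = {total} (matches group size: {matches})")
```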

---
class: primary
# Step 3

Step 3: Analyze the data.

**Summary statistic** - one numerical value that summarizes a large amount of data

What do you think would be a good summary statistic for this data?

---
class: primary
# Step 3

Step 3: Analyze the data

**Summary statistic** - one numerical value that summarizes a large amount of data

What do you think would be a good summary statistic for this data? .red[Proportion of patients having a stroke]

- Proportion of patients who had a stroke in the treatment group: 45/224 ≈ 0.20 = 20%.
- Proportion of patients who had a stroke in the control group: 28/227 ≈ 0.12 = 12%.
- Overall proportion of patients who had a stroke: 73/451 ≈ 0.16 = 16%.
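
The proportions above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the 365-day stroke counts and the group sizes from the earlier slides; the dictionary names are made up for illustration.

```python
# Stroke counts within 365 days and group sizes, taken from the slides above.
strokes = {"treatment": 45, "control": 28}
group_sizes = {"treatment": 224, "control": 227}

# Proportion of patients in each group who had a stroke.
proportions = {group: strokes[group] / group_sizes[group] for group in strokes}

# The overall proportion pools both groups together.
overall = sum(strokes.values()) / sum(group_sizes.values())

for group, p in proportions.items():
    print(f"{group}: {p:.2f}")
print(f"overall: {overall:.2f}")
```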

---
class: primary
# Step 4

Step 4: Form a conclusion.

Surprising result: treatment group has a higher rate of stroke (8 percentage points more)

- Not what doctors expected
- Is this difference meaningful?

Not yet equipped with statistical tools to answer the 2nd question: is the difference between stent and non-stent groups so large that we should reject the notion that it was due to random variation?

We'll get there eventually!
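
As a rough sketch, the size of the gap between the two groups can be computed directly from the 365-day counts. The variable names are illustrative. Note that this only measures the difference; it does not tell us whether the difference could be due to random variation.

```python
# 365-day stroke counts and group sizes from the earlier slides.
treatment_strokes, treatment_size = 45, 224
control_strokes, control_size = 28, 227

# Stroke rate in each group.
treatment_rate = treatment_strokes / treatment_size
control_rate = control_strokes / control_size

# Difference between the rates, expressed in percentage points.
diff_points = 100 * (treatment_rate - control_rate)

print(f"treatment: {100 * treatment_rate:.1f}%")
print(f"control:   {100 * control_rate:.1f}%")
print(f"difference: {diff_points:.1f} percentage points")
```

Working from the unrounded rates gives a gap of a little under 8 percentage points; the figure of 8 comes from subtracting the rounded values, 20% and 12%.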

---
class: primary
# Your Turn 1.1.1

Researchers studying the effect of temperature on glass looked at the refractive index (RI) measured on 89 different types of glass before and after being exposed to extreme temperatures. The RI value on all glass types was recorded, then the glass was exposed to very cold ( `\(0^\circ\)` F) or very hot ( `\(400^\circ\)` F) conditions for 20 minutes. After the glass returned to room temperature, the researchers measured the RI again to see if it was the same or if it had changed. Results are summarized below:

Group | change | no change | Total
:-----|:------:|:---------:|------:
Heat  | 10     | 33        | 43
Cold  |  2     |  44       |   46
Total | 12     | 77        | 89

---
class: primary
# YT 1.1.1 (cont.)

1. What percent of observations had a change in RI after being exposed to extreme heat? What percent of observations had a change in RI after being exposed to extreme cold?
2. At first glance, does exposure to extreme temperatures appear to have an effect on RI of glass? Explain.
3. Do the data provide convincing evidence that there is a real change in RI for either group? 
4. Do you think the effect may just be due to random variability (as opposed to temperature)?

---
class: primary
# YT 1.1.1 (soln.)

1. What percent of observations had a change in RI after being exposed to extreme heat? .red[ `\(\frac{10}{43} \approx 0.23 = 23\%\)` ]
 What percent of observations had a change in RI after being exposed to extreme cold? .red[ `\(\frac{2}{46} \approx 0.04 = 4\%\)` ]
2. At first glance, does exposure to extreme temperatures appear to have an effect on RI of glass? Explain. .red[Heat, yes. Cold, no.]
3. Do the data provide convincing evidence that there is a real change in RI for either group? .red[(discussion)]
4. Do you think the effect may just be due to random variability (as opposed to temperature)? .red[(discussion)]
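
The percentages in question 1 can be checked with a short Python sketch. The counts come from the Your Turn 1.1.1 table; the names `glass` and `percent_changed` are only illustrative.

```python
# Counts from the Your Turn 1.1.1 table: (number that changed, group total).
glass = {"Heat": (10, 43), "Cold": (2, 46)}

# Percent of observations in each group whose RI changed.
percent_changed = {
    group: 100 * changed / total for group, (changed, total) in glass.items()
}

for group, pct in percent_changed.items():
    print(f"{group}: {pct:.0f}% changed")
```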

---
class: inverse, center
# Section 1.2: Data basics

---
class: primary
# Data "storage"

Data are stored in **tables** aka **matrices**

![Red Pill or blue pill?](img/neo.jpg)

---
class: primary
# Data "storage"

Data are stored in **tables** aka **matrices**:

- each row is an **observation**
- each column is a **variable**
- data are stored in a **matrix**
- Example: 6 randomly selected observations of 6 variables from a glass element analysis (the last 3 columns are in ppm)

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> pane </th>
   <th style="text-align:right;"> Piece </th>
   <th style="text-align:right;"> Rep </th>
   <th style="text-align:right;"> Li7 </th>
   <th style="text-align:right;"> Na23 </th>
   <th style="text-align:right;"> Mg25 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> P3 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 4.30 </td>
   <td style="text-align:right;"> 109700 </td>
   <td style="text-align:right;"> 24330 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> P1 </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 1.65 </td>
   <td style="text-align:right;"> 101050 </td>
   <td style="text-align:right;"> 23360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> P4 </td>
   <td style="text-align:right;"> 24 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 2.10 </td>
   <td style="text-align:right;"> 105240 </td>
   <td style="text-align:right;"> 24500 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> P2 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 1.77 </td>
   <td style="text-align:right;"> 102200 </td>
   <td style="text-align:right;"> 23180 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> P2 </td>
   <td style="text-align:right;"> 24 </td>
   <td style="text-align:right;"> 18 </td>
   <td style="text-align:right;"> 1.60 </td>
   <td style="text-align:right;"> 101900 </td>
   <td style="text-align:right;"> 23050 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> P1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 1.69 </td>
   <td style="text-align:right;"> 100530 </td>
   <td style="text-align:right;"> 23790 </td>
  </tr>
</tbody>
</table>
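
To make the row/column idea concrete, here is a minimal Python sketch that stores the six rows above as one record per observation. The values are copied from the table; the choice of a list of dictionaries is just one convenient representation and an assumption of this example, not something the slides prescribe.

```python
# Column (variable) names from the table above.
variables = ["pane", "Piece", "Rep", "Li7", "Na23", "Mg25"]

# Each inner list is one row (observation), in the same order as the table.
rows = [
    ["P3", 2, 13, 4.30, 109700, 24330],
    ["P1", 14, 19, 1.65, 101050, 23360],
    ["P4", 24, 10, 2.10, 105240, 24500],
    ["P2", 23, 5, 1.77, 102200, 23180],
    ["P2", 24, 18, 1.60, 101900, 23050],
    ["P1", 1, 4, 1.69, 100530, 23790],
]

# Pair each value with its variable name: one dict per observation.
observations = [dict(zip(variables, row)) for row in rows]

# A column is one variable measured across all observations.
li7 = [obs["Li7"] for obs in observations]

print(len(observations), "observations of", len(variables), "variables")
print("Li7 values (ppm):", li7)
```

Reading across one dictionary gives a single observation; pulling the same key out of every dictionary gives a single variable.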

---
class: primary
# Types of variables

![Types of variables](img/variables.png)

[Link to Image Source (from textbook)](https://github.com/OpenIntroOrg/openintro-statistics/blob/master/ch_intro_to_data/figures/variables/variables.pdf)