
PH 30004

November 1, 2021

Upcoming and updates

Exam 3 Nov 12

Final small assignment – Nov 22

Adjustments this week:

QDAS today; extended time and flexibility for participation points

Possible guest on Friday (will verify on Wed)

Review final proposal details on Wed; revise/refine topics during this week

Adjustments going forward:

Work days (final proposal) plus exam next week – no meetings

Visualization content week of Nov 15th (with posters)

New schedule in Canvas – Nov 1

Aims of qualitative analysis software:

Facilitate organization of data and codes

Facilitate coding processes – especially reapplication of the same code

Enable multiple comparisons:

Code comparisons – how often is a code used? Where is a code used?

Source comparisons – length, type, etc.

Demographics (aka attributes or descriptors) – usually coded at source/case level

Attribute by code

Attribute by source

Visualization – vary by program

Terminology for QDAS

Codes are not always called codes

Nodes – NVivo

Quirks – Quirkos

Codes – Dedoose

Codes – Hyperresearch

MAXQDA

Atlas.ti

F4 analyze

QDAS program comparison

Quirkos –

Web-based or local

Project sharing

Projects can be transferred to other programs

Inexpensive perpetual student license

No audio/visual files

Nvivo –

Primarily local

Project sharing with cloud or server

Projects can be transferred

Limited student license

Kent has limited number of perpetual licenses that can be obtained

QDAS program comparison

Hyperresearch –

Local license

Data not imported into program

Has transcribe add on

Free unlimited trial version

Perpetual student license

Dedoose –

Desktop version (was web until Flash went away)

Monthly fee

Promotes itself as “mixed methods”

When to use QDAS

Multiple coders – use a program that facilitates sharing (web-based)

Multi-media files (all but Quirkos)

Need to do categorical/quantitative comparison

Meta-studies

Data are in electronic format (i.e., not books or physical photos, unless you can scan)

Appropriate qualitative approach:

Grounded theory; descriptive; case study; some narrative; large samples

Not phenomenological; IPA; some narrative; small samples; other non-coding approaches

Can at times do later cycles in QDAS after initial work done in Word or similar program

Update/recap

Review final proposal

Friday – Mendeley presentation

Participation points for attendance

Participation points for turning in demo practice

Reviewing last small assignment – can use Mendeley to complete

QDAS – changing from participation to extra credit (10 points)

Review a couple of other programs today

Next week – two work days plus exam

Exam most likely available on Wed due to holiday

Next week, I will request verification of final proposal topic and make up of any groups

For Friday

Download Mendeley Cite and Mendeley Reference manager

Find three articles to practice with – any journal articles

Download to computer

Newer is better!

PH 30004

Week of Oct 25, 2021

Schedule/recap

Grading status

Exam items

Guest lectures coming up:

Ms. Essel on Mendeley (will verify date)

Mr. Coetzer-Liversage on substance abuse research (mid November)

Others as possible

Exam 3 – Nov 12 (shorter time gap); study guide on Nov 8

Last small assignment – Nov 22 (no live meeting that day)

SURE opportunities – applications in Spring

This week

Ethical research

Basic math

R / R studio – problems on Wed

Quirkos – Friday or next Monday

Overview of other available QDAS

Miro exercise – next week

Three opportunities for participation points

Ethical research practice

CITI

Office of Human Research Protections

Is it research?

Example – journalistic interviews and oral histories are not research – systematic but not generalizable

Does it fall within the exempt categories?

Example – interviews aimed to contribute to generalizable knowledge

Secondary analysis of things that are not research (see above) aimed at developing generalizable knowledge

Challenging human participant contexts

Children – usually defined as under 18

Parental consent and children’s assent/consent are both required

Other vulnerable/disadvantaged populations

Illegal immigrants; refugees and individuals seeking political asylum

Persons in community-based settings where poor research practice has occurred in the past

Not necessarily viewed by IRB as vulnerable but may view themselves as vulnerable

Low income and/or individuals with low levels of educational attainment

Critical need to understand consent wording, research process, and right to withdraw

Three major ongoing challenges – #1 incentives

Increasing emphasis on incentives

Potential for coercive impact on recruitment

Potential to sway responses – participants feel they “owe” something

Can limit research efforts (no budget for incentives = no research)

Might dissuade some otherwise eligible participants (identifying info needed for incentives including at times tax information)

What about incentivizing parents for children’s participation? Is this a good practice?

Research results on the persuasive power of incentives are very mixed

Many will participate due to interest in the subject rather than rewards

Participants should not have to pay (parking, equipment, etc.)

Current/ongoing challenges #2 – archived data

Archiving and re-using data

Adding to archives/secondary data is essential and resource-effective

Where to archive?

De-identifying qualitative data – how much detail can be removed while preserving integrity?

IRBs requesting re-contact

Challenges in finding participants

Is this intrusive? Participants did not consent to re-contact

Ensure consent describes potential for and type of re-use

Historical precedents for reuse that was not always appropriate

How companies like FitBit/Google and MedProctor get your “consent”

Current/ongoing challenges #3 – regulated data

PHI – protected health information – and educational records

The latter include enrollment in a course, attendance, etc. not just grades and scores

HIPAA and FERPA

Records exist; access ranges from impossible to routine

De-identification processes

Storage processes

Privacy officer needs to be involved, even for non-research projects

Ex – follow up calls to remind of medical or advising appointments

Quantitative data analysis

Use math when possible

Phone/online calculator

Physical calculator or machine

Why use a machine with a tape?

Qualtrics built in functions

MS Excel***

Basic mathematical processes

Mean/average (add then divide by n)

Frequency (how many times does something happen in a given time period?)

Expressed as simple tally
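A rough sketch of these two processes in R, using made-up values; the names `scores` and `visits` are hypothetical and chosen only for illustration:

```r
# Hypothetical quiz scores for five students (made-up data)
scores <- c(8, 9, 7, 10, 6)

# Mean: add the values, then divide by n
mean_by_hand <- sum(scores) / length(scores)
mean_builtin <- mean(scores)   # base R function, same result

# Frequency: a simple tally of how many times each value occurs
visits <- c("Mon", "Wed", "Mon", "Fri", "Mon", "Wed")
tally <- table(visits)

print(mean_by_hand)
print(tally)
```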

Proportions example

N = 50

12.5 purchased a course text = (12.5*100)/50 = 25%

Quick calculations – determine 1% by dividing by 100 – multiply by n to account for decimals:

(1/100) *50 = 0.5

Multiply 1% (0.5) by desired number
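The same arithmetic can be checked in R. This is only a sketch; the variable names are assumptions, and the numbers are the ones used in the example above:

```r
n <- 50            # total respondents, from the example
purchased <- 12.5  # value used in the example

# Proportion expressed as a percentage
pct <- (purchased * 100) / n

# Quick method: find what 1% of n is, then multiply by the desired percentage
one_percent <- (1 / 100) * n
count_for_25_pct <- one_percent * 25

print(pct)
print(one_percent)
print(count_for_25_pct)
```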

Index example

An index is a measure with a flexible standard or base that allows you to compare one value to another

Primary uses are to determine percent increase or decrease

Example: Tuition in 1980 was $500 per term; tuition in 1982 was $550 per term

Making 1980 the base value of 100, the 1982 value is 10% more so reflects an index of 110. This can be reported as a 10% increase or as an index value of 110.

An index change is merely the difference = + 10 index points

Indexes are always relative to the base value and are not necessarily the same as a percentage change
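A small R sketch of the tuition index. The 1980 and 1982 values come from the example; `later_value` (605) is a hypothetical third value, not from the source, added only to show that index points and percent change can differ:

```r
base_value  <- 500   # tuition in 1980 (base year, index = 100)
new_value   <- 550   # tuition in 1982
later_value <- 605   # HYPOTHETICAL later tuition, for illustration only

# Index = value relative to the base, scaled so the base is 100
index_1982  <- new_value * 100 / base_value
index_later <- later_value * 100 / base_value

# Change from base year: index points and percent happen to match
index_change_1982 <- index_1982 - 100
pct_change_1982   <- (new_value - base_value) * 100 / base_value

# Change between two non-base years: index points and percent differ
index_change_later <- index_later - index_1982
pct_change_later   <- (later_value - new_value) * 100 / new_value

print(c(index_1982, index_change_1982, pct_change_1982))
print(c(index_later, index_change_later, pct_change_later))
```

In this sketch the move from 1982 to the hypothetical later year is 11 index points but a 10% change, because index points are always measured against the base value.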

Source for proportion and index examples

https://www150.statcan.gc.ca/n1/pub/11-533-x/using-utiliser/4072258-eng.htm

R and R studio

Download before class

Work through example problems on Wed

Will turn something in for participation credit

No computer? take good notes

Exam debrief

M/C 4 is just wrong. All received credit. I believe it should have been “fail to reject” but other answers were close

HCD versus Scientific Method – two differences means two pairs of differences, not A about X and B about Y

Banned – lots of bads and no goods – will people just be passive and sit by? The point of a brainstorm is to look at problems but also consider work arounds. Will people just let the piles of garbage grow?

Ex: The Duxbury Dump

Is it worth spending time to try for a higher score?

R – basic info

Open source, open access, sustained by international users and contributors community

Basic package plus growing add ons

Multiple options for most functions/procedures

Can expand as needed rather than wait for updates

“Reads” data from a designated location

Can enter data

.csv is preferred file type – looks like Excel, saved with .csv suffix
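A minimal sketch of reading a .csv file, assuming made-up data written to a temporary file so the example runs anywhere; the column names (`id`, `group`, `score`) are hypothetical:

```r
# Made-up data written to a temporary .csv so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,A,12",
             "2,B,15",
             "3,A,9"), csv_path)

# "Read" the data from its location into a data frame
dat <- read.csv(csv_path)

head(dat)   # shows the first rows of the data
str(dat)    # shows the structure: variable names and types

# Ask R to treat group as a categorical variable (a factor)
dat$group <- as.factor(dat$group)
levels(dat$group)
```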

R

Code – equations, formulae, directions

Click programs use code, too, behind the scenes

Excel for stats – not as intuitive (working in cells)

SAS, SPSS, Stata – license fee, periodic updates, not freely expandable

Fully functional SAS is Windows only

Most modern datasets are amenable to R

Python is increasingly popular

R – basic processes

Using a script document

Basic math

Setting a working directory

Create a group

Read in data

Friday –

Install package

Basic graph

Modifying script
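Several of the steps above might look like this in a script. This is a sketch with made-up numbers, not the code used in class; `hours` and `score` are hypothetical names:

```r
# Lines starting with # are comments; R ignores them

# Basic math
2 + 3 * 4

# Working directory: where R looks for and saves files
old_dir <- getwd()   # check the current working directory
setwd(tempdir())     # a temporary folder, used here only for illustration
setwd(old_dir)       # switch back

# Create a group (vector) of values with c(); <- stores it under a name
hours <- c(2, 4, 6, 8)
score <- c(55, 65, 80, 90)

# Basic graph using base R
plot(hours, score, main = "Score by hours studied (made-up data)")
```

To modify the script, change a value or a title, then re-run the lines that depend on it.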

Recommended R packages

“ggplot2” or “tidyverse”

https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

https://cran.r-project.org/web/packages/tidyverse/tidyverse.pdf

To do for participation points

Directions at end of 10/29/2021 code

R Studio

Most people who use R Studio – an interface for R – prefer it.

More point and click options

Easier to access data and other files

Help readily available

Difficult to switch back to R once you have downloaded R studio – opening any files will default to opening R Studio, not R

**Exam 3**

**Questions on this exam come from class meetings, readings, a guest lecture and software demo and practice opportunities**

**Part 3 Longer Response – 4 questions; 6 points each. Please read the instructions carefully. Partial credit is available for partially correct responses.**

1. Use the screenshot below to respond to the following:

a. What does # mean in front of a line of code?

b. What does the command “library” do?

c. What is the point of the word “score” before the assignment (<-) symbol?

2. Use the screenshot below to respond to the following:

a. What does the code “head” show?

b. What does the code “str” show?

c. In the line of code with the command “as.factor” – what is R being asked to do?

3. Review the screenshots of Mendeley Cite for the next series of questions:

a) Refer to the screenshot above. What needs to be done to correct the placement of these two references in the text of a paper or article?

b) What citation style is shown in the screenshot above (in text and reference list)?

c) Which citation style is shown in the screenshot below (in text and reference list)?

d) Identify one citation error in the screenshot associated with item b

e) Identify one citation error in the screenshot associated with item c.

4. The following questions are all about the screenshot below from a library search:

a. Which databases were searched?

b. What is the best way to keep track of specific results I want to be able to look at later on this website?

c. What is one thing that is probably incorrect about the way these search terms were set up?

d. Describe two ways that are consistent with the general aims of academic literature searching, that would reduce the number of results (note correcting the incorrect thing you identified in item c does not count as one of the two ways).

Nicholas J. Horton

Randall Pruim

Daniel T. Kaplan

A Student’s Guide to R

Project MOSAIC

2 horton, kaplan, pruim

Copyright (c) 2015 by Nicholas J. Horton, Randall Pruim, & Daniel Kaplan.

Edition 1.2, November 2015

This material is copyrighted by the authors under a Creative Commons Attribution 3.0 Unported License. You are free to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work) if you attribute our work. More detailed information about the licensing is available at this web page: http://www.mosaic-web.org/go/teachingRlicense.html.

Cover Photo: Maya Hanna.

Contents

1 Introduction
2 Getting Started with RStudio
3 One Quantitative Variable
4 One Categorical Variable
5 Two Quantitative Variables
6 Two Categorical Variables
7 Quantitative Response, Categorical Predictor
8 Categorical Response, Quantitative Predictor
9 Survival Time Outcomes
10 More than Two Variables
11 Probability Distributions & Random Variables
12 Power Calculations
13 Data Management
14 Health Evaluation (HELP) Study
15 Exercises and Problems
16 Bibliography
17 Index

About These Notes

We present an approach to teaching introductory and intermediate statistics courses that is tightly coupled with computing generally and with R and RStudio in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling based inference, and multivariate graphical techniques. A secondary goal is to facilitate computing with data through use of small simulation studies and appropriate statistical analysis workflow. This follows the philosophy outlined by Nolan and Temple Lang [1]. The importance of modern computation in statistics education is a principal component of the recently adopted American Statistical Association’s curriculum guidelines [2].

[1] D. Nolan and D. Temple Lang. Computing in the statistics curriculum. The American Statistician, 64(2):97–107, 2010.

[2] ASA Undergraduate Guidelines Workgroup. 2014 curriculum guidelines for undergraduate programs in statistical science. Technical report, American Statistical Association, November 2014. http://www.amstat.org/education/curriculumguidelines.cfm

Throughout this book (and its companion volumes), we introduce multiple activities, some appropriate for an introductory course, others suitable for higher levels, that demonstrate key concepts in statistics and modeling while also supporting the core material of more traditional courses.

A Work in Progress

These materials were developed for a workshop entitled Teaching Statistics Using R prior to the 2011 United States Conference on Teaching Statistics and revised for USCOTS 2011, USCOTS 2013, eCOTS 2014, ICOTS 9, and USCOTS 2015. We organized these workshops to help instructors integrate R (as well as some related technologies) into statistics courses at all levels. We received great feedback and many wonderful ideas from the participants and those that we’ve shared this with since the workshops.

Caution! Despite our best efforts, you WILL find bugs both in this document and in our code. Please let us know when you encounter them so we can call in the exterminators.


Consider these notes to be a work in progress. We appreciate any feedback you are willing to share as we continue to work on these materials and the accompanying mosaic package. Drop us an email at [email protected] with any comments, suggestions, corrections, etc. Updated versions will be posted at http://mosaic-web.org.

Two Audiences

We initially developed these materials for instructors of statistics at the college or university level. Another audience is the students these instructors teach. Some of the sections, examples, and exercises are written with one or the other of these audiences more clearly at the forefront. This means that

1. Some of the materials can be used essentially as is with students.

2. Some of the materials aim to equip instructors to develop their own expertise in R and RStudio to develop their own teaching materials.

Although the distinction can get blurry, and what works “as is” in one setting may not work “as is” in another, we’ll try to indicate which parts fit into each category as we go along.

R, RStudio and R Packages

R can be obtained from http://cran.r-project.org/. Download and installation are quite straightforward for Mac, PC, or Linux machines.

RStudio is an integrated development environment (IDE) that facilitates use of R for both novice and expert users. We have adopted it as our standard teaching environment because it dramatically simplifies the use of R for instructors and for students. RStudio can be installed as a desktop (laptop) application or as a server application that is accessible to users via the Internet.

More Info: Several things we use can be done only in RStudio (for instance manipulate() or RStudio’s integrated support for reproducible research).

The RStudio server version works well with starting students. All they need is a web browser, avoiding any potential problems with oddities of students’ individual computers.

In addition to R and RStudio, we will make use of several packages that need to be installed and loaded separately. The mosaic package (and its dependencies) will be used throughout. Other packages appear from time to time as well.

Marginal Notes

Marginal notes appear here and there. Sometimes these are side comments that we wanted to say, but we didn’t want to interrupt the flow to mention them in the main text. Others provide teaching tips or caution about traps, pitfalls and gotchas.

Have a great suggestion for a marginal note? Pass it along.

What’s Ours Is Yours – To a Point

This material is copyrighted by the authors under a Creative Commons Attribution 3.0 Unported License. You are free to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work) if you attribute our work. More detailed information about the licensing is available at this web page: http://www.mosaic-web.org/go/teachingRlicense.html.

Digging Deeper: If you know LaTeX as well as R, then knitr provides a nice solution for mixing the two. We used this system to produce this book. We also use it for our own research and to introduce upper level students to reproducible analysis methods. For beginners, we introduce knitr with RMarkdown, which produces PDF, HTML, or Word files using a simpler syntax.

Document Creation

This document was created on November 15, 2015, using

• knitr, version 1.11

• mosaic, version 0.12.9003

• mosaicData, version 0.12.9003

• R version 3.2.2 (2015-08-14)

Inevitably, each of these will be updated from time to time. If you find that things look different on your computer, make sure that your version of R and your packages are up to date and check for a newer version of this document.

Project MOSAIC

This book is a product of Project MOSAIC, a community of educators working to develop new ways to introduce mathematics, statistics, computation, and modeling to students in colleges and universities.

The goal of the MOSAIC project is to help share ideas and resources to improve teaching, and to develop a curricular and assessment infrastructure to support the dissemination and evaluation of these approaches. Our goal is to provide a broader approach to quantitative studies that provides better support for work in science and technology. The project highlights and integrates diverse aspects of quantitative work that students in science, technology, and engineering will need in their professional lives, but which are today usually taught in isolation, if at all.

In particular, we focus on:

Modeling: The ability to create, manipulate and investigate useful and informative mathematical representations of real-world situations.

Statistics: The analysis of variability that draws on our ability to quantify uncertainty and to draw logical inferences from observations and experiment.

Computation: The capacity to think algorithmically, to manage data on large scales, to visualize and interact with models, and to automate tasks for efficiency, accuracy, and reproducibility.

Calculus: The traditional mathematical entry point for college and university students and a subject that still has the potential to provide important insights to today’s students.


Drawing on support from the US National Science Foundation (NSF DUE-0920350), Project MOSAIC supports a number of initiatives to help achieve these goals, including:

Faculty development and training opportunities, such as the USCOTS 2011, USCOTS 2013, eCOTS 2014, and ICOTS 9 workshops on Teaching Statistics Using R and RStudio, our 2010 Project MOSAIC kickoff workshop at the Institute for Mathematics and its Applications, and our Modeling: Early and Often in Undergraduate Calculus AMS PREP workshops offered in 2012, 2013, and 2015.

M-casts, a series of regularly scheduled webinars, delivered via the Internet, that provide a forum for instructors to share their insights and innovations and to develop collaborations to refine and develop them. Recordings of M-casts are available at the Project MOSAIC web site, http://mosaic-web.org.

The construction of syllabi and materials for courses that teach MOSAIC topics in a better integrated way. Such courses and materials might be wholly new constructions, or they might be incremental modifications of existing resources that draw on the connections between the MOSAIC topics.

More details can be found at http://www.mosaic-web.org. We welcome and encourage your participation in all of these initiatives.

Computational Statistics

There are at least two ways in which statistical software can be introduced into a statistics course. In the first approach, the course is taught essentially as it was before the introduction of statistical software, but using a computer to speed up some of the calculations and to prepare higher quality graphical displays. Perhaps the size of the data sets will also be increased. We will refer to this approach as statistical computation since the computer serves primarily as a computational tool to replace pencil-and-paper calculations and drawing plots manually.

In the second approach, more fundamental changes in the course result from the introduction of the computer. Some new topics are covered, some old topics are omitted. Some old topics are treated in very different ways, and perhaps at different points in the course. We will refer to this approach as computational statistics because the availability of computation is shaping how statistics is done and taught. Computational statistics is a key component of data science, defined as the ability to use data to answer questions and communicate those results.

Students need to see aspects of computation and data science early and often to develop deeper skills. Establishing precursors in introductory courses helps them get started.

In practice, most courses will incorporate elements of both statistical computation and computational statistics, but the relative proportions may differ dramatically from course to course. Where on the spectrum a course lies will depend on many factors including the goals of the course, the availability of technology for student use, the perspective of the text book used, and the comfort-level of the instructor with both statistics and computation.

Among the various statistical software packages available, R is becoming increasingly popular. The recent addition of RStudio has made R both more powerful and more accessible. Because R and RStudio are free, they have become widely used in research and industry. Training in R and RStudio is often seen as an important additional skill that a statistics course can develop. Furthermore, an increasing number of instructors are using R for their own statistical work, so it is natural for them to use it in their teaching as well. At the same time, the development of R and of RStudio (an optional interface and integrated development environment for R) are making it easier and easier to get started with R.

We developed the mosaic R package (available on CRAN) to make certain aspects of statistical computation and computational statistics simpler for beginners, without limiting their ability to use more advanced features of the language. The mosaic package includes a modelling approach that uses the same general syntax to calculate descriptive statistics, create graphics, and fit linear models.

Information about the mosaic package, including vignettes demonstrating features and supplementary materials (such as this book) can be found at https://cran.r-project.org/web/packages/mosaic.

1

Introduction

In this reference book, we briefly review the commands and functions needed to analyze data from introductory and second courses in statistics. This is intended to complement the Start Teaching with R and Start Modeling with R books.

Most of our examples will use data from the HELP (Health Evaluation and Linkage to Primary Care) study: a randomized clinical trial of a novel way to link at-risk subjects with primary care. More information on the dataset can be found in chapter 14.

Since the selection and order of topics can vary greatly from textbook to textbook and instructor to instructor, we have chosen to organize this material by the kind of data being analyzed. This should make it straightforward to find what you are looking for. Some data management skills are needed by students [1]. A basic introduction to key idioms is provided in Chapter 13.

[1] N.J. Horton, B.S. Baumer, and H. Wickham. Setting the stage for data science: integration of data management skills in introductory and second courses in statistics (http://arxiv.org/abs/1401.3269). CHANCE, 28(2):40–50, 2015.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses, and the mosaicData package which includes a number of data sets. A short summary of the R commands needed to teach introductory statistics can be found in the mosaic package vignette: https://cran.r-project.org/web/packages/mosaic.

Other related resources from Project MOSAIC may be helpful, including an annotated set of examples from the sixth edition of Moore, McCabe and Craig’s Introduction to the Practice of Statistics [2] (see http://www.amherst.edu/~nhorton/ips6e), the second and third editions of the Statistical Sleuth [3] (see http://www.amherst.edu/~nhorton/sleuth), and Statistics: Unlocking the Power of Data by Lock et al (see https://github.com/rpruim/Lock5withR).

[2] D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics. W.H.Freeman and Company, 6th edition, 2007.

[3] F. Ramsey and D. Schafer. Statistical Sleuth: A Course in Methods of Data Analysis. Cengage, 2nd edition, 2002.

To use a package within R, it must be installed (one time), and loaded (each session). The mosaic package can be installed using the following command:

> install.packages("mosaic") # note the quotation marks

The # character is a comment in R, and all text after that on the current line is ignored.

RStudio features a simplified package installation tab (in the bottom right panel).

Once the package is installed (one time only), it can be

loaded by running the command:

> require(mosaic)
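One common pattern, sketched here as an assumption rather than as the authors’ method, is to install a package only when it is missing and then load it. The helper name `load_or_install` is hypothetical. It is demonstrated with the stats package, which ships with R, so the sketch runs without a network connection; substituting "mosaic" would follow the same steps but needs internet access and a configured CRAN mirror on first use.

```r
# Hypothetical helper: install a package only if missing, then load it
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE (rather than stopping) if pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # one-time step
  }
  # library() stops with an error if loading fails
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded here;
# replace with "mosaic" to follow the text
load_or_install("stats")
```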

The RMarkdown system provides a simple markup language and renders the results in PDF, Word, or HTML. This allows students to undertake their analyses using a workflow that facilitates “reproducibility” and avoids cut and paste errors.

The knitr/LaTeX system allows experienced users to combine R and LaTeX in the same document. The reward for learning this more complicated system is much finer control over the output format. But RMarkdown is much easier to learn and is adequate even for professional-level work.

Using Markdown or knitr/LaTeX requires that the markdown package be installed.

We typically introduce students to RMarkdown very early, requiring students to use it for assignments and reports [4].

[4] B.S. Baumer, M. Çetinkaya-Rundel, A. Bray, L. Loi, and N. J. Horton. R Markdown: Integrating a reproducible analysis tool into introductory statistics. Technology Innovations in Statistics Education, 8(1):281–283, 2014.

2

Getting Started with RStudio

A series of getting started videos is available at http://www.amherst.edu/~nhorton/rstudio.

RStudio is an integrated development environment (IDE) for R that provides an alternative interface to R that has several advantages over the default R interfaces:

• RStudio runs on Mac, PC, and Linux machines and provides a simplified interface that looks and feels identical on all of them.

The default interfaces for R are quite different on the various platforms. This is a distractor for students and adds an extra layer of support responsibility for the instructor.

• RStudio can run in a web browser.

In addition to stand-alone desktop versions, RStudio can be set up as a server application that is accessed via the internet.

The web interface is nearly identical to the desktop version. As with other web services, users log in to access their account. If students log out and log in again later, even on a different machine, their session is restored and they can resume their analysis right where they left off. With a little advanced set up, instructors can save the history of their classroom R use and students can load those history files into their own environment.

Caution! The desktop and server version of RStudio are so similar that if you run them both, you will have to pay careful attention to make sure you are working in the one you intend to be working in.

Note: Using RStudio in a browser is like Facebook for statistics. Each time the user returns, the previous session is restored and they can resume work where they left off. Users can log in from any device with internet access.

• RStudio provides support for reproducible research.

RStudio makes it easy to include text, statistical analysis (R code and R output), and graphical displays all in the same document. The RMarkdown system provides a simple markup language and renders the results in HTML. The knitr/LaTeX system allows users to combine R and LaTeX in the same document. The reward for learning this more complicated system is much finer control over the output format. Depending on the level of the course, students can use either of these for homework and projects.

To use Markdown or knitr/LaTeX requires that the knitr package be installed on your system.

• RStudio provides integrated support for editing and executing R code and documents.

• RStudio provides some useful functionality via a graphical user interface.

RStudio is not a GUI for R, but it does provide a GUI that simplifies things like installing and updating packages; monitoring, saving and loading environments; importing and exporting data; browsing and exporting graphics; and browsing files and documentation.

• RStudio provides access to the manipulate package.

The manipulate package provides a way to create simple interactive graphical applications quickly and easily.

While one can certainly use R without using RStudio,

RStudio makes a number of things easier and we highly

recommend using RStudio. Furthermore, since RStudio is

in active development, we fully expect more useful fea-

tures in the future.

We primarily use an online version of RStudio. RStudio

is an innovative and powerful interface to R that runs in a

web browser or on your local machine. Running in the

browser has the advantage that you don’t have to install

or configure anything. Just log in and you are good to go.

Furthermore, RStudio will “remember” what you were

doing so that each time you log in (even on a different

machine) you can pick up right where you left off. This

is “R in the cloud” and works a bit like Google Docs or

Facebook for R.

R can also be obtained from http://cran.r-project.org/.

Download and installation are pretty straightforward

for Mac, PC, or Linux machines. RStudio is available

from http://www.rstudio.org/.

a student’s guide to r 17

2.1 Connecting to an RStudio server

RStudio servers have been set up at a number of schools to

facilitate cloud-based computing.

Note: RStudio servers have been installed at many institutions. More details about (free) academic licenses for RStudio Server Pro as well as setup instructions can be found at http://www.rstudio.com/resources/faqs under the Academic tab.

Once you connect to the server, you should see a login

screen:

Note: The RStudio server doesn’t tend to work well with Internet Explorer.

Once you authenticate, you should see the RStudio

interface:

Notice that RStudio divides its world into four panels.

Several of the panels are further subdivided into multiple

tabs. Which tabs appear in which panels can be customized

by the user.

R can do much more than a simple calculator, and we

will introduce additional features in due time. But per-

forming simple calculations in R is a good way to begin

learning the features of RStudio.

Commands entered in the Console tab are immediately

executed by R. A good way to familiarize yourself with

the console is to do some simple calculator-like compu-

tations. Most of this will work just like you would expect

from a typical calculator. Try typing the following com-

mands in the console panel.

> 5 + 3

[1] 8

> 15.3 * 23.4

[1] 358.02

> sqrt(16) # square root

[1] 4

This last example demonstrates how functions are

called within R as well as the use of comments. Com-

ments are prefaced with the # character. Comments can

be very helpful when writing scripts with multiple com-

mands or to annotate example code for your students.

You can save values to named variables for later reuse.

Note: It’s probably best to settle on using one or the other of the right-to-left assignment operators rather than to switch back and forth. We prefer the arrow operator because it represents visually what is happening in an assignment and because it makes a clear distinction between the assignment operator, the use of = to provide values to arguments of functions, and the use of == to test for equality.

> product = 15.3 * 23.4 # save result

> product # display the result

[1] 358.02

> product <- 15.3 * 23.4 # <- can be used instead of =

> product

[1] 358.02
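The three symbols discussed above (assignment, argument values, and equality tests) can be seen side by side in a short sketch. The name rounded is hypothetical and chosen only for illustration; round() and its digits argument are part of base R.

```r
# Illustrative sketch; `rounded` is a hypothetical name, not from the text.
product <- 15.3 * 23.4                 # <- assigns the result to the name product
rounded <- round(product, digits = 0)  # =  supplies a value to the argument digits
rounded == 358                         # == asks whether rounded equals 358
```

Reading the arrow as "gets" (product gets 15.3 times 23.4) may help keep assignment distinct from the equality test on the last line.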

Once variables are defined, they can be referenced in

other operations and functions.


> 0.5 * product # half of the product

[1] 179.01

> log(product) # (natural) log of the product

[1] 5.880589

> log10(product) # base 10 log of the product

[1] 2.553907

> log2(product) # base 2 log of the product

[1] 8.483896

> log(product, base=2) # base 2 log of the product, another way

[1] 8.483896

The semi-colon can be used to place multiple com-

mands on one line. One frequent use of this is to save and

print a value all in one go:

> product <- 15.3 * 23.4; product # save result and show it

[1] 358.02

2.1.1 Version information

At times it may be useful to check what version of the

mosaic package, R, and RStudio you are using. Running

sessionInfo() will display information about the version

of R and packages that are loaded and RStudio.Version()

will provide information about the version of RStudio.

> sessionInfo()

R version 3.2.2 (2015-08-14)

Platform: x86_64-apple-darwin13.4.0 (64-bit)

Running under: OS X 10.10.5 (Yosemite)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8


attached base packages:

[1] grid stats graphics grDevices utils datasets

[7] methods base

other attached packages:

[1] mosaic_0.12.9003 mosaicData_0.9.9001 car_2.1-0

[4] ggplot2_1.0.1 dplyr_0.4.3 lattice_0.20-33

[7] knitr_1.11

loaded via a namespace (and not attached):

[1] Rcpp_0.12.1 magrittr_1.5 splines_3.2.2

[4] MASS_7.3-45 munsell_0.4.2 colorspace_1.2-6

[7] R6_2.1.1 ggdendro_0.1-17 minqa_1.2.4

[10] highr_0.5.1 stringr_1.0.0 plyr_1.8.3

[13] tools_3.2.2 nnet_7.3-11 parallel_3.2.2

[16] pbkrtest_0.4-2 nlme_3.1-122 gtable_0.1.2

[19] mgcv_1.8-9 quantreg_5.19 DBI_0.3.1

[22] MatrixModels_0.4-1 lme4_1.1-10 assertthat_0.1

[25] digest_0.6.8 Matrix_1.2-2 gridExtra_2.0.0

[28] nloptr_1.0.4 reshape2_1.4.1 formatR_1.2.1

[31] evaluate_0.8 stringi_1.0-1 scales_0.3.0

[34] SparseM_1.7 proto_0.3-10
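When only one or two version numbers are needed, a shorter check may be enough. The sketch below uses base R and utils functions only; the minimum version "3.0.0" is an arbitrary value picked for illustration, and the stats package is queried simply because it ships with every R installation.

```r
# Sketch: query individual versions rather than the full sessionInfo() report.
R.version.string                          # one-line description of the running R
r_ok <- getRversion() >= "3.0.0"          # compare against an (arbitrary) minimum
stats_version <- packageVersion("stats")  # version of one installed package
r_ok
stats_version
```

The same packageVersion() call can be pointed at any installed package, such as mosaic or knitr, but it signals an error if the package is not installed.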

2.2 Working with Files

2.2.1 Working with R Script Files

As an alternative, R commands can be stored in a file.

RStudio provides an integrated editor for editing these

files and facilitates executing some or all of the com-

mands. To create a file, select File, then New File, then R

Script from the RStudio menu. A file editor tab will open

in the Source panel. R code can be entered here, and but-

tons and menu items are provided to run all the code

(called sourcing the file) or to run the code on a single

line or in a selected section of the file.
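Sourcing can also be done from code rather than from the menu. The following is a minimal sketch, assuming a two-line script written to a temporary file purely for illustration; in practice the file would be one saved from the RStudio editor.

```r
# Hypothetical script contents; any R commands could go here.
script_lines <- c(
  "product <- 15.3 * 23.4",
  "half <- 0.5 * product"
)
script_path <- tempfile(fileext = ".R")  # a throwaway file for this example
writeLines(script_lines, script_path)

# source() runs every command in the file, in order.
# local = TRUE evaluates them in the environment that calls source().
source(script_path, local = TRUE)
half
```

After sourcing, the variables created by the script (product and half) are available just as if the commands had been typed in the console.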

2.2.2 Working with RMarkdown and knitr/LaTeX

A third alternative is to take advantage of RStudio’s support

for reproducible research. If you already know LaTeX,

you will want to investigate the knitr/LaTeX capabilities.

For those who do not already know LaTeX, the simpler

RMarkdown system provides an easy entry into the

world of reproducible research methods. It also provides

a good facility for students to create homework and re-

ports that include text, R code, R output, and graphics.

To create a new RMarkdown file, select File, then New

File, then RMarkdown. The file will be opened with a short

template document that illustrates the markup language.

The mosaic package includes two useful RMarkdown

templates for getting started: fancy includes bells and

whistles (and is intended to give an overview of features),

while plain is useful as a starting point for a new analy-

sis. These are accessed using the Template option when

creating a new RMarkdown file.


Click on the Knit button to convert to an HTML, PDF,

or Word file.

This will generate a formatted version of the docu-

ment.


There is a button (marked with a question mark)

which provides a brief description of the supported markup

commands. The RStudio web site includes more extensive

tutorials on using RMarkdown.

Caution! RMarkdown and knitr/LaTeX files do not have access to the console environment, so the code in them must be self-contained.

It is important to remember that unlike R scripts,

which are executed in the console and have access to

the console environment, RMarkdown and knitr/LaTeX

files do not have access to the console environment. This

is a good feature because it forces the files to be self-

contained, which makes them transferable and respects

good reproducible research practices. But beginners, es-

pecially if they adopt a strategy of trying things out in the

console and copying and pasting successful code from the

console to their file, will often create files that are …

simpleR – Using R for Introductory Statistics

John Verzani


Preface

These notes are an introduction to using the statistical software package R for an introductory statistics course.

They are meant to accompany an introductory statistics book such as Kitchens’ “Exploring Statistics”. The goals

are not to show all the features of R, or to replace a standard textbook, but rather to be used with a textbook to

illustrate the features of R that can be learned in a one-semester, introductory statistics course.

These notes were written to take advantage of R version 1.5.0 or later. For pedagogical reasons the equals sign,

=, is used as an assignment operator and not the traditional arrow combination <-. This was added to R in version

1.4.0. If only an older version is available the reader will have to make the minor adjustment.

There are several references to data and functions in this text that need to be installed prior to their use. To

install the data is easy, but the instructions vary depending on your system. For Windows users, you need to

download the “zip” file, and then install from the “packages” menu. In UNIX, one uses the command R CMD

INSTALL packagename.tar.gz. Some of the datasets are borrowed from other authors, notably Kitchens. Credit is

given in the help files for the datasets. This material is available as an R package from:

http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.4.zip for Windows users.

http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple_0.4.tar.gz for UNIX users.

If necessary, the file can be sent in an email. As well, the individual data sets can be found online in the directory

http://www.math.csi.cuny.edu/Statistics/R/simpleR/Simple.

This is version 0.4 of these notes, which were last generated on August 22, 2002. Before printing these notes, you

should check for the most recent version available from

the CSI Math department (http://www.math.csi.cuny.edu/Statistics/R/simpleR).

Copyright © John Verzani ([email protected]), 2001-2. All rights reserved.

Contents

Introduction
  What is R
  A note on notation

Data
  Starting R
  Entering data with c
  Data is a vector
  Problems

Univariate Data
  Categorical data
  Numerical data
  Problems

Bivariate Data
  Handling bivariate categorical data
  Handling bivariate data: categorical vs. numerical
  Bivariate data: numerical vs. numerical
  Linear regression
  Problems

Multivariate Data
  Storing multivariate data in data frames
  Accessing data in data frames
  Manipulating data frames: stack and unstack
  Using R’s model formula notation
  Ways to view multivariate data
  The lattice package
  Problems

Random Data
  Random number generators in R – the “r” functions
  Problems

Simulations
  The central limit theorem
  Using simple.sim and functions
  Problems

Exploratory Data Analysis
  Our toolbox
  Examples
  Problems

Confidence Interval Estimation
  Population proportion theory
  Proportion test
  The z-test
  The t-test
  Confidence interval for the median
  Problems

Hypothesis Testing
  Testing a population parameter
  Testing a mean
  Tests for the median
  Problems

Two-sample tests
  Two-sample tests of proportion
  Two-sample t-tests
  Resistant two-sample tests
  Problems

Chi Square Tests
  The chi-squared distribution
  Chi-squared goodness of fit tests
  Chi-squared tests of independence
  Chi-squared tests for homogeneity
  Problems

Regression Analysis
  Simple linear regression model
  Testing the assumptions of the model
  Statistical inference
  Problems

Multiple Linear Regression
  The model
  Problems

Analysis of Variance
  One-way analysis of variance
  Problems

Appendix: Installing R

Appendix: External Packages

Appendix: A sample R session
  A sample session involving regression
  t-tests
  A simulation example

Appendix: What happens when R starts?

Appendix: Using Functions
  The basic template
  For loops
  Conditional expressions

Appendix: Entering Data into R
  Using c
  Using scan
  Using scan with a file
  Editing your data
  Reading in tables of data
  Fixed-width fields
  Spreadsheet data
  XML, urls
  “Foreign” formats

Appendix: Teaching Tricks

Appendix: Sources of help, documentation


Section 1: Introduction

What is R

These notes describe how to use R while learning introductory statistics. The purpose is to allow this fine software

to be used in “lower-level” courses where often MINITAB, SPSS, Excel, etc. are used. It is expected that the reader

has had at least a pre-calculus course. The hope is that students shown how to use R at this early level will better

understand the statistical issues and will ultimately benefit from the more sophisticated program despite its steeper

“learning curve”.

The benefits of R for an introductory student are

• R is free. R is open-source and runs on UNIX, Windows and Macintosh.

• R has an excellent built-in help system.

• R has excellent graphing capabilities.

• Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.

• R’s language has a powerful, easy to learn syntax with many built-in statistical functions.

• The language is easy to extend with user-written functions.

• R is a computer programming language. For programmers it will feel more familiar than others and for new

computer users, the next leap to programming will not be so large.

What is R lacking compared to other software solutions?

• It has a limited graphical interface (S-Plus has a good one). This means it can be harder to learn at the outset.

• There is no commercial support. (Although one can argue the international mailing list is even better.)

• The command language is a programming language so students must learn to appreciate syntax issues etc.

R is an open-source (GPL) statistical environment modeled after S and S-Plus (http://www.insightful.com).

The S language was developed in the late 1980s at AT&T labs. The R project was started by Robert Gentleman and

Ross Ihaka of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread

audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer

developers. The R project web page

http://www.r-project.org

is the main site for information on R. At this site are directions for obtaining the software, accompanying packages

and other sources of documentation.

A note on notation

A few typographical conventions are used in these notes. These include different fonts for urls, R commands,

dataset names and different typesetting for

longer sequences of R commands.

and for

Data sets.

Section 2: Data

Statistics is the study of data. After learning how to start R, the first thing we need to be able to do is learn how

to enter data into R and how to manipulate the data once there.

Starting R


R is most easily used in an interactive manner. You ask it a question and R gives you an answer. Questions are

asked and answered on the command line. To start up R’s command line you can do the following: in Windows, find

the R icon and double-click; on Unix, type R at the command line. Other operating systems may have different

ways. Once R is started, you should be greeted with a message similar to

R : Copyright 2001, The R Development Core Team

Version 1.4.0 (2001-12-19)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type ‘license()’ or ‘licence()’ for distribution details.

R is a collaborative project with many contributors.

Type ‘contributors()’ for more information.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or

‘help.start()’ for a HTML browser interface to help.

Type ‘q()’ to quit R.

[Previously saved workspace restored]

>

The > is called the prompt. In what follows below it is not typed, but is used to indicate where you are to type if

you follow the examples. If a command is too long to fit on a line, a + is used for the continuation prompt.

Entering data with c

The most useful R command for quickly entering in small data sets is the c function. This function combines, or

concatenates terms together. As an example, suppose we have the following count of the number of typos per page

of these notes:

2 3 0 3 1 0 0 1

To enter this into an R session we do so with

> typos = c(2,3,0,3,1,0,0,1)

> typos

[1] 2 3 0 3 1 0 0 1

Notice a few things

• We assigned the values to a variable called typos

• The assignment operator is a =. This is valid as of R version 1.4.0. Previously it was (and still can be) a <-.

Both will be used, although, you should learn one and stick with it.

• The value of the typos doesn’t automatically print out. It does when we type just the name though as the last

input line indicates

• The value of typos is prefaced with a funny looking [1]. This is the index of the first value shown on that line, a sign that the value is a vector. More on

that later.
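Because c combines its arguments, it accepts existing vectors as well as single numbers. A small sketch, using the hypothetical names first.half, second.half and rebuilt, shows the typos data reassembled from two pieces:

```r
# Sketch: c() flattens all of its arguments into a single vector.
typos = c(2, 3, 0, 3, 1, 0, 0, 1)
first.half = c(2, 3, 0, 3)             # pages 1 to 4
second.half = c(1, 0, 0, 1)            # pages 5 to 8
rebuilt = c(first.half, second.half)   # one vector of 8 values, in order
identical(rebuilt, typos)              # are the two vectors the same?
length(rebuilt)                        # number of values stored
```

The result is a flat vector of eight values rather than a pair of four-value pieces, which is why the comparison with typos succeeds.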

Typing less

For many implementations of R you can save yourself a lot of typing if you learn that the arrow keys can be used

to retrieve your previous commands. In particular, each command is stored in a history and the up arrow will traverse

backwards along this history and the down arrow forwards. The left and right arrow keys will work as expected. This

combined with a mouse can make it quite easy to do simple editing of your previous commands.

Applying a function

R comes with many built in functions that one can apply to data such as typos. One of them is the mean function

for finding the mean or average of the data. To use it is easy


> mean(typos)

[1] 1.25

As well, we could call the median, or var to find the median or sample variance. The syntax is the same – the

function name followed by parentheses to contain the argument(s):

> median(typos)

[1] 1

> var(typos)

[1] 1.642857
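To see what var is computing, the sample variance can be rebuilt from its definition: the sum of squared deviations from the mean, divided by one less than the number of values. This is a sketch with hypothetical names (n, deviations, by.hand); it relies on the vector arithmetic described in the next section.

```r
# Sketch: sample variance from its definition, compared with var().
typos = c(2, 3, 0, 3, 1, 0, 0, 1)
n = length(typos)                       # number of pages
deviations = typos - mean(typos)        # distance of each page from the mean
by.hand = sum(deviations^2) / (n - 1)   # divide by n - 1, not n
by.hand
all.equal(by.hand, var(typos))          # compare with the built-in function
```

Dividing by n - 1 rather than n is what makes this the sample variance, which is the quantity var reports.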

Data is a vector

The data is stored in R as a vector. This means simply that it keeps track of the order that the data is entered in.

In particular there is a first element, a second element up to a last element. This is a good thing for several reasons:

• Our simple data vector typos has a natural order – page 1, page 2 etc. We wouldn’t want to mix these up.

• We would like to be able to make changes to the data item by item instead of having to enter in the entire data

set again.

• Vectors are also a mathematical object. There are natural extensions of mathematical concepts such as addition

and multiplication that make it easy to work with data when they are vectors.

Let’s see how these apply to our typos example. First, suppose these are the typos for the first draft of section 1

of these notes. We might want to keep track of our various drafts as the typos change. This could be done by the

following:

> typos.draft1 = c(2,3,0,3,1,0,0,1)

> typos.draft2 = c(0,3,0,3,1,0,0,1)

That is, the two typos on the first page were fixed. Notice the two different variable names. Unlike many other

languages, the period is only used as punctuation. You can’t use an _ (underscore) to punctuate names as you might

in other programming languages so it is quite useful. 1

Now, you might say, that is a lot of work to type in the data a second time. Can’t I just tell R to change the first

page? The answer of course is “yes”. Here is how

> typos.draft1 = c(2,3,0,3,1,0,0,1)

> typos.draft2 = typos.draft1 # make a copy

> typos.draft2[1] = 0 # assign the first page 0 typos

Now notice a few things. First, the comment character, #, is used to make comments. Basically anything after the

comment character is ignored (by R, hopefully not the reader). More importantly, the assignment to the first entry

in the vector typos.draft2 is done by referencing the first entry in the vector. This is done with square brackets [].

It is important to keep this in mind: parentheses () are for functions, and square brackets [] are for vectors (and

later arrays and lists). In particular, we have the following values currently in typos.draft2

> typos.draft2 # print out the value

[1] 0 3 0 3 1 0 0 1

> typos.draft2[2] # print 2nd page’s value

[1] 3

> typos.draft2[4] # 4th page

[1] 3

> typos.draft2[-4] # all but the 4th page

[1] 0 3 0 1 0 0 1

> typos.draft2[c(1,2,3)] # fancy, print 1st, 2nd and 3rd.

[1] 0 3 0

Notice negative indices give everything except these indices. The last example is very important. You can take more

than one value at a time by using another vector of index numbers. This is called slicing.
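A few more slices, sketched with hypothetical names, show that the index vector can be a range, a set of negative indices, or indices in any order. The colon shorthand 1:3 used here is explained a little further on.

```r
# Sketch: different index vectors applied to the same data.
typos.draft2 = c(0, 3, 0, 3, 1, 0, 0, 1)
first.three = typos.draft2[1:3]      # same as typos.draft2[c(1,2,3)]
drop.two = typos.draft2[-c(1, 2)]    # everything except pages 1 and 2
reordered = typos.draft2[c(8, 1)]    # last page first, then the first page
first.three
drop.two
reordered
```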

Okay, we need to work these notes into shape, so let’s find the really bad pages. By inspection, we can see that

pages 2 and 4 are a problem. Can we do this with R in a more systematic manner?

1 The underscore was originally used as assignment, so a name such as The_Data would actually assign the value of Data to the variable

The. The underscore is being phased out and the equals sign is being phased in.


> max(typos.draft2) # what are worst pages?

[1] 3 # 3 typos per page

> typos.draft2 == 3 # Where are they?

[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE

Notice the use of double equals signs (==). This tests all the values of typos.draft2 to see if they are equal to 3.

The 2nd and 4th answer yes (TRUE), the others no (FALSE).

Think of this as asking R a question: is the value equal to 3? R answers all at once with a long vector of TRUEs

and FALSEs.

Now the question is – how can we get the indices (pages) corresponding to the TRUE values? Let’s rephrase, which

indices have 3 typos? If you guessed that the command which will work, you are on your way to R mastery:

> which(typos.draft2 == 3)

[1] 2 4

Now, what if you didn’t think of the command which? You are not out of luck – but you will need to work harder.

The basic idea is to create a new vector 1 2 3 … keeping track of the page numbers, and then slicing off just the

ones for which typos.draft2==3:

> n = length(typos.draft2) # how many pages

> pages = 1:n # how we get the page numbers

> pages # pages is simply 1 to number of pages

[1] 1 2 3 4 5 6 7 8

> pages[typos.draft2 == 3] # logical extraction. Very useful

[1] 2 4

To create the vector 1 2 3 … we used the simple : colon operator. We could have typed this in, but this is a

useful thing to know. The command a:b is simply a, a+1, a+2, …, b if a,b are integers and intuitively defined

if not. A more general R function is seq(), which takes a bit more typing. Try ?seq to see its options. To produce the

above try seq(a,b,1).
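The following sketch compares the colon operator with seq and shows what "intuitively defined" means for a non-integer start. The names ints, same, halves and non.integer are hypothetical; the by argument is part of the standard seq function.

```r
# Sketch: the colon operator next to seq().
ints = 1:8
same = seq(1, 8, 1)             # from 1 to 8 in steps of 1
all(ints == same)               # do the two agree value by value?
halves = seq(0, 2, by = 0.5)    # steps other than 1 need seq()
halves
non.integer = 1.5:4             # starts at 1.5, steps by 1, stops before passing 4
non.integer
```

With a non-integer start the colon operator still steps by 1, so 1.5:4 gives 1.5, 2.5 and 3.5 and never reaches 4.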

Extracting elements of a vector using another vector of the same size that is composed of TRUEs and

FALSEs is referred to as extraction by a logical vector. Notice this is different from extracting by page numbers

by slicing as we did before. Knowing how to use slicing and logical vectors gives you the ability to easily access your

data as you desire.
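The two styles can be put side by side. In this sketch (the names worst, by.logical, positions and by.index are hypothetical), the logical vector picks out the values directly, while which converts the same logical vector into page numbers that can then be used for slicing.

```r
# Sketch: logical extraction versus slicing by index.
typos.draft2 = c(0, 3, 0, 3, 1, 0, 0, 1)
worst = typos.draft2 == max(typos.draft2)  # logical vector, one entry per page
by.logical = typos.draft2[worst]           # the values where worst is TRUE
positions = which(worst)                   # the page numbers where worst is TRUE
by.index = typos.draft2[positions]         # the same values, reached by slicing
by.logical
positions
```

Both routes return the same values; the logical form is shorter, while the index form is handy when the page numbers themselves are wanted.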

Of course, we could have done all the above at once with this command (but why?)

> (1:length(typos.draft2))[typos.draft2 == max(typos.draft2)]

[1] 2 4

This looks awful and is prone to typos and confusion, but does illustrate how things can be combined into short

powerful statements. This is an important point. To appreciate the use of R you need to understand how one composes

the output of one function or operation with the input of another. In mathematics we call this composition.

Finally, we might want to know how many typos we have, or how many pages still have typos to fix or what the

difference is between drafts? These can all be answered with mathematical functions. For these three questions we

have

> sum(typos.draft2) # How many typos?

[1] 8

> sum(typos.draft2>0) # How many pages with typos?

[1] 4

> typos.draft1 - typos.draft2 # difference between the two

[1] 2 0 0 0 0 0 0 0
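The second command works because R treats TRUE as 1 and FALSE as 0 when doing arithmetic on a logical vector. A short sketch, with the hypothetical name has.typos, makes the intermediate step visible and uses the same idea to get a proportion:

```r
# Sketch: counting and averaging a logical vector.
typos.draft2 = c(0, 3, 0, 3, 1, 0, 0, 1)
has.typos = typos.draft2 > 0    # TRUE for each page with at least one typo
has.typos
sum(has.typos)                  # number of TRUEs: pages with typos
mean(has.typos)                 # fraction of TRUEs: proportion of pages with typos
```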

Example: Keeping track of a stock; adding to the data

Suppose the daily closing price of your favorite stock for two weeks is

45,43,46,48,51,46,50,47,46,45

We can again keep track of this with R using a vector:

> x = c(45,43,46,48,51,46,50,47,46,45)

> mean(x) # the mean

[1] 46.7


> median(x) # the median

[1] 46

> max(x) # the maximum or largest value

[1] 51

> min(x) # the minimum value

[1] 43

This illustrates that many interesting functions can be found easily. Let’s see how we can do some others. First, let’s

add the next two weeks’ worth of data to x. This was

48,49,51,50,49,41,40,38,35,40

We can add this several ways.

> x = c(x,48,49,51,50,49) # append values to x

> length(x) # how long is x now (it was 10)

[1] 15

> x[16] = 41 # add to a specified index

> x[17:20] = c(40,38,35,40) # add to many specified indices

Notice that we did three different things to add to a vector. All are useful, so let's explain. First we used the c (combine)

function to combine the previous value of x with the next week's numbers. Then we assigned directly to the 16th

index. At the time of the assignment, x had only 15 elements, so this automatically created another one. Finally, we

assigned to a slice of indices. This last approach makes some things very simple to do.
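The three approaches can be tried on a short vector, as in the sketch below. The small vectors y and z are made up for illustration. It also shows one caution: assigning past the end while skipping indices fills the gap with NA. The built-in append() is an alternative when values need to go somewhere other than the end.

```r
x = c(45, 43, 46)          # a short vector for illustration

x = c(x, 48)               # c() appends to the end
x[5] = 51                  # assigning to index length(x) + 1 grows the vector
x[6:7] = c(46, 50)         # assigning to a slice adds several values at once
print(x)

# Caution: skipping indices leaves the gap filled with NA
y = c(1, 2, 3)
y[6] = 10
print(y)                   # 1 2 3 NA NA 10

# append() inserts values after a chosen position
z = append(c(1, 2, 5), c(3, 4), after = 2)
print(z)                   # 1 2 3 4 5
```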

R Basics: Graphical Data Entry Interfaces

There are some other ways to edit data that use a spreadsheet interface. These may be preferable to some

students. Here are examples with annotations

> data.entry(x) # Pops up spreadsheet to edit data

> x = de(x) # same only, doesn’t save changes

> x = edit(x) # uses editor to edit x.

All are easy to use. The main confusion is that the variable x needs to be defined previously. For example

> data.entry(x) # fails. x not defined

Error in de(…, Modes = Modes, Names = Names) :

Object “x” not found

> data.entry(x=c(NA)) # works, x is defined as we go.

Other data entry methods are discussed in the appendix on entering data.

Before we leave this example, let's see how we can compute some other functions of the data. Here are a few examples.

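The moving average simply means to average over some fixed number of consecutive days. Suppose we want the 5-day

moving average (50-day or 100-day is more often used). Here is one way to do so. The slice below averages day through

day+4, so it works for days 1 through 16. A trailing average over the previous 5 days would use x[(day-4):day] and works

for days 5 through 20, as the other days don't have enough data.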

> day = 5;

> mean(x[day:(day+4)])

[1] 48

The trick is that the slice picks out days 5, 6, 7, 8, 9:

> day:(day+4)

[1] 5 6 7 8 9

and the mean takes just those values of x.
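To get the moving average for every day at once, the single-day computation can be wrapped in sapply(), which applies a function to each element of a vector. This is only a sketch of one approach; the names forward and trailing are made up for illustration.

```r
x = c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45,
      48, 49, 51, 50, 49, 41, 40, 38, 35, 40)

# Forward window, as in the example above: day through day + 4
forward = sapply(1:16, function(day) mean(x[day:(day + 4)]))

# Trailing window: the 5 days ending on 'day'
trailing = sapply(5:20, function(day) mean(x[(day - 4):day]))

forward[5]     # 48, matches mean(x[5:9])
trailing[1]    # 46.6, the mean of days 1 through 5
```

Both vectors hold the same 16 averages. They differ only in which day each average is attributed to: the first day of the window or the last.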

What is the maximum value of the stock? This is easy to answer with max(x). However, you may be interested

in a running maximum, or the largest value to date. This too is easy, if you know that R has a built-in function to

handle it. It is called cummax, which takes the cumulative maximum. Here is the result for our 4 weeks' worth

of data along with the similar cummin:

> cummax(x) # running maximum

[1] 45 45 46 48 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51

> cummin(x) # running minimum

[1] 45 43 43 43 43 43 43 43 43 43 43 43 43 43 43 41 40 38 35 35
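The running maximum becomes more useful when combined with other operations. The sketch below finds the days on which the stock set a new high, and the largest drop from a previous high. The names new.high and drop are made up for illustration; duplicated() and which.max() are built-in functions.

```r
x = c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45,
      48, 49, 51, 50, 49, 41, 40, 38, 35, 40)

running.max = cummax(x)

# A new high occurs where the running maximum takes a value for the first time
new.high = which(!duplicated(running.max))
print(new.high)          # 1 3 4 5

# Drop from the highest price seen so far, for each day
drop = running.max - x
max(drop)                # 16
which.max(drop)          # 19, the day the price hit 35
```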

Example: Working with mathematics

R makes it easy to translate mathematics in a natural way once your data is read in. For example, suppose the

yearly number of whales beached in Texas during the period 1990 to 1999 is

74 122 235 111 292 111 211 133 156 79

What is the mean, the variance, the standard deviation? Again, R makes these easy to answer:

> whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

> mean(whale)

[1] 152.4

> var(whale)

[1] 5113.378

> std(whale)

Error: couldn’t find function “std”

> sqrt(var(whale))

[1] 71.50789

> sqrt( sum( (whale - mean(whale))^2 /(length(whale)-1)))

[1] 71.50789

Well, almost! First, one needs to remember the names of the functions. In this case mean is easy to guess, var

is kind of obvious but less so, std is also kind of obvious, but guess what? It isn't there! So some other things were

tried. First, we remember that the standard deviation is the square root of the variance. Finally, the last line illustrates

that R can almost exactly mimic the mathematical formula for the standard deviation:

\[ \mathrm{SD}(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}. \]

Notice the sum is now sum, X̄ is mean(whale) and length(whale) is used instead of n.

Of course, it might be nice to have this available as a built-in function. Since this example is so easy, let's see how

it is done:

> std = function(x) sqrt(var(x))

> std(whale)

[1] 71.50789

The ease of defining your own functions is a very appealing feature of R we will return to.
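A function body can also span several statements when enclosed in braces. As a small sketch, the version below follows the standard deviation formula term by term; std.formula is a hypothetical name chosen for illustration, not a built-in.

```r
whale = c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

# std.formula is a hypothetical name; it follows the formula term by term
std.formula = function(x) {
  n = length(x)                       # number of observations
  deviations = x - mean(x)            # each X_i minus the sample mean
  sqrt(sum(deviations^2) / (n - 1))   # value of the last expression is returned
}

std.formula(whale)                    # agrees with sqrt(var(whale))
```

The value of the last expression in the braces is what the function returns, so no explicit return() is needed here.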

Finally, if we had thought a little harder we might have found the actual built-in sd() command, which gives

> sd(whale)

[1] 71.50789

R Basics: Accessing Data

There are several ways to extract data from a vector. Here is a summary using both slicing and extraction by

a logical vector. Suppose x is the data vector, for example x=1:10.

how many elements?                      length(x)
ith element                             x[2]                          (i = 2)
all but ith element                     x[-2]                         (i = 2)
first k elements                        x[1:5]                        (k = 5)
last k elements                         x[(length(x)-4):length(x)]    (k = 5)
specific elements                       x[c(1,3,5)]                   (1st, 3rd and 5th)
all greater than some value             x[x>3]                        (the value is 3)
bigger than or less than some values    x[ x< -2 | x > 2]
which indices are largest               which(x == max(x))
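The summary can be checked directly with x = 1:10. The sketch below runs each kind of extraction with k = 5; the thresholds 3 and 8 in the two-sided comparison are chosen here only so that the result is non-empty for this x. The built-in head() and tail() are shortcuts for the first and last k elements.

```r
x = 1:10
k = 5

length(x)                          # 10
x[2]                               # 2
x[-2]                              # 1 3 4 5 6 7 8 9 10
x[1:k]                             # 1 2 3 4 5
x[(length(x) - k + 1):length(x)]   # 6 7 8 9 10 (exactly k elements)
x[c(1, 3, 5)]                      # 1 3 5
x[x > 3]                           # 4 5 6 7 8 9 10
x[x < 3 | x > 8]                   # 1 2 9 10
which(x == max(x))                 # 10

# head() and tail() are built-in shortcuts for the first and last k elements
head(x, k)                         # 1 2 3 4 5
tail(x, k)                         # 6 7 8 9 10
```

When comparing against a negative number, keep the space in x < -2, since x<-2 without the space is read as an assignment to x.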

Problems

2.1 Suppose you keep track of your mileage each time you fill up. At your last 8 fill-ups the mileage was

65311 65624 65908 66219 66499 66821 67145 67447

Enter these numbers into R. Use the function diff on the data. What does it give?

> miles = c(65311, 65624, 65908, 66219, 66499, 66821, 67145, 67447)

> x = diff(miles)

You should see the number of miles between fill-ups. Use the max …