This is a vignette for the `R` package *ipsRdbs*. This package contains data sets, programmes and illustrations discussed in the book “Introduction to Probability, Statistics and R: Foundations for Data-Based Sciences” by Sahu (2023).

ipsRdbs 1.0.0

- 1 Introduction
- 2 Data sets
  - 2.1 `beanie`: Age and value of beanie baby toys
  - 2.2 `bill`: Wealth and age of world billionaires
  - 2.3 `bodyfat`: Body fat percentage and skinfold thickness of athletes
  - 2.4 `bombhits`: Number and frequency of bombhits in London
  - 2.5 `cement`: Breaking strength of cement
  - 2.6 `cfail`: Number of weekly computer failures
  - 2.7 `cheese`: Taste of cheese
  - 2.8 `emissions`: Exhaust emissions of cars
  - 2.9 `err_age`: Error in guessing ages from photographs
  - 2.10 `ffood`: Service times in a fast food restaurant
  - 2.11 `gasmileage`: Gas mileage of cars
  - 2.12 `possum`: Body weight and length of possums in Australian regions
  - 2.13 `puffin`: Nesting habits of puffins in Newfoundland
  - 2.14 `rice`: Data set on rice yield
  - 2.15 `wgain`: Weight gain of students starting college
- 3 Illustrated R functions
- 4 Discussion
- References

This package complements the book “Introduction to Probability, Statistics and R: Foundations for Data-Based Sciences” by Sahu (2023). The package distributes the data sets used in the book and provides code illustrating the statistical modelling of these data sets. In addition, the package provides code illustrating various results in probability and statistics. For example, it provides code to simulate the Monty Hall game, which illustrates conditional probability, and gives simulation-based examples of the central limit theorem and the weak law of large numbers. The package thus helps a beginner strengthen their understanding of a few elementary concepts in probability and statistics, and introduces them to linear statistical modelling, i.e., regression and ANOVA, which are among the key foundational concepts of data science and machine learning and, more generally, of data-based sciences.
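The conditional-probability game mentioned above (commonly known as the Monty Hall problem) can be simulated in a few lines of base `R`. The sketch below is only an illustration of the idea and is not the package's own code; the function name `simulate_monty` and its arguments are hypothetical.

```r
# Hypothetical helper (not part of ipsRdbs): simulate the three-door game.
simulate_monty <- function(n_games = 10000, seed = 1) {
  set.seed(seed)
  prize  <- sample(1:3, n_games, replace = TRUE)  # door hiding the prize
  choice <- sample(1:3, n_games, replace = TRUE)  # contestant's first pick
  # The host always opens a non-prize door that the contestant did not pick,
  # so staying wins when the first pick was right and switching wins otherwise.
  stay_wins <- prize == choice
  c(stay = mean(stay_wins), switch = mean(!stay_wins))
}

# Proportion of games won under each strategy
simulate_monty()
```

With a large number of games the two proportions should settle near 1/3 for staying and 2/3 for switching, which is the conditional-probability result the game is meant to illustrate.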

The reader is first instructed to install the `R` software package by searching for `CRAN` on the internet. The reader should then go to the web page https://cran.r-project.org/ and install the latest version of `R` appropriate for their own computer. Please note that `R` cannot be installed on a mobile phone. Once `R` has been installed, the next task is to install the front-end software package `RStudio`, which provides an easier interface to work with `R`.

After installing `R` and `RStudio`, the reader should launch the `RStudio` programme on their computer. This will open up a four-pane window with one pane named ‘Console’ or ‘Terminal’. This pane accepts commands (our instructions) and prints results. For example, the reader may type `2+2` and then hit the Enter key to examine the result. Readers wanting a gentler introduction are encouraged to search the internet for tutorials and videos.

To get started, the reader is asked to install the add-on `R` package `ipsRdbs` simply by issuing the `R` command

`install.packages("ipsRdbs", dependencies=TRUE)`

without making any typing mistakes. If this installation is successful, the reader can issue the following two commands to list all `R` objects (data sets and programmes) included in the package.

```
library(ipsRdbs)
ls("package:ipsRdbs")
```

Note that these commands will produce the intended results only if the package has been successfully installed in the first place.
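If the package is missing, the `library()` call above stops with an error. A slightly more defensive sketch, using base `R`'s `requireNamespace()`, checks for the package first; the variable name `pkg_ok` is purely illustrative.

```r
# Check whether ipsRdbs is installed before trying to attach it.
pkg_ok <- requireNamespace("ipsRdbs", quietly = TRUE)

if (pkg_ok) {
  library(ipsRdbs)
  print(ls("package:ipsRdbs"))  # objects exported by the package
} else {
  message('ipsRdbs is not installed; run install.packages("ipsRdbs", dependencies=TRUE)')
}
```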

All the objects listed in the output of the `ls` command above have associated help files. The reader can obtain information about each of these objects by typing a question mark immediately followed by the object name, e.g. `?butterfly`, or by issuing the command `help(butterfly)`.

The help files provide details about the objects, and the user is able to run all the code included as illustrations at the end of each help file. This can be done either by clicking the `Run Examples` link or simply by copy-pasting all the commands into the console in `RStudio`. This is a great advantage of `R`, as it allows users to reproduce the results without first having to learn all the commands and their syntax. After gaining this confidence, a beginner can examine and experiment with the commands further. More details regarding the objects are provided in the book by Sahu (2023).
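The illustrations at the end of a help file can also be run directly from the console with base `R`'s `example()` function. The sketch below assumes that the package is installed and that the help page for `beanie` carries an examples section, as described above; it does nothing if the package is unavailable. The variable name `has_pkg` is illustrative.

```r
has_pkg <- requireNamespace("ipsRdbs", quietly = TRUE)

if (has_pkg) {
  # Run the illustration code stored in the help file for `beanie`,
  # echoing each command and its output to the console.
  example("beanie", package = "ipsRdbs", ask = FALSE)
}
```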

The remainder of this vignette simply elaborates on the help files for all the main objects and programmes included in the package. The main intention here is to enable the reader to reproduce all the results by actually running the commands and the code already included in the help files.

Section 2 discusses all the data sets. All the `R` functions are discussed in Section 3. Some summary remarks are provided in Section 4.

`beanie`: Age and value of beanie baby toys

This data set contains the age and the value of 50 beanie baby toys. Source: Beanie World magazine. This data set has been used as an example of simple linear regression modelling, where the exercise is to predict the value of a beanie baby toy from its age.

```
head(beanie)
#>      name age value
#> 1    Ally  52    55
#> 2   Batty  12    12
#> 3   Bongo  28    40
#> 4 Blackie  52    10
#> 5   Bucky  40    45
#> 6  Bumble  28   600
summary(beanie)
#>      name                age            value
#>  Length:50          Min.   : 5.00   Min.   :  10.0
#>  Class :character   1st Qu.:12.00   1st Qu.:  15.0
#>  Mode  :character   Median :28.00   Median :  26.5
#>                     Mean   :26.52   Mean   : 128.9
#>                     3rd Qu.:40.00   3rd Qu.:  62.5
#>                     Max.   :64.00   Max.   :1900.0
plot(beanie$age, beanie$value, xlab="Age", ylab="Value", pch="*", col="red")
```
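The regression exercise described above can be sketched with base `R`'s `lm()`. This is a minimal illustration rather than the analysis given in the book; the helper name `fit_value_on_age` is hypothetical, and the block only touches the data if the package is installed.

```r
# Hypothetical helper: regress value on age for a data frame with columns
# `age` and `value`, and overlay the fitted line on the scatter plot.
fit_value_on_age <- function(df) {
  fit <- lm(value ~ age, data = df)
  plot(df$age, df$value, xlab = "Age", ylab = "Value", pch = "*", col = "red")
  abline(fit, col = "blue")  # least-squares line
  fit
}

if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  fit <- fit_value_on_age(ipsRdbs::beanie)
  print(coef(fit))  # estimated intercept and slope
  # Predicted value for a toy with age 36 (in the units used by the data set)
  print(predict(fit, newdata = data.frame(age = 36)))
}
```

The summary output above shows a maximum value of 1900 against a median of 26.5, so a few toys are likely to have a strong influence on a straight-line fit; inspecting the plot before trusting the coefficients is advisable.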

`bill`: Wealth and age of world billionaires

This data set contains the wealth, age and region of 225 billionaires in 1992, as reported in Fortune magazine. This data set can be used to illustrate exploratory data analysis by producing side-by-side box plots of wealth for billionaires from different continents of the world. It can also be used for multiple linear regression models, although such tasks have not been undertaken here.

```
head(bill)
#>   wealth age region
#> 1   37.0  50      M
#> 2   24.0  88      U
#> 3   14.0  64      A
#> 4   13.0  63      U
#> 5   13.0  66      U
#> 6   11.7  72      E
summary(bill)
#>      wealth            age         region
#>  Min.   : 1.000   Min.   :  7.00   A:37
#>  1st Qu.: 1.300   1st Qu.: 56.00   E:76
#>  Median : 1.800   Median : 65.00   M:22
#>  Mean   : 2.726   Mean   : 64.03   O:28
#>  3rd Qu.: 3.000   3rd Qu.: 72.00   U:62
#>  Max.   :37.000   Max.   :102.00
library(ggplot2)
gg <- ggplot2::ggplot(data=bill, aes(x=age, y=wealth)) +
  geom_point(aes(col=region, size=wealth)) +
  geom_smooth(method="loess", se=FALSE) +
  xlim(c(7, 102)) +
  ylim(c(1, 37)) +
  labs(subtitle="Wealth vs Age of Billionaires",
       y="Wealth (Billion US $)", x="Age",
       title="Scatterplot", caption = "Source: Fortune Magazine, 1992.")
plot(gg)
#> `geom_smooth()` using formula = 'y ~ x'
```
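The side-by-side box plots mentioned in the description of `bill` can be sketched with base `R`'s `boxplot()`. The helper name `wealth_by_region` is hypothetical, and the block only touches the data if the package is installed.

```r
# Hypothetical helper: side-by-side box plots of wealth by region for a data
# frame with columns `wealth` and `region`. boxplot() invisibly returns the
# statistics it draws, which are passed back to the caller.
wealth_by_region <- function(df) {
  boxplot(wealth ~ region, data = df, col = "lightblue",
          xlab = "Region", ylab = "Wealth (Billion US $)",
          main = "Wealth of billionaires by region")
}

if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  stats <- wealth_by_region(ipsRdbs::bill)
  print(stats$n)  # number of billionaires in each region
}
```

Wealth is heavily right-skewed in the summary above (median 1.8, maximum 37), so most of each box will sit near the bottom of the plot with a few extreme points above it.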

`bodyfat`: Body fat percentage and skinfold thickness of athletes

This data set contains body fat percentage data for 102 elite male athletes training at the Australian Institute of Sport. This data set has been used to illustrate simple linear regression in Chapter 17 of the book by Sahu (2023).

```
summary(bodyfat)
#>     Skinfold         Bodyfat
#>  Min.   : 28.00   Min.   : 5.630
#>  1st Qu.: 37.52   1st Qu.: 6.968
#>  Median : 47.70   Median : 8.625
#>  Mean   : 51.42   Mean   : 9.251
#>  3rd Qu.: 58.15   3rd Qu.:10.010
#>  Max.   :113.50   Max.   :19.940
plot(bodyfat$Skinfold, bodyfat$Bodyfat, xlab="Skin", ylab="Fat")
```

`plot(bodyfat$Skinfold, log(bodyfat$Bodyfat), xlab="Skin", ylab="log Fat")`

`plot(log(bodyfat$Skinfold), log(bodyfat$Bodyfat), xlab="log Skin", ylab="log Fat")`

```
# Keep the transformed variables in the data set
bodyfat$logskin <- log(bodyfat$Skinfold)
bodyfat$logbfat <- log(bodyfat$Bodyfat)
# Create a grouped variable by cutting log-skinfold into six intervals
bodyfat$cutskin <- cut(bodyfat$logskin, breaks=6)
# Side-by-side box plots of body fat percentage for each skinfold group
boxplot(Bodyfat ~ cutskin, data=bodyfat, col=2:7)
```

```
library(ggplot2)
# Box plots of log body fat by skinfold group, with a line joining group means
p2 <- ggplot(data=bodyfat, aes(x=cutskin, y=logbfat)) +
  geom_boxplot(col=2:7) +
  stat_summary(fun=mean, geom="line", aes(group=1), col="blue", linewidth=1) +
  labs(x="Grouped log skinfold", y="Log of bodyfat percentage",
       title="Boxplot of log-bodyfat percentage vs grouped log-skinfold")
plot(p2)
```
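Since the code above stores log-transformed versions of both variables, one natural sketch of the simple linear regression mentioned in the description is a straight-line fit on the log-log scale. This is an assumption about the form of the model and not necessarily the exact analysis in Chapter 17; the helper name `fit_loglog` is hypothetical.

```r
# Hypothetical helper: simple linear regression of log body fat on log
# skinfold for a data frame with columns `Skinfold` and `Bodyfat`.
fit_loglog <- function(df) {
  lm(log(Bodyfat) ~ log(Skinfold), data = df)
}

if (requireNamespace("ipsRdbs", quietly = TRUE)) {
  fit <- fit_loglog(ipsRdbs::bodyfat)
  # Estimates, standard errors, t values and p values
  print(summary(fit)$coefficients)
}
```

On this scale the slope is read as the approximate percentage change in body fat associated with a one percent change in skinfold thickness.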