Red Sox

In this investigation I analyze Boston Red Sox statistics. I use the number of hits the team got in each season and the number of games it won from 2000-2005 to create a regression line. Using this regression line I then calculate a prediction for the next 5 years (2006-2010), and with the Chi-squared test I evaluate how accurate the original regression was and whether it could be used for future predictions.

2000-2005

When creating the 2000-2005 regression, 2003 appeared as an outlier and was therefore removed from the data used for the regression line. The full set of data has a relatively low R-value, which reflects 2003 being an outlier.

Once the statistics for 2003 were removed from the data set, the R-value was much higher, so the regression line fits the remaining seasons much more closely.

Using the regression line calculated from the 2000-2005 data without 2003 I then predicted the number of games the Red Sox should have won from 2006-2010. Then I calculated the \chi ^2 value.
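The \chi ^2 calculation can be sketched in R. This is a minimal sketch, assuming the statistic is the usual sum of (observed - expected)^2 / expected over the predicted seasons; the post does not spell out the exact form used. The function name chi_sq_stat and the win totals below are hypothetical placeholders for illustration, not the actual Red Sox figures.

```r
# Chi-squared statistic comparing observed values with predicted values.
chi_sq_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Hypothetical placeholder values, NOT real Red Sox data.
predicted_wins <- c(90, 92, 94, 96, 98)  # stand-in for regression predictions
actual_wins    <- c(88, 95, 93, 97, 95)  # stand-in for actual win totals

chi_sq <- chi_sq_stat(actual_wins, predicted_wins)
chi_sq
```

A small value means the predicted and actual totals are close. The value would then be compared with a chi-squared critical value for the appropriate degrees of freedom.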


Chapter 4

In this chapter we are given data regarding complementary palindromes from the DNA of the human cytomegalovirus (CMV). Overall there were 296 palindromes of at least length 10. The first investigation compares the 296 palindromes to random data on a scatter plot.


Data for 100 sets of Randomly Generated Numbers:

> mean= 0.996148

> median= 0.9963839

> standard deviation= 0.001450342

> skewness= -0.9413763

> kurtosis= 4.706211

> max= 0.9985346

> min= 0.9903869

95% Confidence Interval

\mu - \frac{2\sigma}{\sqrt n} | \mu | \mu + \frac{2\sigma}{\sqrt n}
0.995857932 | 0.996148 | 0.996438068


pg 87.
296 palindromes, 57 intervals
\lambda = \frac {296}{57} \approx 5.19
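The rate calculation can be reproduced in R. This is a minimal sketch, assuming \lambda is meant as the average number of palindromes per interval and that it would be used as the parameter of a Poisson model for the counts per interval; the post only gives the ratio itself. The variable names are hypothetical.

```r
n_palindromes <- 296
n_intervals   <- 57

# Average number of palindromes per interval.
lambda <- n_palindromes / n_intervals

# Assumption: under a Poisson model with this rate, the expected number of
# intervals (out of 57) containing exactly k palindromes, for k = 0..10.
k <- 0:10
expected_intervals <- n_intervals * dpois(k, lambda)

lambda
round(expected_intervals, 2)
```

These expected counts could then be compared with the observed number of intervals containing each count.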


Chapter 7

Purpose: To use the given Pre and Post Size data for Dungeness crab growth to find a regression line and an R^2 value. The Post Size data acts as the independent variable (x-axis) and the Pre Size data acts as the dependent variable (y-axis). The calculated R value then indicates how close the data is to lying exactly on the regression line. If R = \pm 1 then the points lie exactly on the regression line, and the sign of R matches the sign of the slope. If |R| is close to 1 then the regression equation can be used along with the second set of Post Size data to calculate a prediction for a set of Pre Size data.


From the graph the regression line is y = 1.0732x - 25.214

The graph also gives R^2 = 0.98083, and therefore R = 0.99036. This R value is very close to 1, so the points lie almost exactly on a line.

Analysis of Data

Using the regression line calculated from the given data I then calculated the pre-size of a set of crabs using a second set of given post sizes.
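The prediction step can be sketched in R using the regression line y = 1.0732x - 25.214 from the graph. This is only an illustration: the function name predict_pre_size is hypothetical, and the post sizes below are made-up example values rather than the second data set from the chapter.

```r
# Predicted pre-molt size (mm) from post-molt size (mm),
# using the fitted line y = 1.0732x - 25.214.
predict_pre_size <- function(post_size, slope = 1.0732, intercept = -25.214) {
  slope * post_size + intercept
}

# Hypothetical post sizes in mm, for illustration only.
example_post_sizes <- c(140, 150, 160)
predicted_pre <- predict_pre_size(example_post_sizes)
predicted_pre
```

Applying the same function to the real second set of post sizes would give the predicted pre sizes analyzed in the next step.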

Analysis of Data (Predicted Pre-Size)

Next I broke the original set of given data down into 5 mm increments. I analyzed the data in each of the 5 increments by calculating the mean, median, min, max, standard deviation, skewness, kurtosis and the count of the number of points in the increment. From each of the graphs I also took the regression line and the R and R^2 values.

132.5-137.4

137.5-142.4

142.5-147.4

147.5-152.4

152.5-157.4

Each of the 5 increments has an R value that is not very close to 1, meaning that many of the points do not lie close to the regression line.

Next, to get a more accurate regression line, I used the midpoint of the x-axis range and the mean of the y values from each of the increments.
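The midpoint approach can be sketched in R. This is a minimal sketch under stated assumptions: the post and pre vectors below are hypothetical values made up for illustration (not the crab data), and the bins are taken as the five 5 mm increments listed above, treated as half-open intervals such as [132.5, 137.5).

```r
# Hypothetical post/pre sizes in mm, for illustration only.
post <- c(133.0, 135.2, 138.1, 141.0, 143.5, 146.9, 148.2, 151.7, 153.0, 156.4)
pre  <- c(117.5, 120.1, 122.8, 126.3, 128.9, 132.0, 133.6, 137.9, 139.2, 142.5)

# Five 5 mm increments: [132.5, 137.5), [137.5, 142.5), ..., [152.5, 157.5).
breaks <- seq(132.5, 157.5, by = 5)
bin <- cut(post, breaks = breaks, right = FALSE)

# One point per increment: midpoint of the x range, mean of the y values.
midpoints <- head(breaks, -1) + 2.5
mean_pre  <- as.numeric(tapply(pre, bin, mean))

# Regression line through the five (midpoint, mean) points.
fit <- lm(mean_pre ~ midpoints)
coef(fit)
```

Averaging within each increment smooths out the scatter of individual crabs, which is why the five summary points tend to sit closer to a line than the raw points do.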


 


Chapter 2

The second chapter’s data comes from a survey of 91 undergraduate students about the amount of time they spend playing video games. The questions in the survey are:

  • time : the amount of time, in hours, spent playing video games the week before the survey.
  • like : students rate on a scale of 1 to 5 how much they like to play video games. 1=Never played, 2=Very much, 3=Somewhat, 4=Not really, 5=Not at all
  • where : where they play video games. 1=Arcade, 2=Home on a system, 3=Home on a computer, 4=Home on computer and system, 5=Arcade and Home (system or computer), 6=Arcade and Home (both system and computer)
  • freq : how often the student plays video games. 1=Daily, 2=Weekly, 3=Monthly, 4=Semesterly
  • busy : do they play video games if they are busy. 0=no, 1=yes
  • educ : is playing video games educational. 0=no, 1=yes
  • sex : 0=female, 1=male
  • age : age in years
  • home : computer at home. 0=no, 1=yes
  • math : hate math. 0=no, 1=yes
  • work : hours worked for pay in the week prior to survey
  • own : own PC. 0=no, 1=yes
  • CDROM : owned PC has a CDROM. 0=no, 1=yes
  • email : has an email account
  • grade : grade expected in course. 4=A, 3=B, 2=C, 1=D, 0=F.

The fraction of students who played video games the week before the survey is \frac {34}{91} \approx 0.37.

time <- c (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 1.5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 5, 14, 14, 30)

mean(time) = 1.242857

sd(time, na.rm=FALSE) = 3.77704

95% confidence interval (0.45, 2.03)

 

\mu - \frac{2\sigma}{\sqrt n} | \mu | \mu + \frac{2\sigma}{\sqrt n}
0.450974375 | 1.242857143 | 2.034739911

 

Using Excel I calculated the mean of the data set and from that I calculated the 95% confidence interval.
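The same numbers can be reproduced in R instead of Excel. This is a minimal sketch: it rebuilds the time vector compactly with rep() and uses the mean plus or minus 2 standard errors, the same approximation as the formula above (a multiplier of 1.96 would give a very slightly narrower interval). The variable names other than time are hypothetical.

```r
# Hours played, rebuilt compactly from the data listed above (n = 91).
time <- c(rep(0, 57), 0.1, rep(0.5, 5), rep(1, 5), 1.5,
          rep(2, 14), rep(3, 3), 4, 5, 14, 14, 30)

n <- length(time)
n_played <- sum(time > 0)      # students who played at all
frac_played <- n_played / n

m <- mean(time)
s <- sd(time)

# Approximate 95% confidence interval: mean +/- 2 * sd / sqrt(n).
half_width <- 2 * s / sqrt(n)
ci <- c(lower = m - half_width, upper = m + half_width)

frac_played
ci
```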



Chapter 1

In this chapter we use data comparing the birthweight of babies with smoking and nonsmoking mothers. First, after using Excel to sort the data, I stored it in R as smoker and nonsmoker.

smoker <- c(58, 65, 68, 69, 71, 71, 71, 72, 75, 75, 75, 75, 77, 77, 78, 80, 81, 81, 82, 82, 83, 84, 84, 85, 85, 85, 86, 86, 86, 86, 87, 87, 87, 87, 87, 87, 88, 88, 88, 88, 90, 90, 91, 91, 91, 91, 91, 91, 91, 92, 92, 92, 93, 93, 93, 93, 93, 93, 94, 94, 94, 94, 95, 95, 95, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 97, 97, 97, 97, 97, 97, 97, 97, 98, 98, 98, 98, 98, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102, 102, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 104, 105, 105, 105, 105, 105, 105, 105, 105, 105, 106, 106, 106, 106, 106, 106, 107, 107, 107, 107, 107, 107, 107, 108, 108, 108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 110, 110, 110, 110, 110, 111, 111, 111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118, 118, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 121, 121, 121, 121, 121, 121, 121, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 124, 125, 125, 125, 125, 125, 125, 125, 125, 125, 126, 126, 126, 126, 126, 126, 126, 127, 127, 127, 127, 127, 127, 127, 127, 127, 128, 128, 128, 128, 128, 128, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 131, 131, 131, 131, 131, 131, 131, 131, 131, 132, 132, 132, 132, 132, 133, 
133, 133, 133, 133, 133, 134, 134, 134, 134, 135, 135, 136, 136, 136, 136, 137, 137, 138, 138, 138, 138, 139, 139, 140, 140, 141, 141, 141, 141, 141, 142, 142, 143, 143, 143, 143, 143, 144, 144, 144, 144, 144, 144, 144, 144, 145, 145, 145, 145, 145, 146, 146, 148, 150, 150, 152, 152, 153, 154, 154, 154, 155, 156, 160, 160, 161, 163)

nonsmoker <- c(55, 62, 63, 65, 71, 71, 72, 73, 75, 78, 78, 79, 80, 81, 84, 84, 84, 85, 85, 85, 85, 87, 88, 89, 89, 90, 90, 91, 91, 91, 92, 93, 93, 93, 93, 94, 95, 96, 96, 97, 97, 97, 97, 97, 98, 98, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 103, 103, 103, 103, 103, 104, 104, 104, 104, 104, 104, 104, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 106, 106, 106, 106, 106, 107, 107, 107, 107, 107, 107, 107, 107, 107, 108, 108, 108, 108, 108, 108, 108, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 112, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 121, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 124, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 
125, 125, 125, 125, 125, 125, 125, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 126, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 130, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 132, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 134, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 137, 137, 137, 137, 137, 137, 137, 137, 137, 137, 137, 137, 137, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 139, 139, 139, 139, 139, 139, 139, 139, 139, 139, 139, 139, 139, 140, 140, 140, 140, 140, 140, 140, 141, 141, 141, 141, 141, 141, 142, 142, 142, 142, 142, 142, 142, 143, 143, 143, 143, 143, 143, 143, 143, 144, 144, 144, 144, 144, 144, 144, 144, 144, 145, 145, 145, 145, 145, 146, 146, 146, 146, 146, 146, 146, 146, 147, 147, 147, 147, 147, 147, 148, 148, 148, 149, 149, 149, 150, 150, 150, 150, 150, 150, 151, 151, 152, 152, 152, 152, 153, 154, 154, 155, 155, 155, 155, 155, 156, 157, 158, 158, 158, 158, 159, 160, 160, 160, 162, 163, 163, 164, 165, 166, 167, 169, 170, 173, 174, 174, 174, 176)

Next I created a histogram for each set of data using the command hist(smoker) and hist(nonsmoker).

Next I used R to calculate the mean, standard deviation, median, quartiles, max, min, skewness and kurtosis of the two data sets.

The mean of a data set is also known as the average. To calculate the mean in R I used the commands mean(smoker) and mean(nonsmoker). The standard deviation measures how much the data varies around the mean. To calculate the standard deviation I used the commands sd(smoker) and sd(nonsmoker). The median of a data set is the value in the middle of the sorted data. I calculated the median using the commands median(smoker) and median(nonsmoker). The maximum and minimum are the highest and lowest values of a data set. To calculate them I used the commands max(smoker), min(smoker), max(nonsmoker) and min(nonsmoker).

The skewness of a data set describes how asymmetric its distribution is. A skewness of zero means the data is distributed roughly symmetrically about the mean. A negative skewness means the distribution has a longer tail on the left, so most of the data, usually including the median, is to the right of the mean. A positive skewness means the distribution has a longer tail on the right, so most of the data, usually including the median, is to the left of the mean. To calculate the skewness I used the commands skewness(smoker) and skewness(nonsmoker).

The kurtosis of a data set is a measure of whether it is peaked or flat. Data sets with a higher kurtosis usually have a sharper peak near the mean, and likewise data sets with a lower kurtosis generally have a flatter top. On the scale reported here, a normal distribution has a kurtosis of 3. To calculate the kurtosis I used the commands kurtosis(smoker) and kurtosis(nonsmoker). The skewness and kurtosis functions come from the package “moments”, which must be installed in R and then loaded with the command library(moments).

A quartile is one of three points that divide the data into four groups that each represent 25% of the data. To calculate the quartiles I used the commands

duration=smoker

quantile(duration)

duration=nonsmoker

quantile(duration)

statistic | smoker | nonsmoker
mean | 114.1095 | 123.0472
standard deviation | 18.09895 | 17.39869
median | 115 | 123
max | 163 | 176
min | 58 | 55
skewness | -0.03359498 | -0.1869841
kurtosis | 2.988032 | 4.03706

Quartiles:

group | 0% | 25% | 50% | 75% | 100%
smoker | 58 | 102 | 115 | 126 | 163
nonsmoker | 55 | 113 | 123 | 134 | 176
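
The summary statistics can also be computed in one pass. This is a minimal sketch, assuming the moment-based definitions of skewness and kurtosis (third and fourth central moments divided by powers of the second), written in base R so it runs without installing any package. The helper names and the small sample_weights vector are hypothetical; the vector is made up for illustration and is not the birthweight data.

```r
# Moment-based skewness and kurtosis (kurtosis is 3 for a normal distribution).
central_moment  <- function(x, k) mean((x - mean(x))^k)
skewness_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurtosis_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical small sample, NOT the birthweight data.
sample_weights <- c(95, 102, 110, 115, 118, 121, 126, 133, 140)

summary_stats <- c(
  mean     = mean(sample_weights),
  sd       = sd(sample_weights),
  median   = median(sample_weights),
  max      = max(sample_weights),
  min      = min(sample_weights),
  skewness = skewness_manual(sample_weights),
  kurtosis = kurtosis_manual(sample_weights)
)

summary_stats
quantile(sample_weights)  # 0%, 25%, 50%, 75%, 100%
```

Replacing sample_weights with smoker or nonsmoker should give the corresponding column of the table.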