STATISTICS 4365/6364 – HW #4
Due Friday April 2 by midnight
Both problems should be done in both R and Python. Turn in a single R Markdown file with all R code
and results and a single Jupyter Notebook file with all Python code and results.
1. In this problem, you will use simulations to prove that the binomial distribution is correct. Recall
that the binomial distribution has two parameters n and p. There are n trials and each has two
possible outcomes, with probability p for “success” and 1-p for “failure”. The binomial gives the
probability distribution for the number of successes in n trials. You will conduct simulations with
r replicates, where each simulation replicate does n simulated “coin flips”. You will add up the
number of successes in each replicate and compare the result to the true distribution:
i. Generate n*r values from the uniform(0,1) distribution and arrange these in an
r x n matrix. Each value less than p is considered a “success”.
ii. For each row from part i, count the number of successes. The number of
possible successes ranges from 0 to n.
iii. Use the table function in R and the value_counts function in Python to
count up the number of replicates with each number of successes.
iv. Make a table that compares the simulation result to the true binomial
probabilities.
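Steps i through iv can be sketched in Python roughly as follows. This is a minimal sketch, not the required solution: the values of n, p, and r, the seed, and all variable names are illustrative assumptions, and scipy.stats.binom is assumed as the source of the true probabilities. It also assumes every count from 0 to n shows up at least once; any count that never appears would show as NaN in the simulated column.

```python
import numpy as np
import pandas as pd
from scipy.stats import binom

# Illustrative values (assumptions, not prescribed by the assignment)
n, p, r = 10, 0.5, 100_000
rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility

# Step i: r x n matrix of uniform(0,1) draws; a value below p is a "success"
success = rng.random((r, n)) < p

# Step ii: number of successes in each replicate (one replicate per row)
counts = success.sum(axis=1)

# Step iii: number of replicates observed for each success count
freq = pd.Series(counts).value_counts().sort_index()

# Step iv: simulated proportions next to the true binomial probabilities,
# aligned on the number of successes (the index)
k = np.arange(n + 1)
comparison = pd.DataFrame({
    "simulated": freq / r,
    "exact": pd.Series(binom.pmf(k, n, p), index=k),
})
comparison.index.name = "successes"
print(comparison)
```

The row sums replace an explicit loop over replicates, and the index alignment in the DataFrame constructor pairs each simulated proportion with its exact probability.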
Note #1: You should make the calculation as “vectorized” as possible. This means that you
should do it without the use of loops.
Note #2: Things can be slightly more complicated if some possible values for number of
successes don’t actually appear in your simulations. This will happen if your number of trials is
too large, your value of p is too far from 0.5, or your number of simulation replicates is too
small. For example, if you have n=1000 and p=0.01, you are very unlikely to ever get 1000
successes. The coding is a bit more complicated in this case. However, if you limit things to n <= 15,
0.4 <= p <= 0.6, and r = 1,000,000, then you shouldn’t have any problems.
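One possible way to handle the situation in Note #2 in Python is to expand the tally to every count from 0 to n and fill the missing ones with zero. The sketch below assumes deliberately "unsafe" values of n, p, and r (chosen only for illustration) so that some counts never occur; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative values outside the "safe" range, so that large counts never occur
n, p, r = 20, 0.1, 500
rng = np.random.default_rng(seed=2)  # arbitrary seed

# One row per replicate, one column per simulated coin flip
counts = (rng.random((r, n)) < p).sum(axis=1)

# value_counts only reports the success counts that were actually observed
freq = pd.Series(counts).value_counts().sort_index()

# Expand to every possible count 0..n, with 0 for the ones never observed
full = freq.reindex(np.arange(n + 1), fill_value=0)

# A NumPy-only tally that gives the same result
alt = np.bincount(counts, minlength=n + 1)

print(len(freq), len(full))
```

After this step the tally always has n + 1 entries, so it lines up with the vector of true binomial probabilities.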
2. The point of this problem is to practice using vectorized calculation. Thus, you should not use
any loops in completing the problem. Make a data frame consisting of 20 rows and 10 columns. Each
column j should consist of 20 values from a normal distribution with mean (j-1) and standard
deviation 0.5j. For example, the third column should be normal(mean=2, sd=1.5). Using this data
frame, do each of the following (using code, of course):
a. Find the mean and standard deviation for each column.
b. Write code that counts the number of columns for which the sample mean and sample
standard deviation are within 20% of the values used to generate the data.
c. Write code that writes the columns from part b to a new data frame.
d. For each value in the new data frame, subtract its column mean and divide by the
column standard deviation. Do NOT use the scale function in R, the zscore function in
Python, or any function that does this automatically.
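A loop-free Python sketch of problem 2 might look like the following. It is one possible reading, not the required solution: the seed and variable names are arbitrary, and "within 20%" is interpreted here as a relative tolerance, |estimate - true| <= 0.2 * |true|. Under that assumed interpretation the first column (true mean 0) can only qualify if its sample mean is exactly 0, so a different tolerance rule may be intended for that column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # arbitrary seed
j = np.arange(1, 11)                 # column numbers 1..10
mu = j - 1.0                         # generating means 0, 1, ..., 9
sigma = 0.5 * j                      # generating sds 0.5, 1.0, ..., 5.0

# Broadcasting applies the length-10 mu and sigma column by column
df = pd.DataFrame(rng.normal(loc=mu, scale=sigma, size=(20, 10)), columns=j)

# a. Column means and sample standard deviations (pandas uses ddof=1)
means = df.mean()
sds = df.std()

# b. Columns whose mean and sd are both within 20% (relative, an assumption)
mean_ok = (means - mu).abs() <= 0.2 * np.abs(mu)
sd_ok = (sds - sigma).abs() <= 0.2 * sigma
keep = mean_ok & sd_ok
n_keep = int(keep.sum())

# c. Copy the qualifying columns to a new data frame
df_new = df.loc[:, keep]

# d. Standardize each value by its own column's mean and sd
z = (df_new - df_new.mean()) / df_new.std()

print(n_keep)
```

Subtracting a Series of column means from a data frame aligns on column labels, which is what lets part d run without a loop.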