Winter 2023 MATH 341/345 Project
Deployments of Safety Cars in Formula One in XXXXXXXXXX
(Version 1. February 5, 2023)
Introduction:
This project aims at modeling the frequencies of safety car deployments per race in Formula
One and the time intervals between safety car deployments in XXXXXXXXXX A safety car in
Formula One is deployed when the “yellow flags” are waved by the marshals and the Race
Director decides that it is necessary to remove any hazards on the race track or that the racing
cars need to slow down due to unfavorable track conditions (e.g., heavy rain). When a safety
car is deployed, in addition to the yellow flags, each driver sees “SC” boards on the sides of the
track. Moreover, the same information is displayed on the steering wheel of each racing car.
Safety cars and yellow flags are important components of Formula One racing to protect
drivers’ and marshals’ lives. When the safety car is leading the race, each racing car needs to
bunch up and follow the safety car without overtaking any other cars, unless they are allowed
to unlap themselves. As the safety car goes around the track at a much slower speed than the
normal racing pace, marshals can quickly remove any hazards on the track and improve the
track condition without worrying about fast-moving racing cars.
However, even with strict regulations under the yellow flag condition, accidents happen,
especially during wet weather races. A notable recent incident happened at the 2014 Japanese
Grand Prix, when a very promising young French driver Jules Bianchi of Marussia collided with a
tractor crane under the “double yellow flag” condition. A “double yellow flag” condition
indicates that marshals may be present on the track and the driver needs to prepare to stop, if
necessary. Bianchi lost control of the car due to aquaplaning on the wet surface and suffered a
fatal injury as a result of the collision with the tractor crane.
The FIA (governing body of the Formula One races) took the incident very seriously and
implemented a number of safety measures. One of them was the introduction of the “virtual
safety car (VSC)”. Under the VSC condition, each driver needs to slow down their car to the posted
speed limit, usually resulting in a 35 to 40% speed reduction. Because it is a “virtual” safety car,
under VSC, the actual safety car is not deployed; rather, each racing car is equipped with a
device that automatically slows the car down to the posted speed limit under VSC.
Even with the introduction of VSC in 2015, under severe conditions, safety cars are deployed
once in a while. Here, an interesting question arises: Did the introduction of VSC change the
frequency of safety car deployments? This is an important question to answer for race
strategists, as the deployment of a safety car means that each team needs to react quickly to
adjust their tire strategies. Each driver is required to make at least one pit stop to change their
tires during the race, and a pit stop under the safety car condition implies that they can save
about 20 seconds, possibly gaining several precious positions in the race without overtaking. At
the same time, fresh tires typically make the racing car more drivable, increasing the chances of
catching and overtaking the other racing cars in front after the pit stop.
Related article: https://www.mclaren.com/racing/2019/canadian-grand-prix/how-make-right-call-safety-car
Note that the importance of understanding probability is emphasized in this article.
Your Tasks in This Project:
Your main task in this project is to analyze the safety car deployment data in Formula One to
determine whether there are any changes in the frequency of safety car deployments between
the pre-VSC era XXXXXXXXXX and post-VSC era XXXXXXXXXX That involves fitting reasonable
distribution(s) to the data for the number of safety car deployments per race and time intervals
between the safety car deployments in these two time periods. Then, by comparing these two
distributions, you are asked to conclude whether strategic adjustments were necessary to
account for increased/decreased safety car deployments after VSC was introduced in 2015. The
dataset was originally retrieved from Kaggle
(https://www.kaggle.com/datasets/jtrotman/formula-1-race-events), but it was further
augmented by adding Type, Round, TotalRounds, TotalLaps, and Condition. These additional
pieces of information were taken from the Wikipedia entries for the Formula One races.
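
As a rough illustration of what fitting a distribution means here, the following R sketch uses simulated stand-in data rather than the real dataset. The variable names (counts, gaps) and the value 0.6 are assumptions made for illustration only; the provided R program may organize this differently.

```r
# Simulated stand-in data; NOT the real safety car dataset.
set.seed(341)
counts <- rpois(100, lambda = 0.6)  # hypothetical deployments per race
gaps   <- rexp(60, rate = 0.6)      # hypothetical intervals, in race units

# Maximum-likelihood estimates: the sample mean for the Poisson rate,
# and 1 / (sample mean) for the exponential rate.
lambda_hat <- mean(counts)
rate_hat   <- 1 / mean(gaps)

# Compare observed relative frequencies with the fitted Poisson pmf.
k        <- 0:max(counts)
observed <- as.numeric(table(factor(counts, levels = k))) / length(counts)
fitted   <- dpois(k, lambda = lambda_hat)
print(round(cbind(k, observed, fitted), 3))
```

The same comparison can be repeated for each time period, after which the two fitted rates can be set side by side.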
A thorough and complete analysis of the main task above is sufficient to receive full credit for
this project. That is, you are not required to do any additional programming beyond what is
given unless you choose to do so. However, you are probably interested in doing a more detailed
analysis of the dataset to make your analysis useful and interesting for the participating
Formula One teams. To help you analyze the dataset in more detail, the dataset provided
(augmented_safety_cars.csv) contains additional information such as type of the circuit
(permanent or street) and track condition (dry, mixed, or wet).
In addition, you will be asked to watch an interesting video titled “What Does An F1 Strategist
Do?” (https://youtu.be/4CFkltWIc8o) so that you can see what Formula One strategists actually
do before, during, and after each race. At the same time, you will see how they interact with
racers, mechanics, race engineers, data analysts, and team principals.
How This Project Works:
This project consists of three parts: Probability Questions, Statistics Questions, and a project
write-up. For the Probability and Statistics Questions, you need to answer the questions given
below. For the project write-up, you may choose to summarize the results based on the R code
given. However, to make the project more interesting, you are encouraged to carry out
additional analysis. If you find anything interesting, you may choose to write about your
interesting finding(s) instead. To make sure that what you decide to write in your write-up is
appropriate, please talk to the instructor before you do anything. The instructor will be happy
to assist you with additional programming if necessary.
Probability Questions (12 Points in Total):
1. Watch “What Does An F1 Strategist Do?” (https://youtu.be/4CFkltWIc8o) and describe
how the Formula One strategist position is related to your major(s) in a paragraph or
two. Note: Everyone on your team needs to write a separate paragraph or two. (2pts)
2. Suppose that you look at each of n different laps in Formula One races. Why is checking
whether or not each of these laps was led by a safety car a binomial experiment?
(2pts)
3. Why is it reasonable to assume that the number of safety car deployments in a fixed
period of time (e.g., five seasons) follows the Poisson distribution (approximately)?
Recall the relationship between the binomial and Poisson distributions, and state what
happens to n (the number of laps) and p (the probability that each lap is led by a safety
car). (2pts)
4. Why is it reasonable to assume that the time intervals between safety car deployments
are (approximately) exponentially distributed? (2pts)
5. Suppose that we consider two time periods of Formula One racing (2010 – 2014 and
2015 – XXXXXXXXXX Is it safe to assume that the number of safety car deployments in each of
these two time periods is independent of each other? In other words, is it reasonable to
say that the number of safety car deployments in 2010 – 2014 does not significantly
influence the number of safety car deployments in 2015 – 2019? Justify. (2pts)
6. Recall the memoryless property of the exponential distribution, which says that
P(X ≥ t₁ + t₂ | X ≥ t₁) = P(X ≥ t₂), t₁ ≥ 0, t₂ ≥ 0, if and only if X is exponentially
distributed. What does this imply regarding the probability that the next safety car
deployment is 5 races from now given that it has been 3 races since the last safety car
deployment? Comment. (2pts)
Note: The above phenomenon is known as “the waiting time paradox”.
7. (Optional) Any questions you have about this project.
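
The two distributional facts referred to in the questions above can be checked numerically. The R sketch below is only an illustration: the rate 0.6 is an arbitrary value, not an estimate from the data, and the symbols n, p, t1, and t2 are assumed names for the number of laps, the per-lap probability, and the two waiting times.

```r
# Binomial(n, p) approaches Poisson(lambda) as n grows with n * p = lambda fixed.
lambda <- 0.6                      # illustrative value, not estimated from data
k <- 0:5
for (n in c(10L, 60L, 6000L)) {
  p <- lambda / n
  max_gap <- max(abs(dbinom(k, size = n, prob = p) - dpois(k, lambda)))
  cat(sprintf("n = %5d, p = %.5f, max |binom - pois| = %.6f\n", n, p, max_gap))
}

# Memoryless property: P(X >= t1 + t2 | X >= t1) equals P(X >= t2).
rate <- 0.6
t1 <- 3
t2 <- 5
surv <- function(t) pexp(t, rate = rate, lower.tail = FALSE)  # P(X > t)
conditional   <- surv(t1 + t2) / surv(t1)
unconditional <- surv(t2)
print(c(conditional = conditional, unconditional = unconditional))
```

The printed gap between the binomial and Poisson probabilities shrinks as n increases, and the two printed exponential probabilities agree. Interpreting what this means for safety car deployments is left to the questions themselves.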
Statistics Questions (Read W23MATH341Project.R and run the program to answer these
questions. Look for “SQ” in the comments in the R code to identify which part of the code is
referring to which question.) (20 Points in Total):
Note: The length of each race is set to 1, which is reasonable given that each race has
approximately the same race distance. According to the data, the first safety car deployment in
2010 occurred at lap 2 of Round 2 (which was a 58-lap race) and the second deployment
occurred at lap 1 of Round 4 (which was a 56-lap race). Thus, the first duration is simply
(Round 1) + (Deployment in Round 2) = 1 + 2/58 = XXXXXXXXXX Then, the second duration, which is the
duration between these two deployments, is given by (Remaining laps in Round 2) + Round 3
+ (Deployment in Round 4) = (1 – 2/ XXXXXXXXXX/56 = XXXXXXXXXX.
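
The arithmetic in this note can be reproduced in R. The sketch below assumes that the calculation follows the verbal description exactly (lap 2 of a 58-lap Round 2, then lap 1 of a 56-lap Round 4); the variable names are illustrative and are not taken from W23MATH341Project.R.

```r
# Each race counts as 1 time unit, so a deployment at lap l of an
# L-lap race sits at l / L within that race.
first_duration  <- 1 + 2 / 58                 # Round 1 + first 2 laps of Round 2
second_duration <- (1 - 2 / 58) + 1 + 1 / 56  # rest of Round 2 + Round 3 + lap 1 of Round 4
print(c(first = first_duration, second = second_duration))
```

As a sanity check, the two durations should add up to the total elapsed time at the second deployment, namely three full races plus 1/56 of Round 4.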
1. Look at the histograms of the number of safety car deployments per race. Do these
histograms suggest that the data are Poisson distributed (approximately)? Or, is there
any clear evidence against that? Comment. (2pts)
2. The best-fit Poisson pmfs, as represented by the blue dotted lines, use
lambda=mean(first_half) and lambda=mean(second_half) for the first and second half of
the 2010’s, respectively. Explain why it makes sense to use these values. (2pts)
3. Look at the histograms of the time intervals between two safety car deployments. Do
these histograms suggest that the data are exponentially distributed (approximately)?
Or, is there any clear evidence against that? Comment. (2pts)
4. The best-fit exponential pdfs, as represented by the blue dotted lines, use
rate=1/mean(interval1) and rate=1/mean(interval2) for the first and second half of the
2010’s, respectively. Explain why it makes sense to use these values. (2pts)
5. Report mean(first_half), mean(interval1), mean(second_half), and mean(interval2).
Then, describe how mean(first_half) and mean(interval1), as well as mean(second_half)
and mean(interval2), are approximately related to each other. After that, explain why
that happens by recalling the distributions you identified for the number of safety car
deployments and the time interval between two safety car deployments. (2pts)
6. Running a two-sample t-test to compare means, or constructing a confidence interval
for the difference in means, using the time interval data may potentially lead to wrong
results. Explain why in terms of normality and independence. (2pts)
7. Explain why the concerns you mentioned in the previous question are actually not
concerning for this dataset. (2pts)
8. The t.test() function in R gives the one- and two-sample t-test results for the mean or
difference in means, including the confidence intervals and p-values. The parameter
var.equal in the t.test() function specifies whether or not the common variance can be
assumed (if yes, TRUE, and otherwise, FALSE). For comparing the time intervals, can we
assume common variance? Comment. Recall that the mean and standard deviation are
equal to each other in the case of exponential distribution. (2pts)
9. Report the results of the t.test() function (95% confidence interval, degrees of freedom
used, and p-value) for the var.equal=TRUE and var.equal=FALSE cases. (2pts)
10. Based on the results above, discuss whether or not there is any statistically significant
change in the distribution of the safety car deployments between these two time
periods. (2pts)
11. (Optional) The Kolmogorov-Smirnov test is a one- and two-sample test that directly
compares the cumulative distribution function(s) of the data. In the one-sample case, a
researcher hypothesizes the underlying distribution and sees how well the cumulative
distribution function (cdf) estimated from the data (known as the empirical cdf) matches
that of the hypothesized distribution. In the two-sample case, the two empirical cdf’s
are directly compared. Do the test results show any evidence of deviation from
the exponential distribution for the time interval data? Also, are these two datasets
significantly different from each other? Justify your conclusion by reporting the p-values
and interpreting these p-values. (Extra credit: 1pt)
12. (Optional) The quantile-quantile (Q-Q) plot is a visual tool to see if the dataset of
interest follows a certain distribution. Although the Q-Q plot is typically used for the
normal distribution, for this project, we use the Q-Q plot for the exponential
distribution. If the points follow a straight line on the Q-Q plot, that is an
indication that the dataset follows the exponential distribution well. Present the Q-Q
plots for the time interval datasets (pre- and post-VSC) and comment. (Extra credit: 1pt)
13. (Optional) Another important aspect of the dataset is the independence of the
observations. A common assumption