Kentucky Derby
There is a horse race called the Kentucky Derby, and people bet on its outcome. Let’s analyse past results to see whether we can get an edge over other bettors.
Plan
◼ Get data
◼ Analyse data
Acquiring Data
I decided to get the data from Wikipedia. After checking the 2020 and 2021 Kentucky Derby pages, I found that each page contains a results table with the data I need. After some tinkering, I came up with data acquisition code that works.
In[]:=
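(* Import the Wikipedia page for the given year (a string) and return the results table, *)
(* i.e. the table whose header row starts with "Finish" *)
getData[year_]:=Module[{data,position},
  data=Import["https://en.wikipedia.org/wiki/"<>year<>"_Kentucky_Derby","Data"];
  (* position of the header row inside the nested page data *)
  position=First@Position[data,{"Finish"|"Finish ",___}];
  (* dropping the last index gives the table that contains the header row *)
  Extract[data,Most[position]]]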
I decided to consider only data from 2015 to 2021. The cutoff is arbitrary: I didn’t want to use data that is too old, because older relationships may no longer be relevant, so I set the time window to the last 7 years.
In[]:=
years=Table[ToString@year,{year,2015,2021}]
Out[]=
{2015,2016,2017,2018,2019,2020,2021}
In[]:=
races={#,getData[#]}&/@years;
In[]:=
cleanHeader[element_]:=StringTrim@StringReplace[element,("[26]"|"[33]")->""];
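The two footnote markers above are enough for the pages used here. A more general variant is sketched below, assuming every marker is a run of digits in square brackets. The name cleanHeaderGeneral is hypothetical and not part of the original notebook.

```wolfram
(* Hypothetical helper: strip any numeric footnote marker such as "[26]" *)
cleanHeaderGeneral[element_String] :=
  StringTrim@StringReplace[element, "[" ~~ (DigitCharacter ..) ~~ "]" -> ""];

cleanHeaderGeneral["Finish[26] "]  (* "Finish" *)
```

This avoids editing the function whenever Wikipedia renumbers its footnotes, at the cost of also removing any legitimate bracketed number in a header.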
In[]:=
addAssociations[headers_,body_,year_]:=AssociationThread[Append[headers[[;;6]],"Year"],Append[#[[;;6]],year]]&/@body;
In[]:=
data=Flatten[addAssociations[cleanHeader/@First@#[[2]],Rest@#[[2]],#[[1]]]&/@races,1];
In[]:=
cleanValues[value_]:=If[StringQ[value],StringTrim[value],value];
In[]:=
cleanData=Map[cleanValues,data[[All,{"Finish","Horse","Jockey","Trainer","Year"}]],{2}];
In[]:=
ds=Dataset[cleanData][Select[If[NumberQ[#Finish],True,!StringContainsQ[#Finish,"also"]]&]];
In[]:=
ds
Out[]=
EDA (Exploratory Data Analysis)
Jockey
Let’s get some domain expertise before delving right into data.
In[]:=
Unfortunately, Wikipedia doesn’t provide metrics for a jockey’s size, fitness level, communication skills or courage, nor does it say whether a jockey lives at a stable.
However, we can find a proxy for hard training: how often a jockey participates in a race. We can think of it as live training, the training that matters.
In[]:=
ds[GroupBy["Jockey"],#[["Finish"]]&/@#&]
Out[]=
From this, we can see that jockeys differ in how much experience they have in this race: some participate more often than others. This may be because jockeys who perform well have a better chance of returning the next year, so survivorship bias may be present in the dataset. It may also indicate that the same jockeys consistently outperform the others and therefore keep returning.
In[]:=
Histogram@ds[GroupBy["Jockey"],Length[#[["Finish"]]&/@#]&]
Out[]=
In[]:=
topJockeys=ds[GroupBy["Jockey"],Count[#[["Finish"]]&/@#,1|2|3]&][Select[#!=0&]]
These are the jockeys who finished in the top 3 between 2015 and 2021.
We can see that jockeys who have taken one of the leading places do seem to participate more often than those who haven’t. We also notice that jockeys who reach the top places rarely appear in the race only once.
Do jockeys win the first race they participate in?
Only 20% of the jockeys participating in 2015-2021 won the first race they rode in within this window. This ignores the fact that they might have ridden in the race before 2015.
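One way to compute such a share is sketched below. It uses made-up toy rows rather than the real dataset, assumes numeric "Year" and "Finish" values, and counts a jockey’s earliest appearance in the window as their first race. The names toyRaces, firstFinish and firstWinShare are hypothetical.

```wolfram
(* Toy rows, NOT real Derby results: one association per starter *)
toyRaces = {
   <|"Year" -> 2015, "Jockey" -> "A", "Finish" -> 1|>,
   <|"Year" -> 2015, "Jockey" -> "B", "Finish" -> 2|>,
   <|"Year" -> 2016, "Jockey" -> "B", "Finish" -> 1|>,
   <|"Year" -> 2016, "Jockey" -> "C", "Finish" -> 5|>,
   <|"Year" -> 2017, "Jockey" -> "C", "Finish" -> 3|>
   };

(* For each jockey, take the finish in the earliest year they appear *)
firstFinish = GroupBy[toyRaces, #Jockey &,
   Function[rows, First[SortBy[rows, #Year &]]["Finish"]]];

(* Share of jockeys whose first appearance in the window was a win *)
firstWinShare = Count[Values[firstFinish], 1]/Length[firstFinish]
(* 1/3 for the toy rows: only jockey A wins on debut *)
```

The denominator here is every jockey in the window. Restricting it to jockeys who won at least once would answer a slightly different question.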
Domain knowledge
To understand how important the relationships we find are, and what to pay attention to, I decided to find some articles that explain the roles jockeys and trainers play.
A jockey is booked to ride a horse by his agent. The booking requires the agreement of the owner and trainer of the racehorse. The jockey is not the sole decision-maker over which horse he rides. However, good riders are sought after and often can pick their horse. - https://horseracingsense.com/how-do-jockeys-choose-which-horses-they-ride/
Jockeys ride the horses on race days and often follow the instructions issued by the horse’s trainer, but sometimes they use their own initiative. Winning a race reflects well on the jockey, while losing can provoke a search for riding errors. - https://www.racingpost.com/guide-to-racing/trainers-and-jockeys/
He can’t do much with a lousy horse, but he can help a great horse win. The best jockeys know an animal’s strengths and weaknesses. Some horses prefer to hang back and break at the last minute, while others, known as speed horses, like to be out front the whole time. Some horses are comfortable running in close quarters and can pass along the rail on the left, while others need more space and pass on the right. A jockey takes these factors into account and adjusts his strategy accordingly. - https://slate.com/news-and-politics/2009/05/do-jockeys-matter-at-all-in-horse-racing.html
Observations
It seems that jockeys follow a “success to the successful” pattern: if you are already successful, you have a better chance of being successful later. It becomes easier because you can choose a better horse and get more opportunities to succeed.
Trainer
A horse trainer or instructor works with horses to ready them for riders, races or shows. They typically are expected to analyze horses’ dispositions to anticipate any possible behavioral problems such as kicking, tossing or biting. Then, they train accordingly to prevent future behavioral problems. Additionally, trainers/instructors assist horses in adapting to gear, acclimating to riding on various terrains and performing various exercises. - https://agexplorer.ffa.org/career/horse-trainer-instructor
In other words, the trainer is responsible in large part for a victory.
Trainers can appear multiple times within the same race because they can have more than one horse running.
We can see that the majority of trainers in our dataset participated only once, while a few outliers participated multiple times.
Number of 1st-place finishes per trainer.
Number of 2nd-place finishes per trainer.
Number of 3rd-place finishes per trainer.
Number of times a trainer’s horse finished 1st, 2nd or 3rd.
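Counts like these can be produced roughly as follows. This is a minimal sketch on made-up toy rows, not the notebook’s actual cells. The names toyStarts, startsPerTrainer and podiumsPerTrainer are hypothetical, and "Finish" is assumed to be numeric.

```wolfram
(* Toy rows, NOT real Derby results *)
toyStarts = {
   <|"Trainer" -> "T1", "Finish" -> 1|>,
   <|"Trainer" -> "T1", "Finish" -> 4|>,
   <|"Trainer" -> "T2", "Finish" -> 2|>,
   <|"Trainer" -> "T1", "Finish" -> 3|>,
   <|"Trainer" -> "T3", "Finish" -> 7|>
   };

(* Number of starters each trainer had *)
startsPerTrainer = Counts[Lookup[toyStarts, "Trainer"]];

(* Number of 1st, 2nd or 3rd places per trainer *)
podiumsPerTrainer = Counts[
   Lookup[Select[toyStarts, MemberQ[{1, 2, 3}, #Finish] &], "Trainer"]];
```

Replacing {1, 2, 3} with a single place, for example {1}, gives the per-place counts.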
Let’s look at the relationship between trainer and jockey
With the naked eye we can see that Todd Pletcher seems to have horses that every jockey wants.
We can also see that all top jockeys rode horses trained by Todd Pletcher.
Let’s find some communities:
◼ We can see that there are community clusters around the trainers. There are some trainers whose horses have been ridden by many jockeys. There are other trainers who appeared only once.
◼ It is interesting to see that Todd Pletcher and Bob Baffert are within the same community. Bob Baffert was banned from entering horses until 2023, and in the previous (2021) race his horse was disqualified, so Brad Cox was awarded first place instead.
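A trainer-jockey graph and its communities can be built along these lines. The sketch assumes the built-in FindGraphCommunities with default settings; the call used in the original notebook is not shown, so this is not necessarily the same method. The pairs are made up, and toyPairs, toyGraph and toyCommunities are hypothetical names.

```wolfram
(* Toy {trainer, jockey} pairs, NOT real Derby data *)
toyPairs = {{"T1", "J1"}, {"T1", "J2"}, {"T1", "J3"},
   {"T2", "J4"}, {"T2", "J5"}, {"T3", "J6"}};

(* One undirected edge per distinct trainer-jockey pairing *)
toyGraph = Graph[UndirectedEdge @@@ DeleteDuplicates[toyPairs],
   VertexLabels -> "Name"];

(* Each community is a list of vertex names (trainers and jockeys mixed) *)
toyCommunities = FindGraphCommunities[toyGraph]
```

Because trainers and jockeys share one vertex set, a person who appears in both roles would collapse into a single vertex. Prefixing the names by role avoids that.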
This shows how many jockeys rode for each trainer. Many trainers had horses ridden by only one jockey, while two trainers had horses ridden by 11 and 12 jockeys.
Let’s see the names of these trainers.
Let’s see how many different trainers each jockey has ridden for.
Most jockeys have ridden for one trainer. However, some jockeys have ridden for 4 or 5 trainers.
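The number of distinct trainers per jockey can be computed as in the sketch below, again on made-up rows. The names toyRides and trainersPerJockey are hypothetical.

```wolfram
(* Toy rows, NOT real Derby data *)
toyRides = {
   <|"Jockey" -> "J1", "Trainer" -> "T1"|>,
   <|"Jockey" -> "J1", "Trainer" -> "T2"|>,
   <|"Jockey" -> "J1", "Trainer" -> "T1"|>,
   <|"Jockey" -> "J2", "Trainer" -> "T1"|>
   };

(* Distinct trainers each jockey has ridden for *)
trainersPerJockey = GroupBy[toyRides, #Jockey &,
   Function[rows, Length[DeleteDuplicates[Lookup[rows, "Trainer"]]]]];

(* Jockeys who have ridden for more than one trainer *)
Select[trainersPerJockey, # > 1 &]
(* <|"J1" -> 2|> *)
```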
Who are the jockeys that use different trainers?
Let’s see if the jockeys that have won before are using different trainers.
In fact, it seems that our top jockeys have ridden horses from different trainers.
Wins per community
We can see that the cluster containing Bob Baffert and Todd Pletcher accounts for more wins than any other community: almost half of all the wins between 2015 and 2021 fall into this first cluster.
However, this also means that slightly more than half of the wins fall outside of this community.
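Wins per community can be tallied as sketched below, assuming each community is a list of names and each win is attributed to the winning trainer. All values are made up, and the names toyClusters, toyWinningTrainers, communityIndex and winsPerCommunity are hypothetical.

```wolfram
(* Toy communities and winners, NOT real Derby data *)
toyClusters = {{"T1", "J1", "J2"}, {"T2", "J3"}};
toyWinningTrainers = {"T1", "T1", "T2"};  (* one entry per race *)

(* Map every member name to the index of its community *)
communityIndex = Association[Flatten[
    MapIndexed[
     Function[{members, idx}, Thread[members -> First[idx]]],
     toyClusters]]];

(* Count wins by the community of the winning trainer *)
winsPerCommunity = Counts[Lookup[communityIndex, toyWinningTrainers]]
(* <|1 -> 2, 2 -> 1|> *)
```

Dividing each count by Length[toyWinningTrainers] turns the tallies into shares of all wins.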
Summary
◼ Successful jockeys are not successful right away (considering only 2015-2021).
◼ Successful jockeys have more freedom in choosing the best horses (success to the successful).
◼ There is a cluster of trainers whose horses account for about 50% of the wins in 2015-2021.
Participants in 2022
Now let’s apply our knowledge and see if we can make some informed predictions about the Kentucky Derby that takes place on May 7, 2022.
Acquire data for 2022
Let’s use our knowledge
Let’s see whether the trainers from the cluster that accounted for about 50% of the wins in past years appear in the 2022 dataset.
For some reason, trainer names are spelled differently in the 2022 dataset, so we need to correct them first.
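One simple way to do such a correction is a hand-written lookup table, sketched below. The spelling variants shown are invented placeholders, not the actual differences found in the 2022 page, and nameFixes and normalizeName are hypothetical names.

```wolfram
(* Hypothetical variant -> canonical spelling; placeholders only *)
nameFixes = <|
   "J. Smith" -> "John Smith",
   "John Smith Jr" -> "John Smith Jr."
   |>;

(* Trim whitespace, then replace known variants; unknown names pass through *)
normalizeName[name_String] :=
  With[{trimmed = StringTrim[name]}, Lookup[nameFixes, trimmed, trimmed]];

normalizeName /@ {" J. Smith ", "Jane Doe"}
(* {"John Smith", "Jane Doe"} *)
```

An explicit table keeps every correction visible, which matters here because a wrong merge of two trainers would silently change the cluster membership.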
Let’s see which trainers from the winning cluster are participating.
Let’s see if we have jockeys that won previously participating:
So here we have two participants who, according to our analysis, might be able to win the race.
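The selection can be expressed as a filter over the 2022 entries, as in the sketch below. It assumes an entry qualifies only when both criteria hold (a trainer from the winning cluster and a jockey who has won before); the notebook does not state how the two criteria were combined. All names and values are made up.

```wolfram
(* Toy 2022 entries, NOT the real field *)
toyEntries2022 = {
   <|"Horse" -> "H1", "Jockey" -> "J1", "Trainer" -> "T1"|>,
   <|"Horse" -> "H2", "Jockey" -> "J2", "Trainer" -> "T1"|>,
   <|"Horse" -> "H3", "Jockey" -> "J1", "Trainer" -> "T9"|>
   };
toyClusterTrainers = {"T1", "T2"};   (* trainers in the winning cluster *)
toyPreviousWinners = {"J1", "J7"};   (* jockeys who won in the window *)

(* Keep entries that satisfy both criteria *)
toyCandidates = Select[toyEntries2022,
   MemberQ[toyClusterTrainers, #Trainer] &&
     MemberQ[toyPreviousWinners, #Jockey] &];

Lookup[toyCandidates, "Horse"]
(* {"H1"} *)
```

Replacing && with || gives the looser reading, where either criterion is enough.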
Improvements
I didn’t use ML algorithms to build a model that predicts the outcome of a race. If I had more time and interest, I would definitely look into how to train such a model. I didn’t do it right away because I don’t believe there is enough data on Wikipedia to make a meaningful prediction.