This data set from kaggle.com is a set of csv files with Formula 1 race data from the past 70 years. Although it does not include the two most recent seasons, it is very thorough and has very few missing values. All tables have appropriate primary keys as well as foreign keys. In addition to the data set being large, thorough, and in an accessible format, I am also very interested in the subject matter and will be very engaged with the research and data. Formula 1 is known as a marvel of engineering and data optimization. Much of the data used by teams is kept private but these publicly known results are important nonetheless. This data set will allow me to analyze many relationships and make assertions about a subject matter I am familiar with.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib import cm
drivers_df = pd.read_csv("./data/drivers.csv", encoding = "ISO-8859-1", parse_dates=["dob"])
drivers_df = drivers_df[drivers_df.columns[0:-1]]
drivers_df.head()
I am particularly interested in analyzing the relationships that influence driver success in relation to their team situation and the situations of individual races. For teams, I am interested to see if there is a correlation between driver nationality and team nationality as well as the age of a team versus driver experience. For individual races there are many specific statistics that are not often brought up, and commentators and analysts tend to focus on the top drivers without talking about all of the racers. I'd like to analyze how drivers perform at tracks in their country as well as the country of their team and look at more specific stats like time of race, time zone of the track, and track characteristics to speculate how drivers may perform at new tracks added to the race calendar since 3 new tracks are being added in 2020 alone.
The scoring system for Formula 1 has been changed many times over the sport's history, which distorts certain statistics at face value. Titles like most points of all time and most points per race are both attributed to Lewis Hamilton despite more race wins by Michael Schumacher. A quick look at the top 10 drivers in this category reveals that the list is determined primarily by the era in which a driver competed. While the data set already has the points awarded in the race statistics, I would like to look at performances from different eras more objectively by rescoring them under each of the points systems, seeing whether any championship results change, and checking whether the data supports one points system over another. These files also contain information about mechanical failures and crashes, which can be used to adjust the real-world statistics for DNFs that are not the fault of the driver and so give a more accurate representation of driver talent.
This data set is very convenient, and I have found others to reference for additional data, although they may require more formatting to use effectively. There are many valuable questions that this data may help to answer, such as the relative efficacy of different points systems, whether or not car performance is overshadowing driver performance, and what tracks make for unpredictable race results.
Here I'll import the rest of the csvs
circuits_df = pd.read_csv("./data/circuits.csv", encoding = "ISO-8859-1")
circuits_df = circuits_df[circuits_df.columns[0:-2]] #removing the URL column and altitude since only 1 track has info
constructor_results_df = pd.read_csv("./data/constructorResults.csv", encoding = "ISO-8859-1")
constructor_results_df = constructor_results_df[constructor_results_df.columns[0:-1]] #removing status column as it is largely useless
constructors_df = pd.read_csv("./data/constructors.csv", encoding = "ISO-8859-1")
constructors_df = constructors_df[constructors_df.columns[0:-2]] #removing useless column and URL
#removing position text column (it is redundant in this context) and another empty column
constructor_standings_df = pd.read_csv("./data/constructorStandings.csv", encoding = "ISO-8859-1")[["constructorStandingsId", "raceId", "constructorId", "points", "position", "wins"]]
driver_standings_df = pd.read_csv("./data/driverStandings.csv", encoding = "ISO-8859-1")[["driverStandingsId", "raceId", "driverId", "points", "position", "wins"]]#removing position text column (it is redundant in this context)
laptimes_df = pd.read_csv("./data/lapTimes.csv", encoding = "ISO-8859-1")
pitstops_df = pd.read_csv("./data/pitStops.csv", encoding = "ISO-8859-1")
qualifying_df = pd.read_csv("./data/qualifying.csv", encoding = "ISO-8859-1")
races_df = pd.read_csv("./data/races.csv", encoding = "ISO-8859-1", parse_dates=["date"])
races_df = races_df[races_df.columns[0:-1]] #removing URL
results_df = pd.read_csv("./data/results.csv", encoding = "ISO-8859-1") #although this df is wide, it is in fact tidy
#I didn't import the seasons csv because all it contains is the seasons and their Wikipedia pages
status_df = pd.read_csv("./data/status.csv", encoding = "ISO-8859-1")
Let's clean up these dataframes a bit and check the datatypes. I'll explain each change in a comment before the operation.
#adding a full name category for more legible displays
drivers_df["Full Name"] = drivers_df["forename"] + ' ' + drivers_df["surname"]
#adding a unit variable to indicate race participation (this is useful for aggregating)
results_df["raced"] = 1
'''This block contains checks for setting data types
all the datatypes are correctly set (numbers, strings as objects, and datetime objects)
print(drivers_df.dtypes)
print(circuits_df.dtypes)
print(constructor_results_df.dtypes) #points are given as a float rather than int to accommodate for NaN values
print(constructors_df.dtypes)
print(constructor_standings_df.dtypes)
print(driver_standings_df.dtypes)
print(laptimes_df.dtypes)
print(pitstops_df.dtypes)
print(qualifying_df.dtypes)
print(races_df.dtypes)
print(results_df.dtypes)
print(status_df.dtypes)
''';
#since no driver Full Name appears more than once, it is also appropriate to use as a primary key if need be
drivers_df["Full Name"].value_counts().max()
These dataframes form a relational database and are connected by numerical IDs. The information for drivers, constructors, and circuits is stored in their respective dataframes. The results for drivers and constructors are held in the results dataframes, and the standings after each race are held in their own dataframes. The status dataframe maps each status ID to a description of a driver's finishing status for a race. The rest of the dataframes contain information about specific races, laps, qualifying, and pitstops.
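As a minimal sketch of how these ID columns tie the tables together, the toy frames below (made-up rows, not taken from the data set) join results to drivers on driverId. The `validate` argument asks pandas to confirm that each result matches at most one driver, which is a cheap way to check that a column really behaves like a primary key.

```python
import pandas as pd

# toy stand-ins for drivers_df and results_df (illustrative rows only)
toy_drivers = pd.DataFrame({"driverId": [1, 2], "surname": ["Alpha", "Beta"]})
toy_results = pd.DataFrame({"resultId": [10, 11, 12],
                            "driverId": [1, 2, 1],
                            "position": [1, 2, 3]})

# many results map to one driver; validate raises a MergeError
# if driverId is duplicated in toy_drivers
toy_joined = toy_results.merge(toy_drivers, on="driverId", how="left",
                               validate="many_to_one")
print(toy_joined[["resultId", "surname", "position"]])
```

The same pattern should apply to the constructorId, raceId, and circuitId columns.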
Let's start by looking at the drivers with the most wins in F1 History
race_wins = results_df[results_df["position"] == 1] #filtering for race wins
win_counts = race_wins.groupby(race_wins["driverId"]).raced.sum() #counting race wins for each driver
win_counts_with_name = drivers_df.merge(win_counts, on="driverId") #merging with driver_df to see names
high_counts = win_counts_with_name[win_counts_with_name["raced"] > win_counts_with_name["raced"].quantile(.75)] #filtering to drivers who are in the top 25% of winners
high_counts.set_index("Full Name").raced.plot.bar() #plotting wins across drivers
plt.title("Career wins for each driver (in top 25%)");
While modern sensation Lewis Hamilton is certainly a standout on this graph, he is quite far behind Michael Schumacher in terms of sheer wins (it is of note that this data ends in 2017 and Hamilton has won two championships since then). Let's look at how many races drivers have competed in as well to account for volume.
races_series = results_df.groupby(results_df["driverId"]).raced.sum() #finding number of races for each driverID
drivers_tally = high_counts.merge(races_series, on="driverId", suffixes=["_win", "_start"]) #merging race starts to drivers
drivers_tally.set_index("Full Name").raced_start.plot.bar() #plotting race starts
plt.title("Race starts for each driver (in top 25%)");
This plot looks significantly different, but Michael Schumacher is still at the top, which shows that his high volume of starts may have something to do with being the winningest driver in F1 history. Let's take a look at winning percentages to see if he was as consistent as Hamilton.
drivers_tally["win_percentage"] = drivers_tally["raced_win"] / drivers_tally["raced_start"]
drivers_tally.set_index("Full Name").win_percentage.plot.bar() #plotting win percentages
plt.title("Race win percentage for each driver (in top 25%)");
Looking at this graph of win percentages shows that Hamilton and Schumacher are neck and neck in terms of percentage. The drivers that stand out are Jim Clark, Juan Manuel Fangio, and Alberto Ascari, who all had incredibly short but prolific careers.
Of course titles like winningest driver are important, but what solidified Hamilton and Schumacher in F1 history is their multiple world titles. Schumacher is a 7 time world champion and Hamilton is a 6 time world champion (his 2 most recent championship winning seasons are not included in this data set).
The champion is determined by whoever scores the most points over the course of a season. Since Schumacher has more world titles and more wins, we'd expect him to have more points. Let's verify this.
Working with a driver's points is complicated because the points are contained in driver_standings_df, which holds a running total after every race rather than one total per season, so many points are repeated. To find a driver's actual points tally, we need to identify the standings from the last race of every season and sum those rather than all of the standings for each driver.
races_in_year = races_df.groupby('year')["round"].max()
Here we've actually stumbled onto another interesting piece of data. We have the number of races from each year, so let's take a quick aside to graph this.
races_in_year.plot.line();
This graph demonstrates the steady increase of Formula 1 races each year. It is predicted that by 2020 we will have a record 24 race calendar.
Now back to finding the points tallies.
#inner joining the races_df with races_in_year
season_enders = races_df.merge(races_in_year, on=["round", "year"])
#now we have the raceIds for every final race of the year
#this will allow us to filter the standings to only finishing results
#inner joining driver_standings_df with season_enders to yield final season tallies for every driver
drivers_season_standings = driver_standings_df.merge(season_enders, on=['raceId'])
drivers_season_standings.head()
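To see why this filtering matters, here is a small sketch on made-up numbers (the toy_races and toy_standings frames are hypothetical, not rows from the data set). Standings are cumulative within a season, so summing every row counts early-season points more than once.

```python
import pandas as pd

# one driver, two seasons of two races each; points are running totals
toy_races = pd.DataFrame({"raceId": [1, 2, 3, 4],
                          "year": [2000, 2000, 2001, 2001],
                          "round": [1, 2, 1, 2]})
toy_standings = pd.DataFrame({"raceId": [1, 2, 3, 4],
                              "driverId": [7, 7, 7, 7],
                              "points": [10, 16, 6, 16]})

# naive approach: sums the running totals, so round 1 points are double counted
naive_total = toy_standings["points"].sum()

# keep only the final round of each year, then sum
last_round = toy_races.groupby("year")["round"].transform("max")
final_races = toy_races[toy_races["round"] == last_round]
career_total = toy_standings.merge(final_races, on="raceId")["points"].sum()

print(naive_total, career_total)  # 48 32
```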
The new drivers_season_standings dataframe actually allows us to perform many interesting analyses that we will get to later, but for now let's tally up career points for each driver.
#grouping season standings by driver Id and summing points to get a total tally of points for each driver ID
lifetime_points = drivers_season_standings.groupby('driverId')["points"].sum()
#inner joining with high_counts to get driver names
#this join can be performed with all of the drivers but there are too many drivers to graph legibly
lifetime_points_df = high_counts.merge(lifetime_points, on="driverId")
#plotting lifetime points for the drivers with the most wins
ax = lifetime_points_df.plot.bar(x="Full Name", y="points");
plt.title("Lifetime points for drivers with most wins:");
At first this graph seems jarring, since Michael Schumacher isn't even in the top 3 points scorers. Any Formula 1 fan would tell you that the title of "driver with most points" is somewhat irrelevant since Formula 1 has changed its points scheme throughout the years. This Wikipedia article details the different systems if you are curious.
Of course this is due to the amount of available points increasing over the years, so let's try to standardize earned points over time.
#finding the average number of earned points by each driver each season
#again this is an imperfect process because older seasons featured many one-off drivers
#while nowadays drivers tend to be constant over the course of a season
#so it helps to weed out drivers with very few starts (here, fewer than 10 career races)
#counting the number of races for each driver
all_tallies = pd.DataFrame(results_df.groupby("driverId")['raced'].sum())
#weeding out drivers
all_tallies = all_tallies[all_tallies['raced'] >= 10]
drivers_season_standings = drivers_season_standings.merge(all_tallies, on="driverId")
ppy = drivers_season_standings.groupby("year")["points"].mean()
plt.plot(ppy)
plt.title("Average number of points scored per driver each season");
To normalize for different amounts of points we will divide scored points each season by these averages. The number increases steadily as more races are added to the calendar and then increases dramatically when the most recent scoring system was introduced.
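As a quick illustration of the idea on invented numbers (the toy frame below is hypothetical), dividing by the season mean puts a 9-point season from a low-scoring era and a 300-point season from a high-scoring era on the same scale.

```python
import pandas as pd

# two seasons with very different amounts of points on offer
toy = pd.DataFrame({"year": [1990, 1990, 2015, 2015],
                    "driverId": [1, 2, 3, 4],
                    "points": [9.0, 3.0, 300.0, 100.0]})

# transform("mean") broadcasts each season's average back onto its rows
season_mean = toy.groupby("year")["points"].transform("mean")
toy["norm_points"] = toy["points"] / season_mean

print(toy)
```

Drivers 1 and 3 both end up with a normalized score of 1.5, meaning each scored one and a half times the average of their season.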
#merging averages into dataframe
drivers_season_standings = drivers_season_standings.merge(ppy, on="year", suffixes =['_scored', '_average'])
#computing score normalized to year
drivers_season_standings["norm_points"] = drivers_season_standings["points_scored"] / drivers_season_standings["points_average"]
#grouping by driverID to find career totals
lifetime_points = drivers_season_standings.groupby('driverId')["norm_points"].sum()
#inner joining with high_counts to get driver names
#this join can be performed with all of the drivers but there are too many drivers to graph legibly
lifetime_points_df = high_counts.merge(lifetime_points, on="driverId")
#plotting normalized lifetime points for the drivers with the most wins
ax = lifetime_points_df.plot.bar(x="Full Name", y="norm_points");
plt.title("Standardized points for drivers with most wins:");
Now that points are normalized, Michael Schumacher is again at the top of the graph. Although this helps compare drivers across eras, this doesn't tell us anything about which points systems are better.
To new motorsport fans the points system may seem somewhat arbitrary. Obviously the winner of a race is whoever crosses the finish line first, but the winner of a season of racing is not so straightforward. In head-to-head sports like football it is easy to derive points from a win-loss record, but motorsport does not have such luxuries. The points system used in Formula 1 has been the subject of many debates, since there is no head-to-head playoff structure and designing a proper ranking system involves making subjective decisions.
Let's look at all of the race results under each of the previous points systems (and some others) to see if any championship results would have changed, and determine if those changes are for the better as well as finding out what traits are rewarded most under each system.
The first step is to create series to represent the scoring under each different system so that we can map finishing place to their respective points.
#creating dictionaries with position as key and points as value
points2010_18 = pd.Series([25,18,15,12,10,8,6,4,2,1], [1,2,3,4,5,6,7,8,9,10]).to_dict()
points2003_09 = pd.Series([10,8,6,5,4,3,2,1], [1,2,3,4,5,6,7,8]).to_dict()
points1991_2002 = pd.Series([10,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1961_90 = pd.Series([9,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1960 = pd.Series([8,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1950_59 = pd.Series([8,6,4,3,2], [1,2,3,4,5]).to_dict()
pointsReverse = pd.Series(list(range(1,21))[::-1], list(range(1,21))).to_dict()
#this last one is a proposed (and largely contested) points system where all drivers earn points opposite their finishing position
pointsExp = pd.Series(list(map(lambda x: 2 ** x, list(range(0,20))[::-1])), list(range(1,21))).to_dict()
#this is a points system that gives all drivers points, but on an exponential scale instead of linear
#in essence it makes each place finish worth one half of the place above it
#this system works well if you'd like to think of a first place finish as twice a second place finish
#but it makes lower place finishes negligible beyond being tiebreakers
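A short worked example may make the exponential system more concrete. The dictionary below is rebuilt independently under the assumption that it matches pointsExp (place p earns 2 ** (20 - p) points), so treat it as a sketch rather than part of the pipeline.

```python
# place p earns 2 ** (20 - p): 1st gets 2**19, 20th gets 2**0
exp_points = {place: 2 ** (20 - place) for place in range(1, 21)}

# a win plus a last place versus two second places
win_and_last = exp_points[1] + exp_points[20]
two_seconds = 2 * exp_points[2]

print(win_and_last, two_seconds)  # 524289 524288
```

Two second places are worth exactly one win, so the single point for 20th place decides the comparison. That is the sense in which lower finishes act only as tiebreakers.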
Before simply mapping the points onto the results, we need to discuss the difference between the position, positionText, and positionOrder columns. position is the numerical finishing position, given only if the driver was classified (which in Formula 1 means they completed at least 90% of the winner's laps); otherwise it is NaN. positionText holds the same information, but the NaNs for DNFs are replaced with R for retired or D for disqualified. positionOrder contains an integer for every driver regardless of whether or not they were classified. Since the position column already follows the rules for F1 scoring, it is the best choice for retabulating points.
#setting points columns as their respective values
points_comparison_df = results_df.copy()
points_comparison_df["points2010_18"] = points_comparison_df["position"].map(points2010_18)
points_comparison_df["points2003_09"] = points_comparison_df["position"].map(points2003_09)
points_comparison_df["points1991_2002"] = points_comparison_df["position"].map(points1991_2002)
points_comparison_df["points1961_90"] = points_comparison_df["position"].map(points1961_90)
points_comparison_df["points1960"] = points_comparison_df["position"].map(points1960)
points_comparison_df["points1950_59"] = points_comparison_df["position"].map(points1950_59)
points_comparison_df["pointsReverse"] = points_comparison_df["position"].map(pointsReverse)
points_comparison_df["pointsExp"] = points_comparison_df["position"].map(pointsExp)
#a list of all points systems
points_systems = list(points_comparison_df.columns[-8:])
points_comparison_df.head()
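To see why the choice of column matters, here is a small sketch with invented rows (toy_results and toy_points are hypothetical). The third driver retired, so position is NaN, but positionOrder still lists them third.

```python
import numpy as np
import pandas as pd

toy_results = pd.DataFrame({"position": [1.0, 2.0, np.nan, 7.0],
                            "positionOrder": [1, 2, 3, 7]})
toy_points = {1: 10, 2: 6, 3: 4, 4: 3, 5: 2, 6: 1}

# places missing from the dictionary (and NaN) map to NaN, filled with 0 here
by_position = toy_results["position"].map(toy_points).fillna(0)
by_order = toy_results["positionOrder"].map(toy_points).fillna(0)

print(by_position.tolist())  # [10.0, 6.0, 0.0, 0.0]
print(by_order.tolist())     # [10.0, 6.0, 4.0, 0.0]
```

Mapping positionOrder would hand 4 points to a driver who was never classified, which is why position is used above.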
Now we have the points obtained from each race, but we need to sum the points across seasons. To do this we will need to incorporate the year from the races_df.
#dropping unnecessary columns before merge since this dataframe is getting wide and it will only get wider
new_columns = ['resultId', 'raceId', 'driverId', 'constructorId', 'points', 'points2010_18', 'points2003_09', 'points1991_2002', 'points1961_90', 'points1960', 'points1950_59', 'pointsReverse', 'pointsExp']
points_comparison_df = points_comparison_df[new_columns]
#inner joining results with races
points_comparison_with_year = points_comparison_df.merge(races_df, on="raceId", suffixes=["_result","_race"])
#setting year
points_comparison_with_year["year"] = pd.DatetimeIndex(points_comparison_with_year['date']).year
#removing unnecessary columns
points_comparison_with_year = points_comparison_with_year[points_comparison_with_year.columns[0:-5]]
#grouping by driverId and year and summing to obtain total points for each year
drivers_championship_comparison = points_comparison_with_year.groupby(['year', 'driverId']).sum()
#resetting index to get year as column
drivers_championship_comparison.reset_index(inplace=True)
#this will make it easier to work with driverIds and then display names instead
driverIdMap = drivers_df.set_index("driverId")["Full Name"].to_dict()
#given a year and a points system to use, this function will compute the winner of that season
def champion_of_season(year, column):
    #filtering to results from the given year
    championship_year = drivers_championship_comparison[drivers_championship_comparison["year"] == year]
    #finding the winning points tally
    winning_tally = championship_year[column].max()
    #finding the driverId of the driver who scored the winning tally
    driverId = championship_year[championship_year[column] == winning_tally]["driverId"].values[0]
    return driverId
#creating an empty dataframe to store would-be champion data
champions = pd.DataFrame(columns=["champ_real","champ_10_18","champ_03_09","champ_91_02","champ_61_90","champ_60","champ_50_59", "champ_reverse", "champ_exp"])
#iterating through each year of the championship
for i in drivers_championship_comparison["year"].unique():
    #each line sets the current year's champion
    champions.loc[i, 'champ_real'] = champion_of_season(i,"points")
    champions.loc[i, 'champ_10_18'] = champion_of_season(i,"points2010_18")
    champions.loc[i, 'champ_03_09'] = champion_of_season(i,"points2003_09")
    champions.loc[i, 'champ_91_02'] = champion_of_season(i,"points1991_2002")
    champions.loc[i, 'champ_61_90'] = champion_of_season(i,"points1961_90")
    champions.loc[i, 'champ_60'] = champion_of_season(i,"points1960")
    champions.loc[i, 'champ_50_59'] = champion_of_season(i,"points1950_59")
    champions.loc[i, 'champ_reverse'] = champion_of_season(i,"pointsReverse")
    champions.loc[i, 'champ_exp'] = champion_of_season(i,"pointsExp")
print("Here are the champions under each system for each year:")
champions.head()
We can also compute the entire championship standings for each season, which we will do in the next cell.
#grouping dataframes of each season
seasons = drivers_championship_comparison.groupby("year")
possible_finishes = pd.DataFrame(columns=list(races_in_year.index)[:-1])
for key, item in seasons:
    #creating dataframe to store places of each driver under each system
    place_finishes = pd.DataFrame(columns=points_systems)
    #for each system
    for system in points_systems:
        #ordering drivers by their championship order (most points first)
        driver_order = item.set_index('driverId')[system].sort_values().iloc[::-1]
        #resetting index moves driverId into a column
        #and makes the new index represent championship position
        driver_order = driver_order.reset_index()
        #for each driver, saving their position to place_finishes
        for index, row in driver_order.iterrows():
            place_finishes.loc[index+1, system] = row['driverId']
    #making places columns and systems rows
    place_finishes = place_finishes.T
    #for each place, save the number of possible drivers to possible_finishes
    for place in list(place_finishes.columns):
        possible_finishes.loc[place, key] = len(place_finishes[place].unique())
#filtering to top 22 spots since there have never been more than 22 drivers to compete throughout a season
possible_finishes = possible_finishes[possible_finishes.index <= 22]
possible_finishes.head()
ax = sns.heatmap(possible_finishes.fillna(0).T)
ax.invert_xaxis()
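The quantity plotted in the heatmap is the number of distinct drivers who land in a given rank across the scoring systems. As a hedged sketch on made-up values (toy_places is hypothetical, with ranks as rows, systems as columns, and driverIds as cell values), the same count can be obtained with `nunique`:

```python
import pandas as pd

# rows are championship ranks, columns are scoring systems, values are driverIds
toy_places = pd.DataFrame({"sysA": [1, 2, 3],
                           "sysB": [1, 3, 2],
                           "sysC": [2, 1, 3]},
                          index=[1, 2, 3])

# number of different drivers that could occupy each rank
distinct_per_rank = toy_places.nunique(axis=1)

print(distinct_per_rank.tolist())  # [2, 3, 2]
```

A value of 1 means every system agrees on who finished in that rank; larger values mean the rank depends on the scoring rules.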
This heatmap demonstrates the uncertainty of championship results under different systems, sometimes having as many as 6 drivers ending up in a specific rank, and 3 different potential championship winners. In fact, there are a number of drivers who never won a championship that would have if they raced under a different scoring system. Let's look at those drivers.
#getting a list of the actual champions
real_winners = champions["champ_real"].unique()
#getting a list of the champions from other systems
#this is done by creating a set, which is a key-only dictionary, to avoid duplicates and then converting it to a list
fake_winners = list(set().union(champions["champ_10_18"].unique(),
champions["champ_03_09"].unique(),
champions["champ_91_02"].unique(),
champions["champ_61_90"].unique(),
champions["champ_60"].unique(),
champions["champ_50_59"].unique(),
champions["champ_reverse"].unique(),
))
#empty list for drivers who would have won a championship in a different system
new_winners = []
print("The following is the list of all actual Formula 1 world champions from 1950 to 2017:")
for i in real_winners:
    print(driverIdMap[i])
for i in fake_winners:
    if(i not in real_winners):
        new_winners.append(i)
print("The following is the list of drivers who could have won a championship if F1 used a different points system at the time:")
for i in new_winners:
    print(driverIdMap[i])
So now we've established a list of drivers who would have won a championship if they used a different points system. Let's compare this to the drivers mentioned in the top 3 Google search results for "formula 1 drivers who should've won a championship". Here are the 3 top results for reference:
Ranking the Best Formula 1 Drivers to Never Win the World Title
The 6 best F1 drivers never to win a championship
Top 10 Formula One drivers who have never won a World Championship
The drivers on these lists are:
Let's save this list in our code:
#hard coding search results
internet_drivers = ['Didier Pironi','Jacky Ickx','Tony Brooks','Clay Regazzoni','Wolfgang von Trips','Stirling Moss','Mark Webber','Gerhard Berger','Rubens Barrichello','Felipe Massa','Gilles Villeneuve','Carlos Reutemann','Ronnie Peterson','David Coulthard']
#new list for names of new winners
new_winner_names = new_winners.copy()
#gathering names from new_winners
for i in range(len(new_winner_names)):
    new_winner_names[i] = driverIdMap[new_winner_names[i]]
#finding drivers in both lists
drivers_intersection = [value for value in new_winner_names if value in internet_drivers]
print("Out of the " + str(len(new_winners)) + " drivers that could have won championships, " +
      str(len(drivers_intersection)) + " of them were listed in the aforementioned search results.")
print("Here are those drivers:")
print(drivers_intersection)
Now that we've identified some key changes under different systems and seen that some of these drivers also appear on fan-written lists of drivers who deserved a championship, let's research the seasons they would have won and find out what weaknesses in the scoring systems led to their defeat.
Looking into the seasons these drivers almost won reveals that many of them are contentious.
Stirling Moss won 4 of the 11 races in the 1958 season, while Mike Hawthorn only won 1, but his slew of 2nd place finishes kept him in the lead of the championship. This is perhaps the clearest case of a poorly decided championship, but many would argue that Moss's 5 DNFs are good reason to settle for 2nd place.
Jacky Ickx could have surpassed Jochen Rindt, who died on track halfway through the 1970 season and did not compete in the last 4 races, but held on to the title thanks to 5 race wins.
In 1974, Clay Regazzoni finished only 3 points behind Emerson Fittipaldi. Although Fittipaldi had 3 race wins to Regazzoni's 1, the latter had a higher average finishing position.
Felipe Massa lost the 2008 championship by just 1 point in spectacular fashion in the last race of the season and many fans believe he should have been crowned champion instead of Hamilton.
Now that we've seen the effects of different points systems, it is time to evaluate them. Since there is no objective measure of a system's correctness, we will instead attempt to correlate each system with different features of a driver's season, including average finishing place, race wins, and podiums (races where they finished in the top 3).
#grouping results info by driver and year
driver_seasons = results_df.merge(races_df[["raceId", "year"]], on='raceId').groupby(["year", "driverId"])
drivers_season_standings["average_position"] = 0.0
drivers_season_standings["podiums"] = 0
#iterating through groups to add additional stats to drivers_season_standings
for i, df in driver_seasons:
    #finding index of driver's season in drivers_season_standings
    seasons_of_driver = drivers_season_standings[drivers_season_standings['driverId'] == i[1]]
    dss_index = seasons_of_driver[seasons_of_driver['year'] == i[0]].index
    #computing podiums scored
    num_podiums = df[df["position"] <= 3].shape[0]
    #saving podiums to drivers_season_standings (dss_index holds index labels, so .loc is used)
    drivers_season_standings.loc[dss_index, 'podiums'] = num_podiums
    #computing average finishing position
    competed_races = df.shape[0]
    sum_finishes = df['positionOrder'].sum()
    missed_races = races_in_year[i[0]] - competed_races
    #missed races are filled with a 12th place finish
    #this number is arbitrary but chosen because 12th place is outside of points in the real systems
    #and about halfway down the grid in most races
    fill_finishes = missed_races * 12
    average_finish = (sum_finishes + fill_finishes) / races_in_year[i[0]]
    drivers_season_standings.loc[dss_index, 'average_position'] = average_finish
championship_vs_results = drivers_season_standings[['driverId', 'year', 'wins', 'average_position', 'podiums', 'position']].copy()
championship_vs_results.head()
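The loop above can also be expressed as a grouped aggregation. The sketch below works on invented rows, and the season length (races_in_toy_year) and the FILL_PLACE constant are assumptions that mirror the 12th place fill used above; it is meant as an illustration, not a drop-in replacement.

```python
import numpy as np
import pandas as pd

# driver 1 starts both races (one DNF), driver 2 starts only one
toy = pd.DataFrame({"year": [2000, 2000, 2000],
                    "driverId": [1, 1, 2],
                    "position": [1.0, np.nan, 3.0],
                    "positionOrder": [1, 15, 3]})
races_in_toy_year = 2   # assumed season length for this toy example
FILL_PLACE = 12         # finishing place assigned to each missed race

summary = toy.groupby(["year", "driverId"]).agg(
    starts=("positionOrder", "size"),
    finish_sum=("positionOrder", "sum"),
    # NaN positions compare as False, so DNFs never count as podiums
    podiums=("position", lambda s: int((s <= 3).sum())),
)
missed = races_in_toy_year - summary["starts"]
summary["average_position"] = (
    summary["finish_sum"] + missed * FILL_PLACE
) / races_in_toy_year

print(summary)
```

Driver 1 averages (1 + 15) / 2 = 8.0, while driver 2 averages (3 + 12) / 2 = 7.5 because the missed race is filled with 12th place.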
Now we have a dataframe with the important features of a driver's season and their finishing place in the actual championship. To compare the different systems, we need to add their finishing position in the systems we used.
#creating new columns for each scoring system
for system in points_systems:
    col_name = 'position' + system[6:]
    championship_vs_results[col_name] = 0
for key, item in seasons:
    #for each system
    for system in points_systems:
        col_name = 'position' + system[6:]
        #ordering drivers by their championship order (most points first)
        driver_order = item.set_index('driverId')[system].sort_values().iloc[::-1]
        #resetting index moves driverId into a column
        #and makes the new index represent championship position
        driver_order = driver_order.reset_index()
        #for each driver, saving their position to championship_vs_results
        for index, row in driver_order.iterrows():
            driver_idx = championship_vs_results[championship_vs_results['driverId'] == row['driverId']]
            driver_year_idx = driver_idx[driver_idx['year'] == key].index
            championship_vs_results.loc[driver_year_idx, col_name] = index+1
championship_vs_results.head()
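For reference, championship positions like these can also be computed without explicit loops using a grouped rank. This is a sketch on made-up totals (the toy frame is hypothetical), and note one difference: `method="first"` breaks ties by row order, which may not match the sort-based tie handling in the loop above.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2000, 2000, 2000, 2001, 2001],
                    "driverId": [1, 2, 3, 1, 2],
                    "points2003_09": [18, 26, 5, 10, 8]})

# rank within each season, most points first
toy["position2003_09"] = (toy.groupby("year")["points2003_09"]
                             .rank(method="first", ascending=False)
                             .astype(int))

print(toy["position2003_09"].tolist())  # [2, 1, 3, 1, 2]
```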
Now that we have all of the championship positions, we can look at their correlations to wins, average position, and podiums. It is of note that since average position and championship position are rankings (with 1 being the best), they are expected to correlate negatively with wins and podiums, which are tallies rather than rankings.
feature_vs_position = championship_vs_results[championship_vs_results.columns[2:]].corr()
feature_vs_position
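The sign convention can be checked on a tiny invented table (the toy values below are hypothetical). The sketch also shows `method="spearman"`, which compares rank orders rather than raw values and may be worth considering since championship position is ordinal; the analysis here uses the default Pearson correlation.

```python
import pandas as pd

# better championship positions (smaller numbers) go with more wins and podiums
toy = pd.DataFrame({"wins": [5, 3, 1, 0, 0],
                    "podiums": [9, 8, 4, 1, 0],
                    "position": [1, 2, 3, 4, 5]})

pearson = toy.corr()                    # default method
spearman = toy.corr(method="spearman")  # rank-based alternative

print(pearson.loc["wins", "position"])
print(spearman.loc["podiums", "position"])
```

Both values are negative: a numerically lower (better) position goes hand in hand with higher tallies.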
These correlation results are astoundingly similar across different points systems. All of the systems, with the exception of reverse linear (positionReverse), have nearly the same correlation with wins and podiums (to within about 0.01). The only observable trend is that the points systems used by Formula 1 have evolved to favor average_position, but again only by a slight margin, and this may also be due to drivers competing more consistently in races as time goes on, which changes our calculation of average finishing position. All of the systems correlate more strongly with average position and the number of podiums than with the number of wins, suggesting that all of the systems reward consistency.
Another interesting comparison that this correlation DataFrame allows us to make is how closely each system tracks the actual championship position. Surprisingly, positionReverse and positionExp, the two most drastically different scoring systems, had the highest correlation with actual championship position.
While these results are inconclusive, they show that Formula 1's scoring systems have shared the same fundamental priorities throughout, despite the different championship results they produce.