Mason Riley O'Connor

CMPS3660 Tutorial Milestone 1

This data set from kaggle.com is a set of csv files with Formula 1 race data from the past 70 years. Although it does not include the two most recent seasons, it is very thorough and has very few missing values. All tables have appropriate primary keys as well as foreign keys. In addition to the data set being large, thorough, and in an accessible format, I am also very interested in the subject matter and will be very engaged with the research and data. Formula 1 is known as a marvel of engineering and data optimization. Much of the data used by teams is kept private but these publicly known results are important nonetheless. This data set will allow me to analyze many relationships and make assertions about a subject matter I am familiar with.

Importing Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib import cm
drivers_df = pd.read_csv("./data/drivers.csv", encoding = "ISO-8859-1", parse_dates=["dob"])
drivers_df = drivers_df[drivers_df.columns[0:-1]]
drivers_df.head()
Out[1]:
driverId driverRef number code forename surname dob nationality
0 1 hamilton 44.0 HAM Lewis Hamilton 1985-07-01 British
1 2 heidfeld NaN HEI Nick Heidfeld 1977-10-05 German
2 3 rosberg 6.0 ROS Nico Rosberg 1985-06-27 German
3 4 alonso 14.0 ALO Fernando Alonso 1981-07-29 Spanish
4 5 kovalainen NaN KOV Heikki Kovalainen 1981-10-19 Finnish

I am particularly interested in analyzing the relationships that influence driver success in relation to their team situation and the situations of individual races. For teams, I am interested to see if there is a correlation between driver nationality and team nationality as well as the age of a team versus driver experience. For individual races there are many specific statistics that are not often brought up, and commentators and analysts tend to focus on the top drivers without talking about all of the racers. I'd like to analyze how drivers perform at tracks in their country as well as the country of their team and look at more specific stats like time of race, time zone of the track, and track characteristics to speculate how drivers may perform at new tracks added to the race calendar since 3 new tracks are being added in 2020 alone.

The scoring system for Formula 1 has been changed multiple times since the 1990s, which distorts certain statistics when they are taken at face value. Titles like most points of all time and most points per race are both attributed to Lewis Hamilton despite more race wins by Michael Schumacher. A quick look at the top 10 drivers in this category reveals that the list is determined primarily by the era in which a driver competed. While the data set already has points awarded in the race statistics, I would like to look at performances from different eras more objectively by rescoring every season under each of the points systems, checking whether any championship results change, and seeing whether the data supports one points system over another. These files also contain information about mechanical failures and crashes, which can be used to adjust the real-world statistics for DNFs that are not the fault of the driver and give a more accurate representation of driver talent.

This data set is very convenient, and I have found others to reference for additional data, although they may require more formatting to use effectively. There are many valuable questions that this data may help to answer, such as the relative efficacy of different points systems, whether or not car performance is overshadowing driver performance, and what tracks make for unpredictable race results.

Milestone 2:

Here I'll import the rest of the csvs

In [2]:
circuits_df = pd.read_csv("./data/circuits.csv", encoding = "ISO-8859-1")
circuits_df = circuits_df[circuits_df.columns[0:-2]] #removing the URL column and altitude since only 1 track has info
constructor_results_df = pd.read_csv("./data/constructorResults.csv", encoding = "ISO-8859-1")
constructor_results_df = constructor_results_df[constructor_results_df.columns[0:-1]] #removing status column as it is largely useless
constructors_df = pd.read_csv("./data/constructors.csv", encoding = "ISO-8859-1")
constructors_df = constructors_df[constructors_df.columns[0:-2]] #removing useless column and URL
#removing position text column (it is redundant in this context) and another empty column
constructor_standings_df = pd.read_csv("./data/constructorStandings.csv", encoding = "ISO-8859-1")[["constructorStandingsId", "raceId", "constructorId", "points", "position", "wins"]]
driver_standings_df = pd.read_csv("./data/driverStandings.csv", encoding = "ISO-8859-1")[["driverStandingsId", "raceId", "driverId", "points", "position", "wins"]]#removing position text column (it is redundant in this context)
laptimes_df = pd.read_csv("./data/lapTimes.csv", encoding = "ISO-8859-1")
pitstops_df = pd.read_csv("./data/pitStops.csv", encoding = "ISO-8859-1")
qualifying_df = pd.read_csv("./data/qualifying.csv", encoding = "ISO-8859-1")
races_df = pd.read_csv("./data/races.csv", encoding = "ISO-8859-1", parse_dates=["date"])
races_df = races_df[races_df.columns[0:-1]] #removing URL 
results_df = pd.read_csv("./data/results.csv", encoding = "ISO-8859-1") #although this df is wide, it is in fact tidy
#I didn't import the seasons csv because all it contains is the seasons and their wikipedia pages
status_df = pd.read_csv("./data/status.csv", encoding = "ISO-8859-1")

Let's clean up these dataframes a bit and check the datatypes. I'll explain each change in a comment before the operation.

In [3]:
#adding a full name category for more legible displays
drivers_df["Full Name"] = drivers_df["forename"] + ' ' + drivers_df["surname"]
#adding a unit variable to indicate race participation (this is useful for aggregating)
results_df["raced"] = 1
'''This block contains checks for setting data types
all the datatypes are correctly set (numbers, strings as objects, and datetime objects)
print(drivers_df.dtypes)
print(circuits_df.dtypes)
print(constructor_results_df.dtypes) #points are given as a float rather than int to accommodate for NaN values
print(constructors_df.dtypes)
print(constructor_standings_df.dtypes)
print(driver_standings_df.dtypes)
print(laptimes_df.dtypes)
print(pitstops_df.dtypes)
print(qualifying_df.dtypes)
print(races_df.dtypes)
print(results_df.dtypes)
print(status_df.dtypes)
''';
In [4]:
#since no driver Full Name appears more than once, it is also appropriate to use as a primary key if need be
drivers_df["Full Name"].value_counts().max()
Out[4]:
1

Exploratory analysis

These dataframes form a relational database and are connected by numerical IDs. The information for drivers, constructors, and circuits is stored in their respective dataframes. The results for drivers and constructors are held in the results dataframes, and the standings after each race in their own dataframes. The status dataframe has only one column besides its ID, which connects each status ID to a result that represents a driver's finishing status for a race. The rest of the dataframes contain information about specific races, laps, qualifying, and pitstops.
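To make the ID-based layout concrete, here is a minimal sketch on toy tables. The values are made up, and the column names `surname`, `statusId`, and `status` are assumptions about the real files; the point is only that each merge follows one ID from the results table to a lookup table.

```python
import pandas as pd

# toy tables mimicking the ID-based layout; all values are made up
toy_drivers = pd.DataFrame({"driverId": [1, 2], "surname": ["Alpha", "Bravo"]})
toy_status = pd.DataFrame({"statusId": [1, 3], "status": ["Finished", "Accident"]})
toy_results = pd.DataFrame({
    "raceId": [10, 10],
    "driverId": [1, 2],
    "statusId": [1, 3],
})

# each merge follows one foreign key from results to a lookup table
readable = (
    toy_results
    .merge(toy_drivers, on="driverId", how="left")
    .merge(toy_status, on="statusId", how="left")
)
print(readable[["raceId", "surname", "status"]])
```

Using `how="left"` keeps every result row even if an ID has no match in the lookup table, which makes missing keys easy to spot as NaN.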

The Greats

Let's start by looking at the drivers with the most wins in F1 History

In [5]:
race_wins = results_df[results_df["position"] == 1] #filtering for race wins
win_counts = race_wins.groupby(race_wins["driverId"]).raced.sum() #counting race wins for each driver
win_counts_with_name = drivers_df.merge(win_counts, on="driverId") #merging with driver_df to see names
high_counts = win_counts_with_name[win_counts_with_name["raced"] > win_counts_with_name["raced"].quantile(.75)] #filtering to drivers who are in the top 25% of winners
high_counts.set_index("Full Name").raced.plot.bar() #plotting wins across drivers
plt.title("Career wins for each driver (in top 25%)");

While modern sensation Lewis Hamilton is certainly a standout on this graph, he is quite far behind Michael Schumacher in terms of sheer wins (it is of note that this data ends in 2017 and Hamilton has won two championships since then). Let's look at how many races drivers have competed in as well to account for volume.

In [6]:
races_series = results_df.groupby(results_df["driverId"]).raced.sum() #finding number of races for each driverID
drivers_tally = high_counts.merge(races_series, on="driverId", suffixes=["_win", "_start"]) #merging race starts to drivers 
drivers_tally.set_index("Full Name").raced_start.plot.bar() #plotting race starts
plt.title("Race starts for each driver (in top 25%)");

This plot looks significantly different, but Michael Schumacher is still at the top, which shows that his high volume of starts may have something to do with being the winningest driver in F1 history. Let's take a look at winning percentages to see if he was as consistent as Hamilton

In [7]:
drivers_tally["win_percentage"] = drivers_tally["raced_win"] / drivers_tally["raced_start"]
drivers_tally.set_index("Full Name").win_percentage.plot.bar() #plotting win percentages
plt.title("Race win percentage for each driver (with at least 10 wins)");

Looking at this graph of win percentages shows that Hamilton and Schumacher are neck and neck in terms of percentage. The drivers that stand out are Jim Clark, Juan Manuel Fangio, and Alberto Ascari, who all had incredibly short but prolific careers.
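As a small worked example of the win percentage calculation, here is a sketch on toy results (the numbers are made up). It counts starts and wins directly from the position column instead of using the `raced` helper column, which is just one possible way to do it.

```python
import pandas as pd

# toy results: one row per driver per race, values made up
toy_results = pd.DataFrame({
    "driverId": [1, 1, 1, 1, 2, 2],
    "position": [1, 2, 3, None, 1, 3],
})

# every row is a race entry, so the group size is the number of starts
starts = toy_results.groupby("driverId").size()
# a win is a row with a first-place finish
is_win = toy_results["position"] == 1
wins = is_win.groupby(toy_results["driverId"]).sum()

win_percentage = wins / starts
print(win_percentage)
```

Driver 2 has fewer wins in total than a long-career driver might, but a higher percentage, which is the same effect that lifts short careers in the graph above.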

Of course titles like winningest driver are important, but what solidified Hamilton and Schumacher in F1 history is their multiple world titles. Schumacher is a 7 time world champion and Hamilton is a 6 time world champion (his 2 most recent championship winning seasons are not included in this data set).

The champion is determined by whoever scores the most points over the course of a season. Since Schumacher has more world titles and more wins, we'd expect him to have more points. Let's verify this.

Getting to the Point(s)

Working with a driver's points is complicated since the points are contained in the driver_standings_df and it contains information from every race rather than every season, so many points are repeated. To find a driver's actual points tally, we need to identify the standings from the last race of every season and sum those rather than all of the standings for each driver.
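Here is the idea on toy data first (all values are made up). This sketch finds the last round of each season with `idxmax`, which is a slightly different route from the merge used in the cells that follow, and compares the correct tally with the naive sum over every race.

```python
import pandas as pd

# toy races table: two seasons with 3 and 2 rounds (values made up)
toy_races = pd.DataFrame({
    "raceId": [1, 2, 3, 4, 5],
    "year":   [2000, 2000, 2000, 2001, 2001],
    "round":  [1, 2, 3, 1, 2],
})
# cumulative standings after every race for a single driver
toy_standings = pd.DataFrame({
    "raceId":   [1, 2, 3, 4, 5],
    "driverId": [7, 7, 7, 7, 7],
    "points":   [10, 16, 26, 6, 16],
})

# rows of the last round in each season
last_rows = toy_races.loc[toy_races.groupby("year")["round"].idxmax()]
# keeping only the standings recorded after those final races
final_standings = toy_standings.merge(last_rows[["raceId", "year"]], on="raceId")

career_points = final_standings.groupby("driverId")["points"].sum()
naive_points = toy_standings.groupby("driverId")["points"].sum()
print(career_points[7], naive_points[7])  # prints: 42 74
```

The naive sum counts the early-season points again in every later race, which is why it overshoots.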

In [8]:
races_in_year = races_df.groupby('year')["round"].max()

Here we've actually stumbled onto another interesting piece of data. We have the number of races from each year, so let's take a quick aside to graph this.

In [9]:
races_in_year.plot.line();

This graph demonstrates the steady increase of Formula 1 races each year. It is predicted that by 2020 we will have a record 24 race calendar.

Now back to finding the points tallies.

In [10]:
#inner joining the races_df with races_in_year
season_enders = races_df.merge(races_in_year, on=["round", "year"])
#now we have the raceIds for every final race of the year
#this will allow us to filter the standings to only finishing results

#inner joining driver_standings_df with season_enders to yield final season tallies for every driver
drivers_season_standings = driver_standings_df.merge(season_enders, on=['raceId'])
drivers_season_standings.head()
Out[10]:
driverStandingsId raceId driverId points position wins year round circuitId name date time
0 355 35 1 98.0 1 5 2008 18 18 Brazilian Grand Prix 2008-11-02 17:00:00
1 356 35 2 60.0 6 0 2008 18 18 Brazilian Grand Prix 2008-11-02 17:00:00
2 357 35 3 17.0 13 0 2008 18 18 Brazilian Grand Prix 2008-11-02 17:00:00
3 358 35 4 61.0 5 2 2008 18 18 Brazilian Grand Prix 2008-11-02 17:00:00
4 359 35 5 53.0 7 1 2008 18 18 Brazilian Grand Prix 2008-11-02 17:00:00

The new drivers_season_standings dataframe actually allows us to perform many interesting analyses that we will get to later, but for now let's tally up career points for each driver.

In [11]:
#grouping season standings by driver Id and summing points to get a total tally of points for each driver ID
lifetime_points = drivers_season_standings.groupby('driverId')["points"].sum()
#inner joining with high_counts to get driver names
#this join can be performed with all of the drivers but there are too many drivers to graph legibly
lifetime_points_df = high_counts.merge(lifetime_points, on="driverId")
#plotting lifetime points for the filtered drivers
ax = lifetime_points_df.plot.bar(x="Full Name", y="points");
plt.title("Lifetime points for drivers with most wins:");

At first this graph seems jarring, since Michael Schumacher isn't even in the top 3 points scorers. Any Formula 1 fans would tell you that the title of "driver with most points" is somewhat irrelevant since Formula 1 has changed their points scheme throughout the years. This wikipedia article details the different systems if you are curious.

Of course this is due to the amount of available points increasing over the years, so let's try to standardize earned points over time.

In [12]:
#finding the average number of earned points by each driver each season
#again this is an imperfect process because older seasons featured many one-off drivers
#while nowadays drivers tend to be constant over the course of a season
#so it helps to weed out drivers that participated in fewer than a full season of races
#counting the number of races for each driver
all_tallies = pd.DataFrame(results_df.groupby("driverId")['raced'].sum())
#weeding out drivers
all_tallies = all_tallies[all_tallies['raced'] >= 10]
drivers_season_standings = drivers_season_standings.merge(all_tallies, on="driverId")
In [13]:
ppy = drivers_season_standings.groupby("year")["points"].mean()
plt.plot(ppy)
plt.title("Average number of points scored per driver each season");

To normalize for different amounts of points we will divide scored points each season by these averages. The number increases steadily as more races are added to the calendar and then increases dramatically when the most recent scoring system was introduced.
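A minimal sketch of this normalization on toy data (the numbers are made up). It uses `groupby(...).transform("mean")` to put the season average on every row, which is an alternative to merging the averages back in as the next cell does.

```python
import pandas as pd

# toy season-ending standings (made-up numbers) from a low- and a high-scoring era
toy = pd.DataFrame({
    "year":     [1960, 1960, 2015, 2015],
    "driverId": [1, 2, 3, 4],
    "points":   [30.0, 10.0, 300.0, 100.0],
})

# season average broadcast back onto each row
toy["points_average"] = toy.groupby("year")["points"].transform("mean")
# points relative to the average driver of that season
toy["norm_points"] = toy["points"] / toy["points_average"]
print(toy)
```

After dividing, the 30-point season in 1960 and the 300-point season in 2015 both come out as 1.5 times the season average, so they count equally toward a career total.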

In [14]:
#merging averages into dataframe
drivers_season_standings = drivers_season_standings.merge(ppy, on="year", suffixes =['_scored', '_average'])
#computing score normalized to year
drivers_season_standings["norm_points"] = drivers_season_standings["points_scored"] / drivers_season_standings["points_average"]
#grouping by driverID to find career totals
lifetime_points = drivers_season_standings.groupby('driverId')["norm_points"].sum()
#inner joining with high_counts to get driver names
#this join can be performed with all of the drivers but there are too many drivers to graph legibly
lifetime_points_df = high_counts.merge(lifetime_points, on="driverId")
#plotting standardized points for the filtered drivers
ax = lifetime_points_df.plot.bar(x="Full Name", y="norm_points");
plt.title("Standardized points for drivers with most wins:");

Now that points are normalized, Michael Schumacher is again at the top of the graph. Although this helps compare drivers across eras, this doesn't tell us anything about which points systems are better.

To new motorsport fans the points system may seem somewhat arbitrary. Obviously the winner of a race is whoever crosses the finish line first, but the winner of a season of racing is not so straightforward. In head-to-head sports like football it is easy to derive points from a win-loss record, but motorsport does not have such luxuries. The points system used in Formula 1 has been the subject of many debates, since there is no head-to-head playoff structure and designing a proper ranking system involves making subjective decisions.

Exploring the Hypothesis

Let's look at all of the race results under each of the previous points systems (and some others) to see if any championship results would have changed, and determine if those changes are for the better as well as finding out what traits are rewarded most under each system.

The first step is to create series to represent the scoring under each different system so that we can map finishing place to their respective points.

In [15]:
#creating dictionaries with position as key and points as value
points2010_18 = pd.Series([25,18,15,12,10,8,6,4,2,1], [1,2,3,4,5,6,7,8,9,10]).to_dict()
points2003_09 = pd.Series([10,8,6,5,4,3,2,1], [1,2,3,4,5,6,7,8]).to_dict()
points1991_2002 = pd.Series([10,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1961_90 = pd.Series([9,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1960 = pd.Series([8,6,4,3,2,1], [1,2,3,4,5,6]).to_dict()
points1950_59 = pd.Series([8,6,4,3,2], [1,2,3,4,5]).to_dict()
pointsReverse = pd.Series(list(range(1,21))[::-1], list(range(1,21))).to_dict() 
#this last one is a proposed (and largely contested) points system where all drivers earn points opposite their finishing position
pointsExp = pd.Series(list(map(lambda x: 2 ** x, list(range(0,20))[::-1])), list(range(1,21))).to_dict()
#this is a points system that gives all drivers points, but on an exponential scale instead of linear
#in essence it makes each place finish worth one half of the place above it
#this system works well if you'd like to think of a first place finish as twice a second place finish
#but it makes lower place finishes negligible beyond being tiebreakers
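To see why the exponential system pushes lower places toward tiebreaker status, here is a short check in plain Python. The dictionary is rebuilt with a comprehension instead of the Series construction above, but it should hold the same values.

```python
# same values as pointsExp: 1st place gets 2**19, 20th place gets 2**0
points_exp = {place: 2 ** (20 - place) for place in range(1, 21)}

# each place is worth exactly twice the place below it
assert all(points_exp[p] == 2 * points_exp[p + 1] for p in range(1, 20))

# one finish in place p outweighs one finish in every lower place combined
for p in range(1, 20):
    lower_total = sum(points_exp[q] for q in range(p + 1, 21))
    assert points_exp[p] == lower_total + 1

print(points_exp[1], points_exp[2], points_exp[20])
```

So a single win is worth more than finishing 2nd through 20th once each, although two 2nd places still add up to exactly one win.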

Before simply mapping the points onto the results, we need to discuss the difference between the position, positionText, and positionOrder columns. position is the numerical finishing position, given only if the driver was counted as classified (which in Formula 1 means they completed 90% of the laps). positionText holds the same information, but the NaNs for DNFs are replaced with R for retirement or D for disqualified. positionOrder contains only integers, giving every driver a place regardless of whether or not they were counted. Since the position column already follows the rules for F1 scoring, it is the best choice for retabulating points.
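Here is a small sketch of what mapping a points dictionary onto a position column does, using a toy Series with made-up positions. Positions that have no entry in the dictionary, including NaN for unclassified drivers, come back as NaN, and pandas skips NaN when summing.

```python
import pandas as pd

# the 1991-2002 table from the cell above
points_table = {1: 10, 2: 6, 3: 4, 4: 3, 5: 2, 6: 1}

# toy position column: NaN marks a driver who was not classified
positions = pd.Series([1.0, 7.0, float("nan"), 3.0])

mapped = positions.map(points_table)
print(mapped.tolist())            # 7th place and the NaN row both score nothing
print(mapped.sum())               # sum() skips NaN by default
print(mapped.fillna(0).tolist())  # explicit zeros, if NaN is inconvenient later
```

Because the season totals are computed with a sum, leaving the NaN values in place should give the same totals as filling them with 0.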

In [16]:
#setting points columns as their respective values
points_comparison_df = results_df.copy()
points_comparison_df["points2010_18"] = points_comparison_df["position"].map(points2010_18)
points_comparison_df["points2003_09"] = points_comparison_df["position"].map(points2003_09)
points_comparison_df["points1991_2002"] = points_comparison_df["position"].map(points1991_2002)
points_comparison_df["points1961_90"] = points_comparison_df["position"].map(points1961_90)
points_comparison_df["points1960"] = points_comparison_df["position"].map(points1960)
points_comparison_df["points1950_59"] = points_comparison_df["position"].map(points1950_59)
points_comparison_df["pointsReverse"] = points_comparison_df["position"].map(pointsReverse)
points_comparison_df["pointsExp"] = points_comparison_df["position"].map(pointsExp)
#a list of all points systems
points_systems = list(points_comparison_df.columns[-8:])

points_comparison_df.head()
Out[16]:
resultId raceId driverId constructorId number grid position positionText positionOrder points ... statusId raced points2010_18 points2003_09 points1991_2002 points1961_90 points1960 points1950_59 pointsReverse pointsExp
0 1 18 1 1 22.0 1 1.0 1 1 10.0 ... 1 1 25.0 10.0 10.0 9.0 8.0 8.0 20.0 524288.0
1 2 18 2 2 3.0 5 2.0 2 2 8.0 ... 1 1 18.0 8.0 6.0 6.0 6.0 6.0 19.0 262144.0
2 3 18 3 3 7.0 7 3.0 3 3 6.0 ... 1 1 15.0 6.0 4.0 4.0 4.0 4.0 18.0 131072.0
3 4 18 4 4 5.0 11 4.0 4 4 5.0 ... 1 1 12.0 5.0 3.0 3.0 3.0 3.0 17.0 65536.0
4 5 18 5 1 23.0 3 5.0 5 5 4.0 ... 1 1 10.0 4.0 2.0 2.0 2.0 2.0 16.0 32768.0

5 rows × 27 columns

Now we have the points obtained from each race, but we need to sum the points across seasons. To do this we will need to incorporate the year from the races_df.

In [17]:
#dropping unnecessary columns before merge since this dataframe is getting wide and it will only get wider
new_columns = ['resultId', 'raceId', 'driverId', 'constructorId', 'points', 'points2010_18', 'points2003_09', 'points1991_2002', 'points1961_90', 'points1960', 'points1950_59', 'pointsReverse', 'pointsExp']
points_comparison_df = points_comparison_df[new_columns]
#inner joining results with races
points_comparison_with_year = points_comparison_df.merge(races_df, on="raceId", suffixes=["_result","_race"])
#setting year
points_comparison_with_year["year"] =  pd.DatetimeIndex(points_comparison_with_year['date']).year
#removing unnecessary columns
points_comparison_with_year = points_comparison_with_year[points_comparison_with_year.columns[0:-5]]
In [18]:
#grouping by driverId and year and summing to obtain total points for each year
drivers_championship_comparison = points_comparison_with_year.groupby(['year', 'driverId']).sum()
#resetting index to get year as column
drivers_championship_comparison.reset_index(inplace=True)

#this will make it easier to work with driverIds and then display names instead
driverIdMap = drivers_df.set_index("driverId")["Full Name"].to_dict()

#given a year and a points system to use, this function will compute the winner of that season
def champion_of_season(year, column):
    #filtering to results from the given year
    championship_year = drivers_championship_comparison[drivers_championship_comparison["year"] == year]
    #finding the winning points tally
    winning_tally = championship_year[column].max()
    #finding the driverId of the driver who scored the winning tally
    driverId = championship_year[championship_year[column] == winning_tally]["driverId"].values[0]
    return driverId

#creating an empty dataframe to store would-be champion data
champions = pd.DataFrame(columns=["champ_real","champ_10_18","champ_03_09","champ_91_02","champ_61_90","champ_60","champ_50_59", "champ_reverse", "champ_exp"])

#iterating through each year of the championship
for i in drivers_championship_comparison["year"].unique():
    #each line sets the current year's champion
    champions.loc[i, 'champ_real'] = champion_of_season(i,"points")
    champions.loc[i, 'champ_10_18'] = champion_of_season(i,"points2010_18")
    champions.loc[i, 'champ_03_09'] = champion_of_season(i,"points2003_09")
    champions.loc[i, 'champ_91_02'] = champion_of_season(i,"points1991_2002")
    champions.loc[i, 'champ_61_90'] = champion_of_season(i,"points1961_90")
    champions.loc[i, 'champ_60'] = champion_of_season(i,"points1960")
    champions.loc[i, 'champ_50_59'] = champion_of_season(i,"points1950_59")
    champions.loc[i, 'champ_reverse'] = champion_of_season(i,"pointsReverse")
    champions.loc[i, 'champ_exp'] = champion_of_season(i,"pointsExp")

print("Here are the champions under each system for each year:")
champions.head()
Here are the champions under each system for each year:
Out[18]:
champ_real champ_10_18 champ_03_09 champ_91_02 champ_61_90 champ_60 champ_50_59 champ_reverse champ_exp
1950 642.0 642.0 786.0 642.0 642.0 786.0 786.0 786.0 642.0
1951 579.0 579.0 579.0 579.0 579.0 579.0 579.0 579.0 579.0
1952 647.0 647.0 647.0 647.0 647.0 647.0 647.0 647.0 647.0
1953 647.0 647.0 647.0 647.0 647.0 647.0 647.0 578.0 647.0
1954 579.0 579.0 579.0 579.0 579.0 579.0 579.0 579.0 579.0

We can also compute the entire championship standings for each season, which we will do in the next cell.
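As a compact illustration of season standings, here is a sketch on toy totals (made-up values) that ranks drivers within each year using `groupby(...).rank(...)`. This is an alternative to the explicit loops in the next cell, and `method="first"` breaks ties by row order, which is an assumption rather than the official F1 tiebreak.

```python
import pandas as pd

# toy season totals under one points system (values made up)
toy_totals = pd.DataFrame({
    "year":     [1999, 1999, 1999, 2000, 2000],
    "driverId": [1, 2, 3, 1, 2],
    "points":   [40, 55, 12, 70, 61],
})

# rank 1 = most points within each season
toy_totals["standing"] = (
    toy_totals.groupby("year")["points"]
    .rank(ascending=False, method="first")
    .astype(int)
)
# the champion of each season is the driver with standing 1
champions_by_year = toy_totals[toy_totals["standing"] == 1].set_index("year")["driverId"]
print(champions_by_year.to_dict())
```

Repeating the ranking for each points column would give one standings column per system.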

In [19]:
#grouping dataframes of each season
seasons = drivers_championship_comparison.groupby("year")
possible_finishes = pd.DataFrame(columns=list(races_in_year.index)[:-1])

for key, item in seasons:
    #creating dataframe to store places of each driver under each system
    place_finishes = pd.DataFrame(columns=points_systems)
    #for each system
    for system in points_systems:
        #ordering drivers by their championship order
        driver_order = item.set_index('driverId')[system].sort_values().iloc[::-1]
        #resetting index so driverId becomes a column and the row number represents championship position
        driver_order = driver_order.reset_index()
        #for each driver, saving their position to place_finishes
        for index, row in driver_order.iterrows():
            place_finishes.loc[index+1, system] = row['driverId']
        
    #making places columns and systems rows
    place_finishes = place_finishes.T
    #for each place, save the number of possible drivers to possible_finishes
    for place in list(place_finishes.columns):
        possible_finishes.loc[place, key] = len(place_finishes[place].unique())
        
#filtering to top 22 spots since there have never been more than 22 drivers to compete throughout a season
possible_finishes = possible_finishes[possible_finishes.index <= 22]
possible_finishes.head()
Out[19]:
1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
1 2 1 1 2 1 1 1 1 2 2 ... 2 1 2 1 2 1 2 1 2 1
2 3 1 1 3 1 1 2 1 3 2 ... 3 2 2 2 2 1 2 1 2 2
3 3 1 1 2 1 1 2 4 2 4 ... 3 3 3 3 3 3 2 1 1 2
4 3 1 3 3 1 2 2 4 4 2 ... 2 2 3 3 3 3 2 2 2 2
5 3 2 2 2 2 4 3 5 4 4 ... 2 2 2 2 3 3 2 2 2 4

5 rows × 68 columns

In [20]:
ax = sns.heatmap(possible_finishes.fillna(0).T)
ax.invert_xaxis()

This heatmap demonstrates the uncertainty of championship results under different systems, sometimes having as many as 6 drivers ending up in a specific rank, and 3 different potential championship winners. In fact, there are a number of drivers who never won a championship that would have if they raced under a different scoring system. Let's look at those drivers.

In [21]:
#getting a list of the actual champions
real_winners = champions["champ_real"].unique()

#getting a list of the champions from other systems
#this is done by creating a set, which is a key-only dictionary, to avoid duplicates and then converting it to a list
fake_winners = list(set().union(champions["champ_10_18"].unique(), 
                                champions["champ_03_09"].unique(), 
                                champions["champ_91_02"].unique(), 
                                champions["champ_61_90"].unique(),
                                champions["champ_60"].unique(),
                                champions["champ_50_59"].unique(),
                                champions["champ_reverse"].unique(),
                               ))


#empty list for drivers who would have won a championship in a different system
new_winners = []
print("The following is the list of all actual Formula 1 world champions from 1950 to 2017:")
for i in real_winners:
    print(driverIdMap[i])

for i in fake_winners:
    if(i not in real_winners):
        new_winners.append(i)
The following is the list of all actual Formula 1 world champions from 1950 to 2017:
Nino Farina
Juan Fangio
Alberto Ascari
Mike Hawthorn
Jack Brabham
Phil Hill
Graham Hill
Jim Clark
Denny Hulme
Jackie Stewart
Jochen Rindt
Emerson Fittipaldi
Niki Lauda
James Hunt
Mario Andretti
Jody Scheckter
Alan Jones
Nelson Piquet
Keke Rosberg
Alain Prost
Ayrton Senna
Nigel Mansell
Michael Schumacher
Damon Hill
Jacques Villeneuve
Mika Häkkinen
Fernando Alonso
Kimi Räikkönen
Lewis Hamilton
Jenson Button
Sebastian Vettel
Nico Rosberg
In [22]:
print("The following is the list of drivers who could have won a championship if F1 used a different points system at the time:")
for i in new_winners:
    print(driverIdMap[i])
The following is the list of drivers who could have won a championship if F1 used a different points system at the time:
Richie Ginther
Felipe Massa
Luigi Fagioli
Maurice Trintignant
Eddie Irvine
Carlos Reutemann
Stirling Moss
Clay Regazzoni
Bruce McLaren
Jacky Ickx
Dan Gurney

So now we've established a list of drivers who would have won a championship if they used a different points system. Let's compare this to the drivers mentioned in the top 3 Google search results for "formula 1 drivers who should've won a championship". Here are the 3 top results for reference:

Ranking the Best Formula 1 Drivers to Never Win the World Title

The 6 best F1 drivers never to win a championship

Top 10 Formula One drivers who have never won a World Championship

The drivers on these lists are:

  • Didier Pironi
  • Jacky Ickx
  • Tony Brooks
  • Clay Regazzoni
  • Wolfgang von Trips
  • Stirling Moss
  • Mark Webber
  • Gerhard Berger
  • Rubens Barrichello
  • Felipe Massa
  • Gilles Villeneuve
  • Carlos Reutemann
  • Ronnie Peterson
  • David Coulthard

Let's save this list in our code:

In [23]:
#hard coding search results
internet_drivers = ['Didier Pironi','Jacky Ickx','Tony Brooks','Clay Regazzoni','Wolfgang von Trips','Stirling Moss','Mark Webber','Gerhard Berger','Rubens Barrichello','Felipe Massa','Gilles Villeneuve','Carlos Reutemann','Ronnie Peterson','David Coulthard']
#new list for names of new winners
new_winner_names = new_winners.copy()

#gathering names from new_winners
for i in range(len(new_winner_names)):
    new_winner_names[i] = driverIdMap[new_winner_names[i]]
    
#finding drivers in both lists
drivers_intersection = [value for value in new_winner_names if value in internet_drivers]

print("Out of the " + str(len(new_winners)) + " drivers that could have won championships, " +
     str(len(drivers_intersection)) + " of them were listed in the aforementioned search results.")
print("Here are those drivers:")
print(drivers_intersection)
Out of the 11 drivers that could have won championships, 5 of them were listed in the aforementioned search results.
Here are those drivers:
['Felipe Massa', 'Carlos Reutemann', 'Stirling Moss', 'Clay Regazzoni', 'Jacky Ickx']
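The list comparisons in the last few cells can also be written with set operations. This is a sketch with placeholder names, not the real driver lists.

```python
# toy stand-ins; the real lists come from the champions dataframe and the search results
real_champions = {"A", "B", "C"}
alt_champions = {"B", "C", "D", "E"}
listed_online = {"E", "F"}

# champions under some alternative system who never won a real title
new_only = alt_champions - real_champions
# of those, the ones that also show up in the hand-collected list
overlap = new_only & listed_online
print(sorted(new_only), sorted(overlap))
```

Sets do not keep order, so sorting before printing keeps the output stable between runs.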

Now that we've identified some key changes under different systems and established that the Formula 1 fanbase agrees that these drivers deserved championships, let's research the seasons they would have won and find out what weaknesses in the scoring systems led to their defeat.

Comparing Actual Champions to Potential Champions

Looking into the seasons these drivers almost won reveals that many of them are contentious.

Stirling Moss won 4 of the 11 races in the 1958 season, while Mike Hawthorn won only 1, but Hawthorn's slew of 2nd place finishes kept him in the lead of the championship. This is perhaps the clearest case of a poorly decided championship, but many would argue that Moss's 5 DNFs are good reason to settle for 2nd place.

Jacky Ickx could have surpassed Jochen Rindt, who died on track partway through the 1970 season and did not compete in the last 4 races but still held on to the title thanks to his 5 race wins.

Clay Regazzoni finished only 3 points behind Emerson Fittipaldi in 1974. Although Fittipaldi had 3 race wins to Regazzoni's 1, the latter had a higher average finishing position.

Felipe Massa lost the 2008 championship by just 1 point in spectacular fashion in the last race of the season and many fans believe he should have been crowned champion instead of Hamilton.

Now that we've seen the effects of different point systems, it is time to evaluate them. Since there is no objective measure of a system's correctness, we will instead attempt to correlate each system with different features of a driver's season, including average finishing place, race wins, and podiums (races where they finished in the top 3).
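One way such a correlation could be computed is a rank (Spearman) correlation, since championship positions are ordinal. The notebook does not commit to a particular measure at this point, so treat this as an assumption. The sketch below uses made-up driver-seasons and relies on the fact that Spearman correlation is just Pearson correlation on ranks.

```python
import pandas as pd

# toy driver-seasons (made-up numbers): features and final championship position
toy = pd.DataFrame({
    "wins":             [5, 2, 0, 1, 0],
    "podiums":          [10, 6, 2, 3, 0],
    "average_position": [3.1, 5.0, 9.4, 7.2, 12.0],
    "position":         [1, 2, 4, 3, 5],
})

# ranking every column first, then taking the ordinary (Pearson) correlation
rank_corr = toy.rank().corr()
print(rank_corr["position"])
```

A lower championship position is better, so wins and podiums should correlate negatively with it while average finishing position should correlate positively.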

In [25]:
#grouping results info by driver and year
driver_seasons = results_df.merge(races_df[["raceId", "year"]], on='raceId').groupby(["year", "driverId"])

drivers_season_standings["average_position"] = 0
drivers_season_standings["podiums"] = 0

#iterating through groups to add additional stats to drivers_season_standings
for i, df in driver_seasons:
    #finding index of driver's season in drivers_season_standings
    seasons_of_driver = drivers_season_standings[drivers_season_standings['driverId'] == i[1]]
    dss_index = seasons_of_driver[seasons_of_driver['year'] == i[0]].index
    #computing podiums scored
    num_podiums = df[df["position"] <= 3].shape[0]
    #saving podiums to drivers_season_standings
    drivers_season_standings.iloc[dss_index, drivers_season_standings.columns.get_loc('podiums')] = num_podiums
    #computing average finishing position
    competed_races = df.shape[0]
    sum_finishes = df['positionOrder'].sum()
    missed_races = races_in_year[i[0]] - competed_races
    #missed races are filled with a 12th place finish
    #this number is arbitrary but chosen because 12th place is outside of points in the real systems
    #and about halfway down the grid in most races
    fill_finishes = missed_races * 12
    average_finish = (sum_finishes + fill_finishes) / races_in_year[i[0]]
    drivers_season_standings.iloc[dss_index, drivers_season_standings.columns.get_loc('average_position')] = average_finish
    
    
championship_vs_results = drivers_season_standings[['driverId', 'year', 'wins', 'average_position', 'podiums', 'position']].copy()
championship_vs_results.head()
Out[25]:
driverId year wins average_position podiums position
0 1 2008 5 5.222222 10 1
1 2 2008 0 6.500000 4 6
2 3 2008 0 10.888889 2 13
3 4 2008 2 7.444444 3 5
4 5 2008 1 7.888889 3 7
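
The per-driver loop above could also be written as a single groupby aggregation. The following is only a sketch on toy data, not the code that produced the table: the values in `results` and `races_in_year` are made up, `FILL_POSITION` and `season_stats` are hypothetical names, and it assumes a pandas version with named aggregation (0.25 or later).

```python
import pandas as pd

# Toy stand-in for results_df merged with the race year (made-up values).
# position is NaN when a driver was not classified; positionOrder is always set.
results = pd.DataFrame({
    "year":          [2008, 2008, 2008, 2008, 2008],
    "driverId":      [1, 1, 1, 2, 2],
    "position":      [1, 3, None, 2, 4],
    "positionOrder": [1, 3, 18, 2, 4],
})
# Number of races held per season, playing the role of races_in_year (made up).
races_in_year = pd.Series({2008: 3})
FILL_POSITION = 12  # hypothetical name for the fill value used for missed races

# Flag podium finishes; comparisons against NaN evaluate to False.
results["podium"] = results["position"] <= 3

# One row per (year, driverId) with the tallies the loop computes.
season_stats = (
    results.groupby(["year", "driverId"])
    .agg(sum_finishes=("positionOrder", "sum"),
         competed_races=("positionOrder", "count"),
         podiums=("podium", "sum"))
    .reset_index()
)

# Fill missed races with FILL_POSITION and average over the full season.
total_races = season_stats["year"].map(races_in_year)
missed_races = total_races - season_stats["competed_races"]
season_stats["average_position"] = (
    season_stats["sum_finishes"] + missed_races * FILL_POSITION
) / total_races

print(season_stats[["year", "driverId", "podiums", "average_position"]])
```

The result could then be merged into drivers_season_standings on year and driverId instead of being written row by row.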

Now we have a dataframe with the important features of a driver's season and their finishing place in the actual championship. To compare the different systems, we need to add their finishing position in the systems we used.

In [26]:
#creating new columns for each scoring system
for sys in points_systems:
    col_name = 'position' + sys[6:]
    championship_vs_results[col_name] = 0


for key, item in seasons:
    #computing each driver's place in this season under each system
    for system in points_systems:
        col_name = 'position' + system[6:]
        #ordering drivers from most to fewest points under this system
        driver_order = item.set_index('driverId')[system].sort_values().iloc[::-1]
        #resetting index so the row number represents championship position
        #and driverId becomes a regular column
        driver_order = driver_order.reset_index()
        #for each driver, saving their position to championship_vs_results
        for index, row in driver_order.iterrows():
            driver_idx = championship_vs_results[championship_vs_results['driverId'] == row['driverId']]
            driver_year_idx = driver_idx[driver_idx['year'] == key].index
            championship_vs_results.loc[driver_year_idx, col_name] = index+1

championship_vs_results.head()
Out[26]:
driverId year wins average_position podiums position position2010_18 position2003_09 position1991_2002 position1961_90 position1960 position1950_59 positionReverse positionExp
0 1 2008 5 5.222222 10 1 1 1 2 2 2 1 1 2
1 2 2008 0 6.500000 4 6 6 6 6 6 6 6 2 6
2 3 2008 0 10.888889 2 13 12 13 11 11 11 11 11 10
3 4 2008 2 7.444444 3 5 5 5 5 5 5 5 6 5
4 5 2008 1 7.888889 3 7 7 7 7 7 7 7 7 7
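
The nested loops above assign each driver a place within a season under each points system. A possible vectorized alternative is a per-season rank. The sketch below uses made-up totals and assumes the system columns are named like `points2010_18` and `pointsReverse`, which is inferred from the `'position' + system[6:]` naming. Ties are broken by row order here, which may differ from the tie order the loop produces.

```python
import pandas as pd

# Toy season totals under two assumed points-system columns (made-up values).
totals = pd.DataFrame({
    "year":          [2008, 2008, 2008, 2009, 2009],
    "driverId":      [1, 2, 3, 1, 2],
    "points2010_18": [240.0, 242.0, 190.0, 230.0, 200.0],
    "pointsReverse": [300.0, 280.0, 290.0, 310.0, 305.0],
})
points_systems = ["points2010_18", "pointsReverse"]

for system in points_systems:
    col_name = "position" + system[len("points"):]
    # Rank within each season: most points gets position 1.
    totals[col_name] = (
        totals.groupby("year")[system]
        .rank(ascending=False, method="first")
        .astype(int)
    )

print(totals)
```

Because the rank is computed on the same rows that hold the totals, no lookup by driverId and year is needed afterwards.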

Now that we have all of the championship positions, we can look at their correlations with wins, average position, and podiums. Note that since average position and championship position are rankings (with 1 being the best), they are expected to correlate negatively with wins and podiums, which are tallies rather than rankings.

In [27]:
feature_vs_position = championship_vs_results[championship_vs_results.columns[2:]].corr()
feature_vs_position
Out[27]:
wins average_position podiums position position2010_18 position2003_09 position1991_2002 position1961_90 position1960 position1950_59 positionReverse positionExp
wins 1.000000 -0.505024 0.832482 -0.366202 -0.368934 -0.367450 -0.369465 -0.368635 -0.367914 -0.365713 -0.340833 -0.366381
average_position -0.505024 1.000000 -0.634594 0.473604 0.488625 0.486182 0.466961 0.467136 0.467199 0.449950 0.467488 0.461369
podiums 0.832482 -0.634594 1.000000 -0.509152 -0.510688 -0.510472 -0.511910 -0.511922 -0.511891 -0.509981 -0.473397 -0.504509
position -0.366202 0.473604 -0.509152 1.000000 0.927733 0.888095 0.827803 0.827790 0.827801 0.769950 0.930510 0.968497
position2010_18 -0.368934 0.488625 -0.510688 0.927733 1.000000 0.955706 0.874139 0.874215 0.874323 0.812335 0.926288 0.947040
position2003_09 -0.367450 0.486182 -0.510472 0.888095 0.955706 1.000000 0.917159 0.917267 0.917407 0.852254 0.875041 0.908787
position1991_2002 -0.369465 0.466961 -0.511910 0.827803 0.874139 0.917159 1.000000 0.999937 0.999737 0.935494 0.792717 0.849977
position1961_90 -0.368635 0.467136 -0.511922 0.827790 0.874215 0.917267 0.999937 1.000000 0.999814 0.935546 0.792835 0.849861
position1960 -0.367914 0.467199 -0.511891 0.827801 0.874323 0.917407 0.999737 0.999814 1.000000 0.935808 0.793081 0.849829
position1950_59 -0.365713 0.449950 -0.509981 0.769950 0.812335 0.852254 0.935494 0.935546 0.935808 1.000000 0.733165 0.792368
positionReverse -0.340833 0.467488 -0.473397 0.930510 0.926288 0.875041 0.792717 0.792835 0.793081 0.733165 1.000000 0.960124
positionExp -0.366381 0.461369 -0.504509 0.968497 0.947040 0.908787 0.849977 0.849861 0.849829 0.792368 0.960124 1.000000

Conclusion

These correlation results are astoundingly similar across the different points systems. All of the systems, with the exception of reverse linear (positionReverse), have nearly the same correlation with wins and podiums (to within about 0.01). The only observable trend is that the points systems used by Formula 1 have evolved to favor average_position, but only by a slight margin, and this may also be due to drivers competing more consistently in races as time goes on, which changes our calculation of average finishing position. All of the systems correlate more strongly (in absolute value) with average position and the number of podiums than with the number of wins, suggesting that all of the systems reward consistency.
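
As a quick check of the "within about 0.01" observation, the spread of each feature's correlation across systems can be computed directly. The sketch below re-enters the relevant values from the correlation table by hand. The names `feature_corr`, `spread_all`, and `spread_no_reverse` are hypothetical; in the notebook the same block could presumably be sliced out of feature_vs_position with `.loc`.

```python
import pandas as pd

# Correlations copied from the table above (rows: features, columns: systems).
systems = ["position2010_18", "position2003_09", "position1991_2002",
           "position1961_90", "position1960", "position1950_59",
           "positionReverse", "positionExp"]
feature_corr = pd.DataFrame(
    [[-0.368934, -0.367450, -0.369465, -0.368635,
      -0.367914, -0.365713, -0.340833, -0.366381],
     [0.488625, 0.486182, 0.466961, 0.467136,
      0.467199, 0.449950, 0.467488, 0.461369],
     [-0.510688, -0.510472, -0.511910, -0.511922,
      -0.511891, -0.509981, -0.473397, -0.504509]],
    index=["wins", "average_position", "podiums"],
    columns=systems,
)

# Spread (max minus min) of each feature's correlation across systems,
# with and without the reverse linear system.
spread_all = feature_corr.max(axis=1) - feature_corr.min(axis=1)
no_reverse = feature_corr.drop(columns="positionReverse")
spread_no_reverse = no_reverse.max(axis=1) - no_reverse.min(axis=1)

print(pd.DataFrame({"all_systems": spread_all,
                    "without_reverse": spread_no_reverse}).round(4))
```

With positionReverse excluded, the spread for wins and podiums stays below 0.01, while average_position varies by roughly 0.04 across systems.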

Another interesting comparison that this correlation DataFrame allows us to make is how closely each system's standings agree with the actual championship standings. Surprisingly, positionReverse and positionExp, the two most drastically different scoring systems, had the highest correlation with actual championship position.
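
To make this comparison easier to read, the `position` row of the correlation table can be sorted. The sketch below re-enters that row by hand. `agreement` and `ranked` are hypothetical names; in the notebook the same Series could presumably be taken from feature_vs_position directly.

```python
import pandas as pd

# Correlation of each recomputed ranking with the actual championship
# position, copied from the `position` row of the table above.
agreement = pd.Series({
    "position2010_18": 0.927733,
    "position2003_09": 0.888095,
    "position1991_2002": 0.827803,
    "position1961_90": 0.827790,
    "position1960": 0.827801,
    "position1950_59": 0.769950,
    "positionReverse": 0.930510,
    "positionExp": 0.968497,
})

# Highest agreement with the actual standings first.
ranked = agreement.sort_values(ascending=False)
print(ranked)
```

Sorted this way, positionExp comes first and positionReverse second, although position2010_18 trails positionReverse by less than 0.003.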

While these results are inconclusive, they show that Formula 1's scoring system has kept the same fundamental priorities over time, despite the many different championship outcomes the systems produce.
