I recently started playing with Hadoop to learn some data analysis. To make it more fun, I decided to fiddle with F1 datasets, specifically the Constructor And Overall Race Results. In this post, I'll describe the three main steps involved in this task: data preparation, followed by the two MapReduce tasks, mapping and reducing. Let's get started!
Data Preparation
The first thing we need for data analysis is data. The source for this project is the Ergast Developer API, which lets you query F1 data, such as race results, for every F1 race since 1950.
To download all of the race data, I wrote a Python script that fetches every race since 1950. The request URL is generated by the following snippet:
# year = year of the race.
# rnd  = integer representing the current round.
results_url = "http://ergast.com/api/f1/{0}/{1}/results.json".format(year, rnd)
The problem with the source API is that it returns nested, non-flat JSON. Non-flat JSON will be a problem once we are in MapReduce, so we must flatten it now. To do this, we use Python's json library to parse the response and write each record out as a single line, producing a flat file. The snippet that builds each flattened line is:
results += "{0},{1},{2},{3},{4},{5},{6}\n".format(
    year,                                # Year of the race.
    circuit,                             # Circuit name.
    constructor_json["constructorId"],   # constructorId
    driver_json["driverId"],             # driverId
    status,                              # Status. Finished, +1 Laps, +2 Laps, ..., DNQ, ...
    time_milli,                          # Time it took to finish the race. -1 if DNF.
    position)                            # Position at race end.
Now that the JSON has been flattened into a CSV file, we are good to go.
The full snippet is in a Github gist below:
To run:
./f1-results.py output/
Once we have the data, we can finally do some MapReduce.
MapReduce
In this section, I'll use Java to create the Mapper and Reducer for our MapReduce job.
Mapper
For our mapper, we extend Mapper with the input type arguments LongWritable and Text. The first is the byte offset of the line in our results file, which we don't use. The second is the line of text itself, for instance:
1950,silverstone,alfa,farina,Finished,8003600,1
Since we are interested in the constructor team and its position in each race, we extract those two fields from the CSV line above: alfa and 1. These are also the output key and value of our mapper, hence the third and fourth type arguments of our Mapper class are Text and IntWritable.
Furthermore, since we want the lowest (best) finishing position to have the highest value, we invert the position by subtracting it from 50:
\(\text{PositionValue} = 50 - \text{Position}\)
We chose 50 somewhat arbitrarily, since the chance of having more than 50 drivers in a field is very low; subtracting from 50 also keeps the position value positive. That isn't strictly necessary, but it makes our task of creating bar charts easier.
The full Java code for doing this is shown below:
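To give a concrete idea of the shape of that code, here is a minimal sketch (not the exact gist code), assuming Hadoop's org.apache.hadoop.mapreduce API and the CSV layout shown above, with constructorId in the third column and the finishing position in the last; the class name ConstructorPositionMapper is just a placeholder:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: emits (constructorId, 50 - position) for each CSV line.
public class ConstructorPositionMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Example input line: 1950,silverstone,alfa,farina,Finished,8003600,1
        String[] fields = value.toString().split(",");
        if (fields.length < 7) {
            return; // skip malformed lines
        }
        String constructor = fields[2];             // constructorId
        int position;
        try {
            position = Integer.parseInt(fields[6]); // finishing position
        } catch (NumberFormatException e) {
            return;                                 // skip non-numeric positions
        }
        int positionValue = 50 - position;          // invert so 1st place scores highest
        context.write(new Text(constructor), new IntWritable(positionValue));
    }
}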
Reducer
After the mapper emits the position value for each team, the reducer sums these values per team. For instance, a small subset of the mapped output could be:
alfa,49
alfa,48
mercedez,49
mercedez,48
alfa,47
Again, note that the position value is:
\(\text{PositionValue} = 50 - \text{Position}\)
So alfa’s 1st, 2nd, 3rd place would have a position value of 49, 48, and 47.
For each key, which is the constructor team, we are given a list of mapped values. For alfa this would be 49, 48, 47 and for mercedez, 49, 48. Our reducer sums all of these values, so the result would be:
alfa,144
mercedez,97
The full snippet of our reducer code is the following:
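Again as a rough illustration rather than the exact gist code, a reducer that sums the position values per constructor, using the same Hadoop mapreduce API and a made-up class name, could look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: sums the mapped position values for each constructor.
public class ConstructorPositionReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text constructor, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();   // e.g. 49 + 48 + 47 for alfa
        }
        context.write(constructor, new IntWritable(total));
    }
}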
Results
For the results, I suggest reading, or skipping to, the conclusion section of my Constructor And Overall Race Results post.