In this Jupyter notebook, we clean and explore a dataset of 50,000 used car listings from eBay Kleinanzeigen.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.head())
print(autos.info())
The offerType, vehicleType, gearbox, and notRepairedDamage columns contain German values. The name column also has German content, but that isn't as critical to translate. odometer and price are both strings because of their units ('km' and '$' respectively) and their place markers.
import re
# regex from here: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
camel_to_snake = re.compile(r'(?<!^)(?=[A-Z])')
# done separately because these aren't a simple camel -> snake conversion
renames = {
'yearOfRegistration': 'registration_year',
'monthOfRegistration': 'registration_month',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created'
}
new_cols = []
for col in autos.columns:
    if col in renames:
        new_cols.append(renames[col])
    else:
        new_cols.append(camel_to_snake.sub('_', col).lower())
autos.columns = new_cols
print(autos.head())
To make the analysis more Pythonic, we're converting the original camelCase column names to snake_case using the following regular expression: (?<!^)(?=[A-Z])

In addition to this conversion, we're renaming some columns to make it clearer what they refer to. Those renames are as follows:

- yearOfRegistration to registration_year
- monthOfRegistration to registration_month
- notRepairedDamage to unrepaired_damage
- dateCreated to ad_created
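As a quick illustration (not part of the original notebook), here is what that regex does to a couple of the raw column names:
import re
camel_to_snake = re.compile(r'(?<!^)(?=[A-Z])')
# inserts '_' before each uppercase letter that isn't at the start, then we lowercase
print(camel_to_snake.sub('_', 'vehicleType').lower())  # vehicle_type
print(camel_to_snake.sub('_', 'powerPS').lower())      # power_p_s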
autos.describe(include='all')
# nr_of_pictures is zero for every row; dropping it (along with seller and offer_type) for simplicity
autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1, inplace=True)
# strip the currency/unit characters literally (regex=False) before converting to int
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
print(autos.head())
Here we've removed the nr_of_pictures column because it only contains the value 0, and as such does not add anything to our analysis; the seller and offer_type columns have been dropped as well. Additionally, we've removed the non-numeric characters (',', 'km', and '$') from the odometer and price columns (with odometer renamed to odometer_km) and converted their values to integers.
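The same cleanup could also be written as a single regex replacement followed by pd.to_numeric; a minimal sketch on made-up example strings in the same format:
import pandas as pd
raw_prices = pd.Series(['$5,000', '$18,300'])        # illustrative strings only
print(pd.to_numeric(raw_prices.str.replace(r'[$,]', '', regex=True)))
raw_odometer = pd.Series(['150,000km', '70,000km'])  # illustrative strings only
print(pd.to_numeric(raw_odometer.str.replace(r'[km,]', '', regex=True)))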
print('odometer_km')
print(autos['odometer_km'].unique().shape)
print('\n')
print(autos['odometer_km'].describe())
print('\n')
print(autos['odometer_km'].value_counts().sort_index(ascending=True))
print('\n\n')
print('price')
print(autos[autos['price'].between(100, 20000)]['price'].unique().shape)
print('\n')
print(autos[autos['price'].between(100, 20000)]['price'].describe())
print('\n')
print(autos[autos['price'].between(100, 20000)]['price'].value_counts().sort_index(ascending=True))
# removing price outliers
autos = autos[autos['price'].between(100, 20000)]
print(autos.describe())
After looking for outliers in the odometer_km and price columns, we have removed some rows based on their price. The odometer_km values are not continuous; they come pre-rounded into discrete buckets spaced 5,000, 10,000, or 25,000 km apart, and the counts climb with mileage, with the majority of listings (30,877) at 150,000 km. Because of that, it didn't make sense to remove outliers based on the odometer_km column, and it may be interesting to look at price differences across those mileage buckets.

price was a different matter. Most (>75%) price values were under 10,000, with a long tail reaching all the way past 1,000,000. We have trimmed that tail by dropping values over 20,000 and under 100, which removed just under 5,000 rows of very highly priced and essentially free cars.
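As a possible follow-up (not carried out here), the discrete odometer_km buckets make it easy to compare average prices across mileage levels with a single groupby:
# mean price within each odometer_km bucket, lowest mileage first
print(autos.groupby('odometer_km')['price'].mean().sort_index())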
date_cols = ['date_crawled', 'ad_created', 'last_seen']
for col in date_cols:
    print(autos[col].str[:10].value_counts(normalize=True, dropna=False).sort_index())
print(autos['registration_year'].describe())
The date_crawled column is fairly evenly distributed (with about 3% of listings each day), and likely doesn't hold much predictive or analytical interest, as it was generated by the crawler itself rather than by the people placing the ads.

ad_created seems to ramp up as the year progresses: earlier values, from August 2015 through early March 2016, make up tiny fractions of a percent of the data. After early March 2016, the shares pick up, with each date bringing roughly 3% of the dataset per day, before dropping off again after early April.

last_seen is an interesting column. Its daily shares start very low (<1%) in early March and slowly ramp up to around 1.5% per day by mid-March and very early April. Then there is a huge spike between April 5 and 7: 12.2% of listings were last seen on 4/5, 21.7% on 4/6, and 12.8% on 4/7, the end of the dataset.
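To pull those spike dates out directly, the daily shares can also be sorted by size rather than by date; a small convenience snippet, not in the original analysis:
# largest daily last_seen shares; the top entries are the final days of the crawl
last_seen_shares = autos['last_seen'].str[:10].value_counts(normalize=True)
print(last_seen_shares.sort_values(ascending=False).head(3))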
The distribution of registration_year reveals some data quality problems, notably around the outliers. The minimum value for registration_year is 1000, and the maximum is 9999. Assuming that time travel is not a factor, we can safely ignore values from more than ~100 years in the past and from any year in the future.
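Before dropping anything, a quick count of how many rows fall outside that plausible window (a small check, not part of the original notebook):
# rows with registration years outside 1906-2016
print((~autos['registration_year'].between(1906, 2016)).sum())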
outliers_removed = autos[autos['registration_year'].between(1906, 2016)]
print(outliers_removed.describe())
autos = outliers_removed
Because of the data quality problems revealed in the previous analysis, we have removed all rows with registration years before 1906 (predating even the Ford Model T) and after 2016 (the year in which this data was collected).
# keep only brands that make up at least 1% of listings
brand_shares = autos['brand'].value_counts(normalize=True, dropna=False)
above_1_brands = brand_shares[brand_shares >= 0.01].index
autos = autos[autos['brand'].isin(above_1_brands)]
We have decided to include in this analysis all brands with a share of over 1% of the total listings. We are also excluding sonstige_autos from the brand-level averages; it translates to 'other cars', and is therefore an aggregation of brands that each represent less than 0.06% (the share of Lada, the least-represented named brand) of the total.
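The per-brand averages computed by the loop below could also be cross-checked with a single groupby (a sketch, excluding sonstige_autos just as the loop does):
brand_mean_price = (autos[autos['brand'] != 'sonstige_autos']
                    .groupby('brand')['price']
                    .mean()
                    .sort_values(ascending=False))
print(brand_mean_price)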
price_avgs = {}
for el in above_1_brands:
    if el == 'sonstige_autos':
        continue
    price_avgs[el] = autos[autos['brand'] == el]['price'].mean()
sorted_avgs = sorted(price_avgs.items(), key=lambda x: x[1], reverse=True)
for el in sorted_avgs:
    print(el)
It makes sense that the luxury German brands Audi, BMW, and Mercedes-Benz take the top three spots, in that order. After that top cluster, we find a relatively smooth progression of primarily European and Japanese brands down to Renault in last place.
bmp_series = pd.Series(price_avgs)
print(bmp_series)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
print(df)
mileage_avgs = {}
for el in above_1_brands:
    if el == 'sonstige_autos':
        continue
    mileage_avgs[el] = autos[autos['brand'] == el]['odometer_km'].mean()
odometer_series = pd.Series(mileage_avgs)
df['mean_mileage'] = odometer_series
print(df)
There doesn't seem to be much of a relationship between price and mileage when broken down by brand. With two exceptions (Hyundai and Smart), all of the brands' mean mileages fall between 115,000 and 137,000 km. If we were to aggregate all the cars and look for a relationship between mileage and price, a clearer trend might emerge.
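A quick way to check that aggregate relationship (a sketch of the suggested follow-up, not performed above) is the correlation between the two columns across all remaining listings:
# a negative correlation would mean higher-mileage cars tend to sell for less
print(autos['odometer_km'].corr(autos['price']))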