Exploring Ebay Car Sales Data

In this Jupyter notebook, we're going to be doing some cleaning and exploration of a dataset with 50,000 points of used car sales data from eBay Kleinanzeigen

In [2]:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.head())
print(autos.info())
           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004   
1  privat   Angebot  $8,500  control   limousine                1997   
2  privat   Angebot  $8,990     test   limousine                2009   
3  privat   Angebot  $4,350  control  kleinwagen                2007   
4  privat   Angebot  $1,350     test       kombi                2003   

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere  150,000km                    3      lpg   
1  automatik      286     7er  150,000km                    6   benzin   
2    manuell      102    golf   70,000km                    7   benzin   
3  automatik       71  fortwo   70,000km                    6   benzin   
4    manuell        0   focus  150,000km                    7   benzin   

        brand notRepairedDamage          dateCreated  nrOfPictures  \
0     peugeot              nein  2016-03-26 00:00:00             0   
1         bmw              nein  2016-04-04 00:00:00             0   
2  volkswagen              nein  2016-03-26 00:00:00             0   
3       smart              nein  2016-03-12 00:00:00             0   
4        ford              nein  2016-04-01 00:00:00             0   

   postalCode             lastSeen  
0       79588  2016-04-06 06:45:54  
1       71034  2016-04-06 14:45:08  
2       35394  2016-04-06 20:15:37  
3       33729  2016-03-15 03:16:28  
4       39218  2016-04-01 14:38:50  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None

Initial Notes

  • Many of these columns are in German and may need to be translated to make the analysis make the most sense for an English-language analysis. Columns with German content: offerType, vehicleType, gearbox, notRepairedDamage
  • The name column also has German content, but that isn't as critical to translate
  • odometer and price are both strings because of their units ('km' and '$' respectivley) and their place markers
In [3]:
import re
# regex from here: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
camel_to_snake = re.compile(r'(?<!^)(?=[A-Z])')
# done separately because these aren't a simple camel -> snake conversion
renames = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created'
}
new_cols = []
for col in autos.columns:
    if col in renames:
        new_cols.append(renames[col])
    else:
        new_cols.append(camel_to_snake.sub('_', col).lower())

autos.columns = new_cols
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offer_type   price   abtest vehicle_type  registration_year  \
0  privat    Angebot  $5,000  control          bus               2004   
1  privat    Angebot  $8,500  control    limousine               1997   
2  privat    Angebot  $8,990     test    limousine               2009   
3  privat    Angebot  $4,350  control   kleinwagen               2007   
4  privat    Angebot  $1,350     test        kombi               2003   

     gearbox  power_p_s   model   odometer  registration_month fuel_type  \
0    manuell        158  andere  150,000km                   3       lpg   
1  automatik        286     7er  150,000km                   6    benzin   
2    manuell        102    golf   70,000km                   7    benzin   
3  automatik         71  fortwo   70,000km                   6    benzin   
4    manuell          0   focus  150,000km                   7    benzin   

        brand unrepaired_damage           ad_created  nr_of_pictures  \
0     peugeot              nein  2016-03-26 00:00:00               0   
1         bmw              nein  2016-04-04 00:00:00               0   
2  volkswagen              nein  2016-03-26 00:00:00               0   
3       smart              nein  2016-03-12 00:00:00               0   
4        ford              nein  2016-04-01 00:00:00               0   

   postal_code            last_seen  
0        79588  2016-04-06 06:45:54  
1        71034  2016-04-06 14:45:08  
2        35394  2016-04-06 20:15:37  
3        33729  2016-03-15 03:16:28  
4        39218  2016-04-01 14:38:50  

Renaming Columns

To make the analysis more Pythonic, we're converting the original camelCase column names to snake_case using the following regular expression: (?<!^)(?=[A-Z])

In addition to this converion, we're renaming some columns to make it clearer what they refer to. Those renames are as follows: yearOfRegistration to registration_year monthOfRegistration to registration_month notRepairedDamage to unrepaired_damage dateCreated to ad_created

In [4]:
autos.describe(include='all')
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_p_s model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-29 23:42:13 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [5]:
# nr_of_pictures is zero for all of these, dropping for simplicity
autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1, inplace=True)
autos['price'] = autos['price'].str.replace('$', '').str.replace(',', '').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '').astype(int)
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   price   abtest vehicle_type  registration_year    gearbox  power_p_s  \
0   5000  control          bus               2004    manuell        158   
1   8500  control    limousine               1997  automatik        286   
2   8990     test    limousine               2009    manuell        102   
3   4350  control   kleinwagen               2007  automatik         71   
4   1350     test        kombi               2003    manuell          0   

    model  odometer_km  registration_month fuel_type       brand  \
0  andere       150000                   3       lpg     peugeot   
1     7er       150000                   6    benzin         bmw   
2    golf        70000                   7    benzin  volkswagen   
3  fortwo        70000                   6    benzin       smart   
4   focus       150000                   7    benzin        ford   

  unrepaired_damage           ad_created  postal_code            last_seen  
0              nein  2016-03-26 00:00:00        79588  2016-04-06 06:45:54  
1              nein  2016-04-04 00:00:00        71034  2016-04-06 14:45:08  
2              nein  2016-03-26 00:00:00        35394  2016-04-06 20:15:37  
3              nein  2016-03-12 00:00:00        33729  2016-03-15 03:16:28  
4              nein  2016-04-01 00:00:00        39218  2016-04-01 14:38:50  

Cleaning Number Columns

Here we've removed the nr_of_pictures column because it only contains the value 0, and as such does not add anything to our analysis. Additionally, we've removed the non-numeric characters (,, km, and $) from the odometer column (now renamed odometer_km) and converting their values to numbers.

In [6]:
print('odometer_km')
print(autos['odometer_km'].unique().shape)
print('\n')
print(autos['odometer_km'].describe())
print('\n')
print(autos['odometer_km'].value_counts().sort_index(ascending=True))
print('\n\n')
print('price')
print(autos[autos['price'].between(100, 20000)]['price'].unique().shape)
print('\n')
print(autos[autos['price'].between(100, 20000)]['price'].describe())
print('\n')
print(autos[autos['price'].between(100, 20000)]['price'].value_counts().sort_index(ascending=True))

# removing price outliers
autos = autos[autos['price'].between(100, 20000)]
print(autos.describe())
odometer_km
(13,)


count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64



price
(1655,)


count    45891.000000
mean      4558.115556
std       4499.351331
min        100.000000
25%       1200.000000
50%       2899.000000
75%       6500.000000
max      20000.000000
Name: price, dtype: float64


100      134
110        3
111        2
115        2
117        1
120       39
122        1
125        8
129        1
130       15
135        1
139        1
140        9
145        2
149        7
150      224
156        2
160        8
170        7
173        1
175       12
179        1
180       35
185        1
188        1
190       16
193        1
195        2
198        1
199       41
        ... 
19450      2
19480      1
19490      4
19499      2
19500     83
19550      1
19599      1
19600      6
19650      3
19666      1
19690      1
19699      1
19700      4
19750      3
19777      1
19780      1
19800     20
19850      4
19890      2
19900     50
19911      1
19950      9
19968      1
19970      1
19980      1
19990     28
19995      4
19998      2
19999     28
20000     30
Name: price, Length: 1655, dtype: int64
              price  registration_year     power_p_s    odometer_km  \
count  45891.000000       45891.000000  45891.000000   45891.000000   
mean    4558.115556        2004.265695    112.139810  128738.423656   
std     4499.351331          81.626662    203.387274   37053.622378   
min      100.000000        1000.000000      0.000000    5000.000000   
25%     1200.000000        1999.000000     69.000000  125000.000000   
50%     2899.000000        2003.000000    105.000000  150000.000000   
75%     6500.000000        2008.000000    143.000000  150000.000000   
max    20000.000000        9999.000000  17700.000000  150000.000000   

       registration_month   postal_code  
count        45891.000000  45891.000000  
mean             5.783552  50668.145105  
std              3.691624  25722.360061  
min              0.000000   1067.000000  
25%              3.000000  30179.000000  
50%              6.000000  49477.000000  
75%              9.000000  71229.000000  
max             12.000000  99998.000000  

Price Outliers

After looking for outliers in the odometer_km and price columns, we have removed some values based on their price. The odometer_km values were not distributed discretely. Instead, they were distributed in increments of either 5-, 10-, or 25,000, and the number of listings tended upwards, with the majority of listings (30,877) at 150,000. Because of that, it didn't make sense to remove outliers based on the odometer_km column, as it may be interesting to look at price differences based on the various increments.

price was a different matter. Most (>75%) price values were under 10,000, with a long tail reaching all the way to over 1,000,000. We have removed that long tail by dropping values over 20,000 and under 100, which removed just under 5,000 rows of very highly priced cars and essentialy free cars.

In [7]:
date_cols = ['date_crawled', 'ad_created', 'last_seen']
for col in date_cols:
    print(autos[col].str[:10].value_counts(normalize=True, dropna=False).sort_index())

    
print(autos['registration_year'].describe())
2016-03-05    0.025495
2016-03-06    0.014055
2016-03-07    0.036216
2016-03-08    0.033405
2016-03-09    0.033013
2016-03-10    0.032490
2016-03-11    0.032207
2016-03-12    0.037349
2016-03-13    0.015711
2016-03-14    0.036979
2016-03-15    0.034146
2016-03-16    0.029723
2016-03-17    0.031531
2016-03-18    0.012835
2016-03-19    0.034582
2016-03-20    0.037720
2016-03-21    0.037066
2016-03-22    0.032882
2016-03-23    0.032316
2016-03-24    0.029461
2016-03-25    0.031553
2016-03-26    0.032664
2016-03-27    0.030681
2016-03-28    0.034865
2016-03-29    0.034124
2016-03-30    0.033841
2016-03-31    0.031640
2016-04-01    0.033296
2016-04-02    0.035410
2016-04-03    0.038548
2016-04-04    0.036652
2016-04-05    0.012987
2016-04-06    0.003181
2016-04-07    0.001373
Name: date_crawled, dtype: float64
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
2016-01-03    0.000022
2016-01-07    0.000022
2016-01-10    0.000044
2016-01-13    0.000022
2016-01-16    0.000022
2016-01-22    0.000022
2016-01-27    0.000065
2016-01-29    0.000022
2016-02-01    0.000022
2016-02-02    0.000044
2016-02-05    0.000044
2016-02-07    0.000022
2016-02-08    0.000022
2016-02-09    0.000022
2016-02-11    0.000022
2016-02-12    0.000044
2016-02-14    0.000044
2016-02-16    0.000022
2016-02-17    0.000022
2016-02-18    0.000044
2016-02-19    0.000065
2016-02-20    0.000044
2016-02-21    0.000065
2016-02-22    0.000022
2016-02-23    0.000087
                ...   
2016-03-09    0.033100
2016-03-10    0.032207
2016-03-11    0.032555
2016-03-12    0.037197
2016-03-13    0.017040
2016-03-14    0.035606
2016-03-15    0.033906
2016-03-16    0.030224
2016-03-17    0.031095
2016-03-18    0.013641
2016-03-19    0.033427
2016-03-20    0.037742
2016-03-21    0.037284
2016-03-22    0.032686
2016-03-23    0.032207
2016-03-24    0.029418
2016-03-25    0.031662
2016-03-26    0.032686
2016-03-27    0.030551
2016-03-28    0.035083
2016-03-29    0.034037
2016-03-30    0.033623
2016-03-31    0.031662
2016-04-01    0.033296
2016-04-02    0.035170
2016-04-03    0.038788
2016-04-04    0.037001
2016-04-05    0.011702
2016-04-06    0.003247
2016-04-07    0.001242
Name: ad_created, Length: 74, dtype: float64
2016-03-05    0.001133
2016-03-06    0.004489
2016-03-07    0.005557
2016-03-08    0.007627
2016-03-09    0.009915
2016-03-10    0.010917
2016-03-11    0.012835
2016-03-12    0.024580
2016-03-13    0.009152
2016-03-14    0.012704
2016-03-15    0.016082
2016-03-16    0.016801
2016-03-17    0.028720
2016-03-18    0.007409
2016-03-19    0.016169
2016-03-20    0.021050
2016-03-21    0.020854
2016-03-22    0.021813
2016-03-23    0.019023
2016-03-24    0.020091
2016-03-25    0.019655
2016-03-26    0.016910
2016-03-27    0.015820
2016-03-28    0.021355
2016-03-29    0.023011
2016-03-30    0.025059
2016-03-31    0.024210
2016-04-01    0.023055
2016-04-02    0.025190
2016-04-03    0.025299
2016-04-04    0.025038
2016-04-05    0.122464
2016-04-06    0.217515
2016-04-07    0.128500
Name: last_seen, dtype: float64
count    45891.000000
mean      2004.265695
std         81.626662
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Dates

The date_crawled column is faily evenly distributed (with about 3% of items each day), and likely doesn't hold much predictive or analytical interest, as it was not generated by the people putting these ads up, but rather by the crawler itself.

ad_created seems to ramp up as the year progresses, with earlier values from August 2015 all the way through March 2016 being very small parts of the data (as in hundred-thousandths of a percentage point). After early March 2016, the shares pick up, with each date bringing ~3% of the dataset per day, before starting to drop off again after early April.

last_seen is an interesting column. Values there start with very low values (<1%) in early march, and slowly ramp up to around 1.5% per day by mid March and ery early April. Then, there is a huge spike between April 5 and 7. 12.2% of listings were last viewed on 4/5, 21.7% on 4/6, and 12.8% on 4/7, the end of the set.

The distribution of registration_year reveals some data quality problems. Notably, around the outliers. The minimum value for registration_year is 1000, and the maxium is 9999. Assuming that time travel is not a factor, we can safely ignore values from further than ~100 years in the past, and from any years in the future.

In [8]:
outliers_removed = autos[autos['registration_year'].between(1906, 2016)]
print(outliers_removed.describe())
autos = outliers_removed
              price  registration_year     power_p_s    odometer_km  \
count  44037.000000       44037.000000  44037.000000   44037.000000   
mean    4606.213956        2002.619502    112.679202  128639.780185   
std     4526.515503           6.816194    187.018380   37054.503324   
min      100.000000        1910.000000      0.000000    5000.000000   
25%     1200.000000        1999.000000     71.000000  125000.000000   
50%     2900.000000        2003.000000    105.000000  150000.000000   
75%     6690.000000        2007.000000    143.000000  150000.000000   
max    20000.000000        2016.000000  17700.000000  150000.000000   

       registration_month   postal_code  
count        44037.000000  44037.000000  
mean             5.830574  50787.685605  
std              3.675900  25729.825500  
min              0.000000   1067.000000  
25%              3.000000  30451.000000  
50%              6.000000  49584.000000  
75%              9.000000  71334.000000  
max             12.000000  99998.000000  

Registration Year Cleaning

Because of the data quality problems revealed in the previous analysis, we have removed all values with registration years before 1907 (the release of the Ford Model T) and after 2016 (the year in which this data was collected).

In [9]:
above_1_pct_filter = autos['brand'].value_counts(normalize=True, dropna=False) >= 0.01
autos = autos.groupby('brand').filter(lambda row: above_1_pct_filter[row['brand']].iloc[1])
above_1_brands = autos['brand'].unique()

Brand Aggregation

We have decided to include in this analysis all brands with a share of over 1% of the total listings. We are also removing the value of sonstige_autos, which translates to other, and is therefore an aggregation of brands which represent less than 0.06% (the share of Lada, the least-represented brand) of the total.

In [10]:
price_avgs = {}
for el in above_1_brands:
    if el == 'sonstige_autos':
        continue
    price_avgs[el] = autos[autos['brand'] == el]['price'].mean()

sorted_avgs = sorted(price_avgs.items(), key=lambda x: x[1], reverse=True)
for el in sorted_avgs:
    print(el)
    
('audi', 6553.900535362074)
('bmw', 6373.01667735728)
('mercedes_benz', 6112.457078875342)
('skoda', 5823.146540027137)
('hyundai', 5244.239130434783)
('toyota', 4852.5504273504275)
('volkswagen', 4615.367980504344)
('nissan', 4388.680972818312)
('seat', 4114.786057692308)
('mazda', 3678.8311688311687)
('smart', 3596.40273556231)
('citroen', 3581.996894409938)
('ford', 3245.2543334383863)
('peugeot', 3083.2807525325616)
('opel', 2872.742914979757)
('fiat', 2818.1863406408092)
('renault', 2388.126034958602)

Average Price Analysis

It makes sense that the luxury German brands Audi, BMW, and Mercedes Benz take the top three spots in order. After that top cluster, we find a relatively smooth progression of primarially European and Japanese brands down to Renault in last place.

In [11]:
bmp_series = pd.Series(price_avgs)
print(bmp_series)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
print(df)
0              (audi, 6553.900535362074)
1                (bmw, 6373.01667735728)
2     (mercedes_benz, 6112.457078875342)
3             (skoda, 5823.146540027137)
4           (hyundai, 5244.239130434783)
5           (toyota, 4852.5504273504275)
6        (volkswagen, 4615.367980504344)
7            (nissan, 4388.680972818312)
8              (seat, 4114.786057692308)
9            (mazda, 3678.8311688311687)
10             (smart, 3596.40273556231)
11          (citroen, 3581.996894409938)
12            (ford, 3245.2543334383863)
13         (peugeot, 3083.2807525325616)
14             (opel, 2872.742914979757)
15            (fiat, 2818.1863406408092)
16          (renault, 2388.126034958602)
dtype: object
                            mean_price
0            (audi, 6553.900535362074)
1              (bmw, 6373.01667735728)
2   (mercedes_benz, 6112.457078875342)
3           (skoda, 5823.146540027137)
4         (hyundai, 5244.239130434783)
5         (toyota, 4852.5504273504275)
6      (volkswagen, 4615.367980504344)
7          (nissan, 4388.680972818312)
8            (seat, 4114.786057692308)
9          (mazda, 3678.8311688311687)
10           (smart, 3596.40273556231)
11        (citroen, 3581.996894409938)
12          (ford, 3245.2543334383863)
13       (peugeot, 3083.2807525325616)
14           (opel, 2872.742914979757)
15          (fiat, 2818.1863406408092)
16        (renault, 2388.126034958602)
In [16]:
mileage_avgs = {}
for el in above_1_brands:
    if el == 'sonstige_autos':
        continue
    mileage_avgs[el] = autos[autos['brand'] == el]['odometer_km'].mean()
    
odometer_series = pd.Series(mileage_avgs)

df['mean_mileage'] = odometer_series
print(df)
                mean_price   mean_mileage
audi           6553.900535  137313.327698
bmw            6373.016677  137308.103485
citroen        3581.996894  120217.391304
fiat           2818.186341  117032.040472
ford           3245.254333  125832.020170
hyundai        5244.239130  107554.347826
mazda          3678.831169  126443.001443
mercedes_benz  6112.457079  136684.498632
nissan         4388.680973  119613.733906
opel           2872.742915  129969.635628
peugeot        3083.280753  127261.215630
renault        2388.126035  128574.057038
seat           4114.786058  122968.750000
skoda          5823.146540  113398.914518
smart          3596.402736   99734.042553
toyota         4852.550427  117094.017094
volkswagen     4615.367981  131526.276754

Mileage Analysis

There doesn't seem to be much of a relationship between price and mileage when broken down by brand. With two exceptions (Hyundai and Smart) all of the brands were between 115,000 and 137,000. If we were to aggregate all the cars and look for a relationship between mileage and price, a clearer trend may emerge.