pd.merge(..... ... ... how)

by: JGSmith123, 8 years ago


I am running into a strange issue when I use "how" when mergine in pandas. It is adding values after a decimal point.

Using the example in the Pandas tutorial, I have the following code without reference to "how"


import pandas as pd

df1 = pd.DataFrame({'Year':[2001, 2002, 2003, 2004],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]})


df3 = pd.DataFrame({'Year':[2001,2003, 2004, 2005],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]})
merge = pd.merge(df1, df3, on = "Year")
merge.set_index("Year", inplace = True)
print(merge)


And the following df is returned:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2003         2                65            52             8
2004         2                55            50             9

But look at what happens as soon as I add "how" to the merge. This happens with inner, outer, left, and right; but, different values posses decimal points depending on which way I merge it


import pandas as pd

df1 = pd.DataFrame({'Year':[2001, 2002, 2003, 2004],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]})


df3 = pd.DataFrame({'Year':[2001,2003, 2004, 2005],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]})
merge = pd.merge(df1, df3, on = "Year", how="right")
merge.set_index("Year", inplace = True)
print(merge)


This code produces decimal points on Year, Int_Rate, & US_GDP_Thousands
        Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                          
2001.0       2.0              50.0            50             7
2003.0       2.0              65.0            52             8
2004.0       2.0              55.0            50             9
2005.0       NaN               NaN            53             6

But if I change it to "left" I get this:
      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50          50.0           7.0
2002         3                55           NaN           NaN
2003         2                65          52.0           8.0
2004         2                55          50.0           9.0

The decimals are on Low_tier_HPI & Unemployment

Using "inner" does not produce any decimal points but with "outer" there is a return of the decimals, but this time it has taken over the entire world and infected every column with those pesky decimals....


        Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                          
2001.0       2.0              50.0          50.0           7.0
2002.0       3.0              55.0           NaN           NaN
2003.0       2.0              65.0          52.0           8.0
2004.0       2.0              55.0          50.0           9.0
2005.0       NaN               NaN          53.0           6.0


Any thoughts on how to protect my beautiful little columns from the decimal demons?



You must be logged in to post. Please login or register an account.