大家好,我是小小明。上次我们的云朵君同学在不严谨的测试下,得出了join可以比merge快5倍的结论。虽然默认参数用法下,join确实比merge快一些,但实际上join并不见得会比merge快。链接:《再见!不再使用 Pandas 中的 Merge 方法 》
下面我们继续深入测试,首先构造如下生成测试数据的方法:
import pandas as pd
import numpy as np
from time import time
def create_df ( col_prefix, nrows= 1000_0000 ) :
data = np. random. randint( 0 , 255 , size= ( nrows, 4 ) )
df = pd. DataFrame( data, columns= map ( f" { col_prefix} _{{}}" . format , range ( 4 ) ) , dtype= "uint8" )
df[ "idx" ] = df. index. values
return df
生成的测试数据格式如下:
print ( create_df( "c" ) )
c_0 c_1 c_2 c_3 idx
0 111 32 70 73 0
1 12 78 38 205 1
2 9 52 203 171 2
3 200 22 99 19 3
4 124 205 233 65 4
... ... ... ... ... ...
9999995 41 136 246 162 9999995
9999996 209 16 73 51 9999996
9999997 132 41 46 234 9999997
9999998 163 101 120 184 9999998
9999999 116 189 205 40 9999999
[10000000 rows x 5 columns]
默认生成1千万条数据,连接字段idx确保唯一。
初步测试
我们按照原文的逻辑简单测试一下:
df1 = create_df( "x" )
df2 = create_df( "y" )
start = time( )
df1. merge( df2, on= "idx" )
print ( f"merge耗时: { time( ) - start: .2f } s" )
start = time( )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
df1. join( df2, how= "inner" )
print ( f"join耗时: { time( ) - start: .2f } s" )
merge 耗时:4.68s
join 耗时:1.13s
确实默认参数情况下join比merge快4倍。
join函数的本质
但是为什么join会比merge快呢?pandas开发者总不能开发两套完全不相同的处理逻辑把?所以我们点开源码一探究竟。
下面我的pandas环境是基于1.5.1 版本进行测试。
这里我使用pycharm通过Ctrl+点击对应函数查看源码,首先看看join的核心源码:
def join (
self,
other: DataFrame | Series | list [ DataFrame | Series] ,
on: IndexLabel | None = None ,
how: str = "left" ,
lsuffix: str = "" ,
rsuffix: str = "" ,
sort: bool = False ,
validate: str | None = None ,
) - > DataFrame:
return self. _join_compat(
other,
on= on,
how= how,
lsuffix= lsuffix,
rsuffix= rsuffix,
sort= sort,
validate= validate,
)
def _join_compat (
self,
other: DataFrame | Series,
on: IndexLabel | None = None ,
how: str = "left" ,
lsuffix: str = "" ,
rsuffix: str = "" ,
sort: bool = False ,
validate: str | None = None ,
) :
from pandas. core. reshape. merge import merge
return merge(
self,
other,
left_on= on,
how= how,
left_index= on is None ,
right_index= True ,
suffixes= ( lsuffix, rsuffix) ,
sort= sort,
validate= validate,
)
咦,join不就是最终依然调用merge方法吗?
再看看merge的源码:
def merge (
self,
right: DataFrame | Series,
how: str = "inner" ,
on: IndexLabel | None = None ,
left_on: IndexLabel | None = None ,
right_on: IndexLabel | None = None ,
left_index: bool = False ,
right_index: bool = False ,
sort: bool = False ,
suffixes: Suffixes = ( "_x" , "_y" ) ,
copy: bool = True ,
indicator: bool = False ,
validate: str | None = None ,
) - > DataFrame:
from pandas. core. reshape. merge import merge
return merge(
self,
right,
how= how,
on= on,
left_on= left_on,
right_on= right_on,
left_index= left_index,
right_index= right_index,
sort= sort,
suffixes= suffixes,
copy= copy,
indicator= indicator,
validate= validate,
)
merge也是调用了pandas.core.reshape.merge
中的merge方法。
**结论:**join本质上是merge传入了left_index和right_index为True。
再次测试
基于上述结论,我们再次测试一下:
df1 = create_df( "x" )
df2 = create_df( "y" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
df1. merge( df2, on= "idx" )
print ( f"merge on耗时: { time( ) - start: .2f } s" )
start = time( )
df1. merge( df2, left_index= True , right_index= True )
print ( f"merge index耗时: { time( ) - start: .2f } s" )
start = time( )
df1. join( df2, how= "inner" )
print ( f"join耗时: { time( ) - start: .2f } s" )
merge on耗时:4.30s
merge index耗时:1.15s
join耗时:0.11s
merge index依然比merge on快了4倍,可是join为啥又比merge index快10倍?按照前面的分析,join多调用了2层,应该比merge index更慢才对呀?难道是因为缓存?
现在我们调换一下merge index和join的执行顺序再试试:
df1 = create_df( "x" )
df2 = create_df( "y" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
df1. join( df2, how= "inner" )
print ( f"join耗时: { time( ) - start: .2f } s" )
start = time( )
df1. merge( df2, left_index= True , right_index= True )
print ( f"merge index耗时: { time( ) - start: .2f } s" )
join耗时:1.12s
merge index耗时:0.13s
结果这次变成merge index比join快10倍了。那么我们干脆让数据都重新生成后再测试:
df1 = create_df( "x" )
df2 = create_df( "y" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
df1. join( df2, how= "inner" )
print ( f"join耗时: { time( ) - start: .2f } s" )
df1 = create_df( "x" )
df2 = create_df( "y" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
df1. merge( df2, left_index= True , right_index= True )
print ( f"merge index耗时: { time( ) - start: .2f } s" )
join耗时:1.21s
merge index耗时:1.10s
这次相差不大,merge index比join略快一点,符合对源码的认知。但是明显第二次join或merge时,会比第一次快10倍,初步推测肯定是存在某种缓存。
缓存存在的位置
在join函数打上断点后,一步步内层跟,发现如下缓存位置:
而self是Int64Index
对象,说明索引对象缓存是否具备唯一性的属性。索引是否具备唯一性决定了索引连接的形式:
我们可以再对比前先判断索引唯一性再测试:
df1 = create_df( "x" )
df2 = create_df( "y" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
print ( df1. index. is_unique, df2. index. is_unique)
print ( f"索引判断唯一性耗时: { time( ) - start: .2f } s" )
time1s = [ ]
time2s = [ ]
repeat = 10
for _ in range ( repeat) :
start = time( )
df1. join( df2)
time2s. append( time( ) - start)
start = time( )
df1. merge( df2, left_index= True , right_index= True )
time1s. append( time( ) - start)
print ( f"merge index耗时: { sum ( time1s) / repeat : .2f } s" )
print ( f"join耗时: { sum ( time2s) / repeat : .2f } s" )
True True
索引判断唯一性耗时:0.99s
merge index耗时:0.12s
join耗时:0.12s
可以看到索引判断唯一性耗时占了大头,取消这部分耗时后,两者耗时相差无几。
索引不唯一时速度对比
以上对比都是基于索引唯一的情景下,这次我们重新设计数据,假设索引存在五分之一的重复:
import pandas as pd
import numpy as np
from time import time
def create_df ( col_prefix, nrows= 100_0000 ) :
data = np. random. randint( 0 , 255 , size= ( nrows, 4 ) )
df = pd. DataFrame( data, columns= map ( f" { col_prefix} _{{}}" . format , range ( 4 ) ) )
df[ "idx" ] = np. random. choice( np. arange( nrows * 4 // 5 ) , nrows)
return df
再次测试:
start = time( )
df1. merge( df2, on= "idx" )
print ( f"merge on耗时: { time( ) - start: .2f } s" )
df1. set_index( "idx" , inplace= True )
df2. set_index( "idx" , inplace= True )
start = time( )
print ( df1. index. is_unique, df2. index. is_unique)
print ( f"索引判断唯一性耗时: { time( ) - start: .2f } s" )
start = time( )
df1. merge( df2, left_index= True , right_index= True )
print ( f"merge index耗时: { time( ) - start: .2f } s" )
start = time( )
df1. join( df2, how= "inner" )
print ( f"join耗时: { time( ) - start: .2f } s" )
df1. reset_index( inplace= True )
df2. reset_index( inplace= True )
start = time( )
df1. merge( df2, on= "idx" )
print ( f"merge on耗时: { time( ) - start: .2f } s" )
merge on耗时:0.53s
False False
索引判断唯一性耗时:0.06s
merge index耗时:0.91s
join耗时:0.93s
merge on耗时:0.50s
可以看到,在索引存在重复时,直接merge on反而更快。
结论
join等价于merge传入left_index和right_index参数为True 索引唯一时,join比merge默认的on指定字段速度快 索引不唯一时,merge默认的on指定字段比join快