I have a .csv file whose rows contain varying numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
raises a
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
error. I know I can use the
names=my_cols
option in the read_csv call, but surely there is something more 'pythonic' than that? Also, this is not a duplicate question, because
error_bad_lines=False
causes rows to be skipped (which is not desired). The .csv looks like:
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
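For reference, the names= route mentioned above does not have to hard-code a column list. The following is only a minimal sketch, assuming the file can be scanned twice; max_fields and df_named are names introduced here for illustration, and the sample text stands in for the real file.

```python
import csv
import io

import pandas as pd

text = """Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""

# First pass: find the widest row with the standard csv module.
max_fields = max(len(row) for row in csv.reader(io.StringIO(text)))

# Second pass: supply that many column names so shorter rows are padded
# with missing values instead of raising a ParserError.
df_named = pd.read_csv(io.StringIO(text), header=None,
                       names=list(range(max_fields)))
print(df_named.shape)
```

With this approach both empty fields and padded fields come back as missing values, so the two cases can no longer be told apart.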
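Solution: OK, this was partly inspired by this related question: Pandas variable numbers of columns to binary matrix
So read the csv, but override the separator to a tab so that it does not try to split the names: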
In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df
Out[7]:
0
0 Anne,Beth,Caroline,Ernie,Frank,Hannah
1 Beth,Caroline,David,Ernie
2 Caroline,Hannah
3 David,,Anne,Beth,Caroline,Ernie
4 Ernie,Anne,Beth,Frank,George
5 Frank,Anne,Caroline,Hannah
6 George,
7 Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...
We can now use str.split with expand=True to expand the names into their own columns:
In[8]:
df[0].str.split(',', expand=True)
Out[8]:
          0         1         2         3         4       5      6       7
0      Anne      Beth  Caroline     Ernie     Frank  Hannah   None    None
1      Beth  Caroline     David     Ernie      None    None   None    None
2  Caroline    Hannah      None      None      None    None   None    None
3     David                Anne      Beth  Caroline   Ernie   None    None
4     Ernie      Anne      Beth     Frank    George    None   None    None
5     Frank      Anne  Caroline    Hannah      None    None   None    None
6    George                None      None      None    None   None    None
7    Hannah      Anne      Beth  Caroline     David   Ernie  Frank  George
So you only need to change the read_csv line to:
df = pd.read_csv(infile, header=None, sep='\t')
Then apply str.split as shown above.
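The two steps can be wrapped in a small helper. This is only a sketch: read_ragged_csv is a hypothetical name, not a pandas function, and it assumes the data contains no tab characters and no quoted fields with embedded commas.

```python
import io

import pandas as pd


def read_ragged_csv(source):
    """Read comma-separated rows of unequal length (hypothetical helper)."""
    # Step 1: use a separator assumed to be absent from the data, so each
    # line lands unsplit in a single column named 0.
    raw = pd.read_csv(source, header=None, sep="\t")
    # Step 2: split every line on the real delimiter; rows with fewer
    # fields are padded with missing values on the right.
    return raw[0].str.split(",", expand=True)


sample = io.StringIO("""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George""")

df = read_ragged_csv(sample)
print(df.shape)
```

Here source can be a file path or any file-like object accepted by read_csv. Fields that are empty in the file stay as empty strings, while the padding added by str.split is a missing value.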