Notes on common pandas commands

Creating

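pandas has two main data types: Series and DataFrame. A Series is a one-dimensional labelled array; a DataFrame is a two-dimensional table.
A DataFrame can be created from an Excel file, a dict, or a SQL query.
df = pd.DataFrame(data)  # data is a dict mapping column names to lists of values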
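
As a minimal sketch of the dict and SQL routes mentioned above: the table name `vm`, the column names, and the values are made up for illustration, and an in-memory SQLite database stands in for whatever database is actually used.

```python
import sqlite3

import pandas as pd

# A Series is a labelled one-dimensional array.
s = pd.Series([1, 2, 3], name="cpu")

# A dict of equal-length lists becomes a DataFrame; the keys become column names.
data = {"name": ["vm1", "vm2", "vm3"], "cpu": [2, 4, 8]}  # hypothetical data
df_from_dict = pd.DataFrame(data)

# A SQL query result can be loaded with read_sql_query.
# An in-memory SQLite database is used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm (name TEXT, cpu INTEGER)")
conn.executemany("INSERT INTO vm VALUES (?, ?)", [("vm1", 2), ("vm2", 4)])
conn.commit()
df_from_sql = pd.read_sql_query("SELECT name, cpu FROM vm", conn)
conn.close()
```

In both cases the column names come from the source: the dict keys in the first case, the selected SQL columns in the second.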

# Create a DataFrame from an Excel file
xlsx = pd.ExcelFile('/tmp/vm_in.xlsx')
df = pd.read_excel(xlsx, 'vm_in')
In [5]: dates = pd.date_range('20130101', periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]: df
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Selecting

Select one column: df['column1']
Select multiple columns: df.loc[:, ['column1', 'column2']]
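
The two selections above can be sketched on a small frame. The column names follow the notes; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with the column names used in the notes.
df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

# A single column label returns a Series.
one = df["column1"]

# .loc with all rows (:) and a list of labels returns a DataFrame.
many = df.loc[:, ["column1", "column2"]]

# Passing a list of labels directly to [] selects the same columns.
also_many = df[["column1", "column2"]]
```

Selecting a single label gives a Series, while a list of labels always gives a DataFrame, even when the list has only one entry.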

Modifying

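Convert a Series of numeric strings to numbers: pd.to_numeric(df['column1']). This returns a new Series, so assign it back to the column to keep the result.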
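
A short sketch of that conversion, using invented values. The `errors="coerce"` variant is an addition not covered in the notes; it is shown because string columns often contain entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings.
df = pd.DataFrame({"column1": ["1", "2", "3.5"]})

# to_numeric returns a new Series; assign it back to replace the column.
df["column1"] = pd.to_numeric(df["column1"])

# With errors="coerce", values that cannot be parsed become NaN
# instead of raising an exception.
messy = pd.Series(["10", "abc", "30"])
cleaned = pd.to_numeric(messy, errors="coerce")
```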
Pass one column of each row to a function and store the result in a new column:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6,4), index = pd.date_range('20201101', periods=6), columns=list('ABCD'))

>>> df
                   A         B         C         D
2020-11-01  0.172708  0.354707  0.092118  0.626022
2020-11-02  0.556920  0.769790  0.600710  0.950671
2020-11-03  0.029216  0.010799  0.802327  0.374498
2020-11-04  0.878683  0.443748  0.598541  0.016172
2020-11-05  0.099418  0.869219  0.478047  0.738399
2020-11-06  0.038316  0.332397  0.851113  0.926423

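def add1(num):
    return num + 1

# Apply add1 to column D of each row and store the result in a new column
df['D+1'] = df.apply(lambda x: add1(x['D']), axis=1)

# A function that returns multiple values
def add1(num1, num2):
    total = num1 + num2
    avg = total / 2
    return total, avg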
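
The notes stop at the definition of the multi-value function. One possible way to store its results in two new columns is sketched below, assuming `result_type="expand"` is acceptable; the function name `total_and_avg`, the new column names, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical data; any two numeric columns would do.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [3.0, 4.0, 5.0]})


def total_and_avg(num1, num2):
    """Return the sum and the mean of two numbers as a tuple."""
    total = num1 + num2
    avg = total / 2
    return total, avg


# result_type="expand" turns each returned tuple into separate columns.
expanded = df.apply(
    lambda row: total_and_avg(row["A"], row["B"]),
    axis=1,
    result_type="expand",
)
expanded.columns = ["total", "avg"]

# Attach the new columns to the original frame (the indexes match).
df = df.join(expanded)
```

For simple arithmetic like this, plain column operations such as `df["A"] + df["B"]` are usually faster than `apply`. The row-wise form is mainly useful when the function cannot be vectorised.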

References

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#min