创建数据
Series和python的列表类似。DataFrame则类似值为Series的字典。
create.py
#!/usr/bin/env python3# -*- coding: utf-8 -*-# create.pyimport pandas as pdprint("\n\n创建序列Series")s = pd.Series(['banana', 42])print(s)print("\n\n指定索引index创建序列Series")s = pd.Series(['Wes McKinney', 'Creator of Pandas'], index=['Person', 'Who'])print(s)# 注意:列名未必为执行的顺序,通常为按字母排序print("\n\n创建数据帧DataFrame")scientists = pd.DataFrame({ ' Name': ['Rosaline Franklin', 'William Gosset'], ' Occupation': ['Chemist', 'Statistician'], ' Born': ['1920-07-25', '1876-06-13'], ' Died': ['1958-04-16', '1937-10-16'], ' Age': [37, 61]})print(scientists)print("\n\n指定顺序(index和columns)创建数据帧DataFrame")scientists = pd.DataFrame( data={'Occupation': ['Chemist', 'Statistician'], 'Born': ['1920-07-25', '1876-06-13'], 'Died': ['1958-04-16', '1937-10-16'], 'Age': [37, 61]}, index=['Rosaline Franklin', 'William Gosset'], columns=['Occupation', 'Born', 'Died', 'Age'])print(scientists)
执行结果:
$ ./create.py 创建序列Series0 banana1 42dtype: object指定索引index创建序列SeriesPerson Wes McKinneyWho Creator of Pandasdtype: object创建数据帧DataFrame Name Occupation Born Died Age0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 371 William Gosset Statistician 1876-06-13 1937-10-16 61指定顺序(index和columns)创建数据帧DataFrame Occupation Born Died AgeRosaline Franklin Chemist 1920-07-25 1958-04-16 37William Gosset Statistician 1876-06-13 1937-10-16 61
Series
官方文档:http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
Series的属性
属性 | 描述 |
---|---|
loc | 使用索引值获取子集 |
iloc | 使用索引位置获取子集 |
dtype或dtypes | 类型 |
T | 转置 |
shape | 数据的尺寸 |
size | 元素的数量 |
values | ndarray或类似ndarray的Series |
Series的方法
方法 | 描述 |
---|---|
append | 连接2个或更多系列 |
corr | 计算与其他Series的关联 |
cov | 与其他Series计算协方差 |
describe | 计算汇总统计 |
drop duplicates | 返回一个没有重复项的Series |
equals | Series是否具有相同的元素 |
get values | 获取Series的值,与values属性相同 |
hist | 绘制直方图 |
min | 返回最小值 |
max | 返回最大值 |
mean | 返回算术平均值 |
median | 返回中位数 |
mode(s) | 返回mode(s) |
replace | 用指定值替换系列中的值 |
sample | 返回Series中值的随机样本 |
sort values | 排序 |
to frame | 转换为数据帧 |
transpose | 返回转置 |
unique | 返回numpy.ndarray唯一值 |
series.py
#!/usr/bin/python3# -*- coding: utf-8 -*-# CreateDate: 2018-3-14# series.pyimport pandas as pdimport numpy as npscientists = pd.DataFrame( data={'Occupation': ['Chemist', 'Statistician'], 'Born': ['1920-07-25', '1876-06-13'], 'Died': ['1958-04-16', '1937-10-16'], 'Age': [37, 61]}, index=['Rosaline Franklin', 'William Gosset'], columns=['Occupation', 'Born', 'Died', 'Age'])print(scientists)# 从数据帧(DataFrame)获取的行或者列为Seriesfirst_row = scientists.loc['William Gosset']print(type(first_row))print(first_row)# index和keys是一样的print(first_row.index)print(first_row.keys())print(first_row.values)print(first_row.index[0])print(first_row.keys()[0])# Pandas.Series和numpy.ndarray很类似ages = scientists['Age']print(ages)# 统计,更多参考http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statisticsprint(ages.mean())print(ages.min())print(ages.max())print(ages.std())scientists = pd.read_csv('../data/scientists.csv')ages = scientists['Age']print(ages)print(ages.mean())print(ages.describe())print(ages[ages > ages.mean()])print(ages > ages.mean())manual_bool_values = [True, True, False, False, True, True, False, False]print(ages[manual_bool_values])print(ages + ages)print(ages * ages)print(ages + 100)print(ages * 2)print(ages + pd.Series([1, 100]))# print(ages + np.array([1, 100])) 会报错,不同类型相加,大小一定要一样print(ages + np.array([1, 100, 1, 100, 1, 100, 1, 100]))# 排序: 默认有自动排序print(ages)rev_ages = ages.sort_index(ascending=False)print(rev_ages)print(ages * 2)print(ages + rev_ages)
执行结果
$ python3 series.py Occupation Born Died AgeRosaline Franklin Chemist 1920-07-25 1958-04-16 37William Gosset Statistician 1876-06-13 1937-10-16 61<class 'pandas.core.series.Series'>Occupation StatisticianBorn 1876-06-13Died 1937-10-16Age 61Name: William Gosset, dtype: objectIndex(['Occupation', 'Born', 'Died', 'Age'], dtype='object')Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')['Statistician' '1876-06-13' '1937-10-16' 61]OccupationOccupationRosaline Franklin 37William Gosset 61Name: Age, dtype: int6449.0376116.970562748477140 371 612 903 664 565 456 417 77Name: Age, dtype: int6459.125count 8.000000mean 59.125000std 18.325918min 37.00000025% 44.00000050% 58.50000075% 68.750000max 90.000000Name: Age, dtype: float641 612 903 667 77Name: Age, dtype: int640 False1 True2 True3 True4 False5 False6 False7 TrueName: Age, dtype: bool0 371 614 565 45Name: Age, dtype: int640 741 1222 1803 1324 1125 906 827 154Name: Age, dtype: int640 13691 37212 81003 43564 31365 20256 16817 5929Name: Age, dtype: int640 1371 1612 1903 1664 1565 1456 1417 177Name: Age, dtype: int640 741 1222 1803 1324 1125 906 827 154Name: Age, dtype: int640 38.01 161.02 NaN3 NaN4 NaN5 NaN6 NaN7 NaNdtype: float640 381 1612 913 1664 575 1456 427 177Name: Age, dtype: int640 371 612 903 664 565 456 417 77Name: Age, dtype: int647 776 415 454 563 662 901 610 37Name: Age, dtype: int640 741 1222 1803 1324 1125 906 827 154Name: Age, dtype: int640 741 1222 1803 1324 1125 906 827 154Name: Age, dtype: int64
数据帧(DataFrame)
DataFrame是最常见的Pandas对象,可认为是Python存储类似电子表格的数据的方式。Series多常见功能都包含在DataFrame中。
子集的方法
注意ix现在已经不推荐使用。
DataFrame常用的索引操作有:
方式 | 描述 |
---|---|
df[val] | 选择单个列 |
df [[ column1, column2, … ]] | 选择多个列 |
df.loc[val] | 选择行 |
df. loc [[ label1 , label2 ,…]] | 选择多行 |
df.loc[:, val] | 基于行index选择列 |
df.loc[val1, val2] | 选择行列 |
df.iloc[row number] | 基于行数选择行 |
df. iloc [[ row1, row2, …]] Multiple rows by row number | 基于行数选择多行 |
df.iloc[:, where] | 选择列 |
df.iloc[where_i, where_j] | 选择行列 |
df.at[label_i, label_j] | 选择值 |
df.iat[i, j] | 选择值 |
reindex method | 通过label选择多行或列 |
get_value, set_value | 通过label选择耽搁行或列 |
df[bool] | 选择行 |
df [[ bool1, bool2, …]] | 选择行 |
df[ start :stop: step ] | 基于行数选择行 |
#!/usr/bin/python3# -*- coding: utf-8 -*-# CreateDate: 2018-3-31# df.pyimport pandas as pdimport numpy as npscientists = pd.read_csv('../data/scientists.csv')print(scientists[scientists['Age'] > scientists['Age'].mean()])first_half = scientists[: 4]second_half = scientists[ 4 :]print(first_half)print(second_half)print(first_half + second_half)print(scientists * 2)
执行结果
#!/usr/bin/python3# -*- coding: utf-8 -*-# df.pyimport pandas as pdimport numpy as npscientists = pd.read_csv('../data/scientists.csv')print(scientists[scientists['Age'] > scientists['Age'].mean()])first_half = scientists[: 4]second_half = scientists[ 4 :]print(first_half)print(second_half)print(first_half + second_half)print(scientists * 2)
执行结果
$ python3 df.py Name Born Died Age Occupation1 William Gosset 1876-06-13 1937-10-16 61 Statistician2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse3 Marie Curie 1867-11-07 1934-07-04 66 Chemist7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician Name Born Died Age Occupation0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist1 William Gosset 1876-06-13 1937-10-16 61 Statistician2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse3 Marie Curie 1867-11-07 1934-07-04 66 Chemist Name Born Died Age Occupation4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist5 John Snow 1813-03-15 1858-06-16 45 Physician6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician Name Born Died Age Occupation0 NaN NaN NaN NaN NaN1 NaN NaN NaN NaN NaN2 NaN NaN NaN NaN NaN3 NaN NaN NaN NaN NaN4 NaN NaN NaN NaN NaN5 NaN NaN NaN NaN NaN6 NaN NaN NaN NaN NaN7 NaN NaN NaN NaN NaN Name Born \0 Rosaline FranklinRosaline Franklin 1920-07-251920-07-25 1 William GossetWilliam Gosset 1876-06-131876-06-13 2 Florence NightingaleFlorence Nightingale 1820-05-121820-05-12 3 Marie CurieMarie Curie 1867-11-071867-11-07 4 Rachel CarsonRachel Carson 1907-05-271907-05-27 5 John SnowJohn Snow 1813-03-151813-03-15 6 Alan TuringAlan Turing 1912-06-231912-06-23 7 Johann GaussJohann Gauss 1777-04-301777-04-30 Died Age Occupation 0 1958-04-161958-04-16 74 ChemistChemist 1 1937-10-161937-10-16 122 StatisticianStatistician 2 1910-08-131910-08-13 180 NurseNurse 3 1934-07-041934-07-04 132 ChemistChemist 4 1964-04-141964-04-14 112 BiologistBiologist 5 1858-06-161858-06-16 90 PhysicianPhysician 6 1954-06-071954-06-07 82 Computer ScientistComputer Scientist 7 1855-02-231855-02-23 154 MathematicianMathematician
修改列
#!/usr/bin/python3# -*- coding: utf-8 -*-# Author: xurongzhong#126.com wechat:pythontesting qq:37391319# qq群:144081101 591302926 567351477# CreateDate: 2018-06-07# change.pyimport pandas as pdimport numpy as npimport randomscientists = pd.read_csv('../data/scientists.csv')print(scientists['Born'].dtype)print(scientists['Died'].dtype)print(scientists.head())# 转为日期 参考:https://docs.python.org/3.5/library/datetime.htmlborn_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')# 增加列scientists['born_dt'], scientists['died_dt'] = (born_datetime, died_datetime)print(scientists.shape)print(scientists.head())random.seed(42)random.shuffle(scientists['Age']) # 此修改会作用于scientistsprint(scientists.head())scientists['age_days_dt'] = (scientists['died_dt'] - scientists['born_dt'])print(scientists.head())
执行结果:
$ python3 change.py objectobject Name Born Died Age Occupation0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist1 William Gosset 1876-06-13 1937-10-16 61 Statistician2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse3 Marie Curie 1867-11-07 1934-07-04 66 Chemist4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist(8, 7) Name Born Died Age Occupation born_dt \0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist 1920-07-25 1 William Gosset 1876-06-13 1937-10-16 61 Statistician 1876-06-13 2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse 1820-05-12 3 Marie Curie 1867-11-07 1934-07-04 66 Chemist 1867-11-07 4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist 1907-05-27 died_dt 0 1958-04-16 1 1937-10-16 2 1910-08-13 3 1934-07-04 4 1964-04-14 /usr/lib/python3.5/random.py:272: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrameSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy x[i], x[j] = x[j], x[i] Name Born Died Age Occupation born_dt \0 Rosaline Franklin 1920-07-25 1958-04-16 66 Chemist 1920-07-25 1 William Gosset 1876-06-13 1937-10-16 56 Statistician 1876-06-13 2 Florence Nightingale 1820-05-12 1910-08-13 41 Nurse 1820-05-12 3 Marie Curie 1867-11-07 1934-07-04 77 Chemist 1867-11-07 4 Rachel Carson 1907-05-27 1964-04-14 90 Biologist 1907-05-27 died_dt 0 1958-04-16 1 1937-10-16 2 1910-08-13 3 1934-07-04 4 1964-04-14 Name Born Died Age Occupation born_dt \0 Rosaline Franklin 1920-07-25 1958-04-16 66 Chemist 1920-07-25 1 William Gosset 1876-06-13 1937-10-16 56 Statistician 1876-06-13 2 Florence Nightingale 1820-05-12 1910-08-13 41 Nurse 1820-05-12 3 Marie Curie 1867-11-07 1934-07-04 77 Chemist 1867-11-07 4 Rachel Carson 1907-05-27 1964-04-14 90 Biologist 1907-05-27 died_dt age_days_dt 0 1958-04-16 13779 days 1 1937-10-16 22404 days 2 1910-08-13 32964 days 3 1934-07-04 24345 days 4 1964-04-14 20777 days
数据导入导出
out.py
#!/usr/bin/python3# -*- coding: utf-8 -*-# Author: china-testing#126.com wechat:pythontesting qq群:630011153# CreateDate: 2018-3-31# out.pyimport pandas as pdimport numpy as npimport randomscientists = pd.read_csv('../data/scientists.csv')names = scientists['Name']print(names)names.to_pickle('../output/scientists_names_series.pickle')scientists.to_pickle('../output/scientists_df.pickle')# .p, .pkl, .pickle 是常用的pickle文件扩展名scientist_names_from_pickle = pd.read_pickle('../output/scientists_df.pickle')print(scientist_names_from_pickle)names.to_csv('../output/scientist_names_series.csv')scientists.to_csv('../output/scientists_df.tsv', sep='\t')# 不输出行号scientists.to_csv('../output/scientists_df_no_index.csv', index=None)# Series可以转为df再输出成excel文件names_df = names.to_frame()names_df.to_excel('../output/scientists_names_series_df.xls')names_df.to_excel('../output/scientists_names_series_df.xlsx')scientists.to_excel('../output/scientists_df.xlsx', sheet_name='scientists', index=False)
执行结果:
$ python3 out.py 0 Rosaline Franklin1 William Gosset2 Florence Nightingale3 Marie Curie4 Rachel Carson5 John Snow6 Alan Turing7 Johann GaussName: Name, dtype: object Name Born Died Age Occupation0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist1 William Gosset 1876-06-13 1937-10-16 61 Statistician2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse3 Marie Curie 1867-11-07 1934-07-04 66 Chemist4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist5 John Snow 1813-03-15 1858-06-16 45 Physician6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
注意:序列一般是直接输出成excel文件
更多的输入输出方法:
方式 | 描述 |
---|---|
to_clipboard | 将数据保存到系统剪贴板进行粘贴 |
to_dense | 将数据转换为常规“密集”DataFrame |
to_dict | 将数据转换为Python字典 |
to_gbq | 将数据转换为Google BigQuery表格 |
toJidf | 将数据保存为分层数据格式(HDF) |
to_msgpack | 将数据保存到可移植的类似JSON的二进制文件中 |
toJitml | 将数据转换为HTML表格 |
tojson | 将数据转换为JSON字符串 |
toJatex | 将数据转换为LTEXtabular环境 |
to_records | 将数据转换为记录数组 |
to_string | 将DataFrame显示为stdout的字符串 |
to_sparse | 将数据转换为SparceDataFrame |
to_sql | 将数据保存到SQL数据库中 |
to_stata | 将数据转换为Stata dta文件 |
- 读CSV文件
read_csv.py
#!/usr/bin/python3# -*- coding: utf-8 -*-# Author: china-testing#126.com wechat:pythontesting QQ群:630011153# CreateDate: 2018-3-9# read_csv.pyimport pandas as pddf = pd.read_csv("1.csv", header=None) # 不读取列名print("df:")print(df)print("df.head():")print(df.head()) # head(self, n=5),默认为5行,类似的有tailprint("df.tail():")print(df.tail())df = pd.read_csv("1.csv") # 默认读取列名print("df:")print(df)df = pd.read_csv("1.csv", names=['号码','群号']) # 自定义列名print("df:")print(df)# 自定义列名,去掉第一行df = pd.read_csv("1.csv", skiprows=[0], names=['号码','群号'])print("df:")print(df)
执行结果:
df: 0 10 qq qqgroup1 37391319 1440811012 37391320 1440811023 37391321 1440811034 37391322 1440811045 37391323 1440811056 37391324 1440811067 37391325 1440811078 37391326 1440811089 37391327 14408110910 37391328 14408111011 37391329 14408111112 37391330 14408111213 37391331 14408111314 37391332 14408111415 37391333 144081115df.head(): 0 10 qq qqgroup1 37391319 1440811012 37391320 1440811023 37391321 1440811034 37391322 144081104df.tail(): 0 111 37391329 14408111112 37391330 14408111213 37391331 14408111314 37391332 14408111415 37391333 144081115df: qq qqgroup0 37391319 1440811011 37391320 1440811022 37391321 1440811033 37391322 1440811044 37391323 1440811055 37391324 1440811066 37391325 1440811077 37391326 1440811088 37391327 1440811099 37391328 14408111010 37391329 14408111111 37391330 14408111212 37391331 14408111313 37391332 14408111414 37391333 144081115df: 号码 群号0 qq qqgroup1 37391319 1440811012 37391320 1440811023 37391321 1440811034 37391322 1440811045 37391323 1440811056 37391324 1440811067 37391325 1440811078 37391326 1440811089 37391327 14408110910 37391328 14408111011 37391329 14408111112 37391330 14408111213 37391331 14408111314 37391332 14408111415 37391333 144081115df: 号码 群号0 37391319 1440811011 37391320 1440811022 37391321 1440811033 37391322 1440811044 37391323 1440811055 37391324 1440811066 37391325 1440811077 37391326 1440811088 37391327 1440811099 37391328 14408111010 37391329 14408111111 37391330 14408111212 37391331 14408111313 37391332 14408111414 37391333 144081115
- 写CSV文件
#!/usr/bin/python3# -*- coding: utf-8 -*-# write_csv.pyimport pandas as pddata ={'qq': [37391319,37391320], 'group':[1,2]}df = pd.DataFrame(data=data, columns=['qq','group'])df.to_csv('2.csv',index=False)
读写excel和csv类似,不过要改用read_excel来读,excel_summary_demo, 提供了多个excel求和的功能,可以做为excel读写的实例,这里不再赘述。
使用pandas处理excel有更多的pandas处理excel的资料,深入学习可以参考。
- 讨论qq群144081101 591302926 567351477 钉钉免费群21745728
- 本文涉及的python测试开发库 请在github上点赞,谢谢!
- 本文相关书籍下载
- 源码下载
- 本文英文版书籍下载