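Introduction
Pandas is an open-source Python library for data analysis and one of the most widely used tools in that field. It handles spreadsheet-like data and supports fast loading, manipulation, alignment, merging, and similar operations. To bring these capabilities to Python, pandas introduces two data types: Series and DataFrame. A DataFrame represents a whole spreadsheet or rectangular dataset, while a Series is a single column of a DataFrame. A DataFrame can also be thought of as a dictionary or collection of Series.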
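The relationship between the two types can be sketched with a few lines of code. The values and the variable names (`country`, `year`, `toy`) below are made up for illustration and are not taken from the gapminder file used later.

```python
import pandas as pd

# Toy data made up for illustration; not taken from the gapminder file.
country = pd.Series(["Afghanistan", "Angola", "Mongolia"])
year = pd.Series([1952, 1952, 1967])

# A DataFrame can be assembled from a dict that maps column names to Series.
toy = pd.DataFrame({"country": country, "year": year})

print(type(toy))          # the whole table is a DataFrame
print(type(toy["year"]))  # selecting one column gives back a Series
print(toy.shape)          # (rows, columns)
```

Building the table from a dict of Series is only one of several constructors; it is used here because it mirrors the "collection of Series" view described above.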
Loading data
load.py
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # load.py
    import pandas as pd

    # The file is tab-separated, so the separator must be given explicitly.
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    print("\n\nFirst five rows")
    print(df.head())
    print("\n\nType")
    print(type(df))
    print("\n\nShape")
    print(df.shape)
    print("\n\nColumn names")
    print(df.columns)
    print("\n\ndtypes (per column)")
    print(df.dtypes)
    print("\n\nSummary info")
    # info() prints its report itself and returns None, so it is not wrapped in print().
    df.info()

Output

    $ ./load.py

    First five rows
           country continent  year  lifeExp       pop   gdpPercap
    0  Afghanistan      Asia  1952   28.801   8425333  779.445314
    1  Afghanistan      Asia  1957   30.332   9240934  820.853030
    2  Afghanistan      Asia  1962   31.997  10267083  853.100710
    3  Afghanistan      Asia  1967   34.020  11537966  836.197138
    4  Afghanistan      Asia  1972   36.088  13079460  739.981106

    Type
    <class 'pandas.core.frame.DataFrame'>

    Shape
    (1704, 6)

    Column names
    Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

    dtypes (per column)
    country       object
    continent     object
    year           int64
    lifeExp      float64
    pop            int64
    gdpPercap    float64
    dtype: object

    Summary info
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1704 entries, 0 to 1703
    Data columns (total 6 columns):
    country      1704 non-null object
    continent    1704 non-null object
    year         1704 non-null int64
    lifeExp      1704 non-null float64
    pop          1704 non-null int64
    gdpPercap    1704 non-null float64
    dtypes: float64(2), int64(2), object(2)
    memory usage: 80.0+ KB
Pandas type | Python type |
---|---|
object | string |
int64 | int |
float64 | float |
datetime64 | datetime |
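The mapping in the table can be checked on a small frame. The following is a minimal sketch with made-up rows; the `measured` column and the `sample` name are hypothetical and exist only to show a datetime column. Depending on the pandas version, the text column may be reported as `object` or as a dedicated string dtype, so the sketch does not rely on that.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1967],
    "lifeExp": [30.0, 51.5],
    "measured": pd.to_datetime(["1952-01-01", "1967-01-01"]),
})

# One dtype per column, as in the table above.
print(sample.dtypes)

# Individual cells can be converted to the corresponding plain Python types.
first_year = int(sample.loc[0, "year"])
first_life = float(sample.loc[0, "lifeExp"])
print(type(first_year), type(first_life))
```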
Rows, columns, and cells
col.py
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # col.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Column operations
    country_df = df['country']  # select a single column by name
    print("\n\nFirst 5 rows of the column")
    print(country_df.head())
    print("\n\nLast 5 rows of the column")
    print(country_df.tail())

    country_df_dot = df.country  # select a column with dot notation
    print("\n\nSelecting a column with dot notation")
    print(country_df_dot.head())

    subset = df[['country', 'continent', 'year']]  # select several columns
    print("\n\nSelecting several columns")
    print(subset.head())

Output

    $ ./col.py

    First 5 rows of the column
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object

    Last 5 rows of the column
    1699    Zimbabwe
    1700    Zimbabwe
    1701    Zimbabwe
    1702    Zimbabwe
    1703    Zimbabwe
    Name: country, dtype: object

    Selecting a column with dot notation
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object

    Selecting several columns
           country continent  year
    0  Afghanistan      Asia  1952
    1  Afghanistan      Asia  1957
    2  Afghanistan      Asia  1962
    3  Afghanistan      Asia  1967
    4  Afghanistan      Asia  1972
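Dot notation is convenient but has a caveat that matters for this dataset: a column whose name matches a DataFrame attribute or method, such as `pop`, cannot be reached that way. A minimal sketch with made-up values (only the column names mirror the gapminder file):

```python
import pandas as pd

# Made-up values; only the column names mirror the gapminder file.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "pop": [1000, 2000],
})

# Bracket access always returns the column.
pop_col = toy["pop"]
print(type(pop_col))

# Dot access resolves to the DataFrame.pop method, not the column.
print(callable(toy.pop))

# For an ordinary column name, dot access gives the Series.
print(callable(toy.country))
```

For that reason bracket notation is the safer default whenever column names come from external data.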
row.py
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # row.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Row operations; note that df.loc[-1] is invalid because no row has the label -1
    print("\n\nFirst row")
    print(df.loc[0])

    print("\n\nNumber of rows")
    number_of_rows = df.shape[0]
    print(number_of_rows)

    last_row_index = number_of_rows - 1
    print("\n\nLast row")
    print(df.loc[last_row_index])

    print("\n\nLast row via tail")
    print(df.tail(n=1))

    subset_loc = df.loc[0]
    subset_head = df.head(n=1)
    print("\n\nloc with a single label returns a Series")
    print(type(subset_loc))
    print("\n\nhead returns a DataFrame")
    print(type(subset_head))

    print("\n\nloc selecting three rows returns a DataFrame")
    print(df.loc[[0, 99, 999]])
    print(type(df.loc[[0, 99, 999]]))

    print("\n\niloc selecting the first row")
    print(df.iloc[0])
    print("\n\niloc selecting three rows")
    print(df.iloc[[0, 99, 999]])

Output

    $ ./row.py

    First row
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object

    Number of rows
    1704

    Last row
    country      Zimbabwe
    continent      Africa
    year             2007
    lifeExp        43.487
    pop          12311143
    gdpPercap     469.709
    Name: 1703, dtype: object

    Last row via tail
           country continent  year  lifeExp       pop   gdpPercap
    1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298

    loc with a single label returns a Series
    <class 'pandas.core.series.Series'>

    head returns a DataFrame
    <class 'pandas.core.frame.DataFrame'>

    loc selecting three rows returns a DataFrame
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
    <class 'pandas.core.frame.DataFrame'>

    iloc selecting the first row
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object

    iloc selecting three rows
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
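The remark that `df.loc[-1]` is invalid comes down to labels versus positions: `loc` looks up index labels, while `iloc` counts positions and accepts negative values the way Python lists do. A small sketch on a made-up three-row frame (the `toy` and `loc_failed` names are illustrative):

```python
import pandas as pd

# Made-up frame with the default RangeIndex (labels 0, 1, 2).
toy = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})

# iloc works by position, so -1 means "last row".
last_row = toy.iloc[-1]
print(last_row)

# loc works by label; there is no label -1, so the lookup raises KeyError.
try:
    toy.loc[-1]
except KeyError as err:
    print("KeyError:", err)
    loc_failed = True
else:
    loc_failed = False
```

This is why the script above computes `number_of_rows - 1` before calling `loc`, whereas `iloc[-1]` or `tail(n=1)` reach the last row directly.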
mix.py
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # Author: xurongzhong#126.com wechat:pythontesting qq:37391319
    # QQ groups: 144081101 591302926 567351477
    # CreateDate: 2018-06-07
    # mix.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Combined row and column selection
    print("\n\nloc selecting a cell")
    print(df.loc[42, 'country'])
    print("\n\niloc selecting a cell")
    print(df.iloc[42, 0])
    print("\n\nloc selecting a subset")
    print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])

Output

    $ ./mix.py

    loc selecting a cell
    Angola

    iloc selecting a cell
    Angola

    loc selecting a subset
             country  lifeExp    gdpPercap
    0    Afghanistan   28.801   779.445314
    99    Bangladesh   43.453   721.186086
    999     Mongolia   51.253  1226.041130
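Combined row and column selection also accepts slices, and here `loc` and `iloc` behave differently: a label slice includes its end point, a position slice does not. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [1952, 1957, 1962, 1967],
    "lifeExp": [40.0, 41.0, 42.0, 43.0],
})

# loc slices by label and includes both end points: rows 0, 1 and 2.
by_label = toy.loc[0:2, ["country", "lifeExp"]]

# iloc slices by position and excludes the stop: rows 0-1, columns 0-1.
by_position = toy.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```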
Grouping and aggregation
group.py
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # group.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    print("\n\nMean life expectancy by year")
    print(df.groupby('year')['lifeExp'].mean())

    # The same computation, one step at a time
    print("\n\nGroup by year")
    grouped_year_df = df.groupby('year')
    print(type(grouped_year_df))
    print(grouped_year_df)

    print("\n\nlifeExp")
    grouped_year_df_lifeExp = grouped_year_df['lifeExp']
    print(type(grouped_year_df_lifeExp))
    print(grouped_year_df_lifeExp)

    print("\n\nMean life expectancy by year (step by step)")
    mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
    print(mean_lifeExp_by_year)

    print("\n\nGroup by year and continent")
    print(df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean())

    print("\n\nNumber of countries per continent")
    print(df.groupby('continent')['country'].nunique())

Output

    $ ./group.py

    Mean life expectancy by year
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64

    Group by year
    <class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
    <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f0e2b0c89e8>

    lifeExp
    <class 'pandas.core.groupby.groupby.SeriesGroupBy'>
    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f0e151e2f28>

    Mean life expectancy by year (step by step)
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64

    Group by year and continent
                      lifeExp     gdpPercap
    year continent
    1952 Africa     39.135500   1252.572466
         Americas   53.279840   4079.062552
         Asia       46.314394   5195.484004
         Europe     64.408500   5661.057435
         Oceania    69.255000  10298.085650
    1957 Africa     41.266346   1385.236062
         Americas   55.960280   4616.043733
         Asia       49.318544   5787.732940
         Europe     66.703067   6963.012816
         Oceania    70.295000  11598.522455
    1962 Africa     43.319442   1598.078825
         Americas   58.398760   4901.541870
         Asia       51.563223   5729.369625
         Europe     68.539233   8365.486814
         Oceania    71.085000  12696.452430
    1967 Africa     45.334538   2050.363801
         Americas   60.410920   5668.253496
         Asia       54.663640   5971.173374
         Europe     69.737600  10143.823757
         Oceania    71.310000  14495.021790
    1972 Africa     47.450942   2339.615674
         Americas   62.394920   6491.334139
         Asia       57.319269   8187.468699
         Europe     70.775033  12479.575246
         Oceania    71.910000  16417.333380
    1977 Africa     49.580423   2585.938508
         Americas   64.391560   7352.007126
         Asia       59.610556   7791.314020
         Europe     71.937767  14283.979110
         Oceania    72.855000  17283.957605
    1982 Africa     51.592865   2481.592960
         Americas   66.228840   7506.737088
         Asia       62.617939   7434.135157
         Europe     72.806400  15617.896551
         Oceania    74.290000  18554.709840
    1987 Africa     53.344788   2282.668991
         Americas   68.090720   7793.400261
         Asia       64.851182   7608.226508
         Europe     73.642167  17214.310727
         Oceania    75.320000  20448.040160
    1992 Africa     53.629577   2281.810333
         Americas   69.568360   8044.934406
         Asia       66.537212   8639.690248
         Europe     74.440100  17061.568084
         Oceania    76.945000  20894.045885
    1997 Africa     53.598269   2378.759555
         Americas   71.150480   8889.300863
         Asia       68.020515   9834.093295
         Europe     75.505167  19076.781802
         Oceania    78.190000  24024.175170
    2002 Africa     53.325231   2599.385159
         Americas   72.422040   9287.677107
         Asia       69.233879  10174.090397
         Europe     76.700600  21711.732422
         Oceania    79.740000  26938.778040
    2007 Africa     54.806038   3089.032605
         Americas   73.608120  11003.031625
         Asia       70.728485  12473.026870
         Europe     77.648600  25054.481636
         Oceania    80.719500  29810.188275

    Number of countries per continent
    continent
    Africa      52
    Americas    25
    Asia        33
    Europe      30
    Oceania      2
    Name: country, dtype: int64
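What `groupby` does can be made explicit by performing the split, apply, and combine steps by hand and comparing the result. The sketch below uses a made-up four-row frame; the names `manual`, `grouped`, and `summary` are illustrative, and the `agg` call with named outputs assumes a reasonably recent pandas (0.25 or later), which is newer than the version that produced the output above.

```python
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "country": ["A", "B", "C", "C"],
    "lifeExp": [40.0, 50.0, 70.0, 72.0],
})

# Split-apply-combine by hand: filter each group, take the mean, collect results.
manual = {}
for continent in toy["continent"].unique():
    subset = toy[toy["continent"] == continent]
    manual[continent] = subset["lifeExp"].mean()

# The same computation expressed with groupby.
grouped = toy.groupby("continent")["lifeExp"].mean()

# Several aggregations at once, using named outputs.
summary = toy.groupby("continent").agg(
    mean_lifeExp=("lifeExp", "mean"),
    n_countries=("country", "nunique"),
)

print(manual)
print(grouped)
print(summary)
```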
Basic plotting
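    import pandas as pd
    import matplotlib.pyplot as plt  # pandas plotting is built on matplotlib

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Mean life expectancy across all countries, per year
    global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
    print(global_yearly_life_expectancy)

    global_yearly_life_expectancy.plot()
    plt.show()  # needed when run as a script; notebooks render the plot inline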