Pandas Data Analysis Quick Start Tutorial 1: Appetizer

Introduction

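Pandas is an open-source Python library for data analysis and one of the most widely used tools in the field. It handles spreadsheet-like data and supports fast loading, manipulation, alignment, merging, and more. To bring these capabilities to Python, pandas provides two data types: Series and DataFrame. A DataFrame represents an entire spreadsheet or rectangular data set, while a Series is a single column of a DataFrame. A DataFrame can also be thought of as a dictionary or collection of Series.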
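To make the relationship between the two types concrete, here is a minimal sketch that builds a small DataFrame from a dict of lists. The column names and values are made up for illustration and do not come from the tutorial's data set.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
data = {
    "country": ["A", "B", "C"],
    "year": [2000, 2001, 2002],
    "score": [1.5, 2.5, 3.5],
}
df = pd.DataFrame(data)

# Selecting a single column yields a Series.
score = df["score"]
print(type(df))
print(type(score))

# The Series shares the row index of its parent DataFrame.
print(score.index.equals(df.index))
```

Each key of the dict becomes a column, which is why a DataFrame can be read as a dictionary of Series.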

Loading data

load.py

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # load.py
    import pandas as pd

    # The file is tab-separated, so pass sep='\t'
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    print("\n\nFirst five rows")
    print(df.head())
    print("\n\nType")
    print(type(df))
    print("\n\nShape")
    print(df.shape)
    print("\n\nColumn names")
    print(df.columns)
    print("\n\ndtypes (per column)")
    print(df.dtypes)
    print("\n\nSummary info")
    df.info()  # info() prints its summary itself and returns None, so no print() is needed

Output

    $ ./load.py


    First five rows
           country continent  year  lifeExp       pop   gdpPercap
    0  Afghanistan      Asia  1952   28.801   8425333  779.445314
    1  Afghanistan      Asia  1957   30.332   9240934  820.853030
    2  Afghanistan      Asia  1962   31.997  10267083  853.100710
    3  Afghanistan      Asia  1967   34.020  11537966  836.197138
    4  Afghanistan      Asia  1972   36.088  13079460  739.981106


    Type
    <class 'pandas.core.frame.DataFrame'>


    Shape
    (1704, 6)


    Column names
    Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')


    dtypes (per column)
    country       object
    continent     object
    year           int64
    lifeExp      float64
    pop            int64
    gdpPercap    float64
    dtype: object


    Summary info
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1704 entries, 0 to 1703
    Data columns (total 6 columns):
    country      1704 non-null object
    continent    1704 non-null object
    year         1704 non-null int64
    lifeExp      1704 non-null float64
    pop          1704 non-null int64
    gdpPercap    1704 non-null float64
    dtypes: float64(2), int64(2), object(2)
    memory usage: 80.0+ KB

Pandas dtypes correspond to Python types roughly as follows:

    Pandas type    Python type
    object         string
    int64          int
    float64        float
    datetime64     datetime
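The mapping can be checked on a small example. This is a sketch with made-up column names and values. Dates stored as text are converted with pd.to_datetime. Depending on the pandas version, text columns may be reported as a dedicated string dtype rather than object, and the datetime64 unit may differ.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "name": ["x", "y"],
    "n": [1, 2],
    "ratio": [0.5, 1.5],
    "when": ["2018-06-07", "2018-08-16"],
})

# Dates read as text stay text until they are converted explicitly.
df["when"] = pd.to_datetime(df["when"])

# Collect the dtype of every column as a plain string.
dtype_names = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(dtype_names)
```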

Rows, columns, and cells

col.py

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # col.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Column operations
    country_df = df['country']  # select a single column by name
    print("\n\nFirst five rows of the column")
    print(country_df.head())
    print("\n\nLast five rows of the column")
    print(country_df.tail())

    country_df_dot = df.country  # select a column with dot notation
    print("\n\nSelecting a column with dot notation")
    print(country_df_dot.head())

    subset = df[['country', 'continent', 'year']]  # select several columns
    print("\n\nSelecting several columns")
    print(subset.head())

Output

    $ ./col.py


    First five rows of the column
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object


    Last five rows of the column
    1699    Zimbabwe
    1700    Zimbabwe
    1701    Zimbabwe
    1702    Zimbabwe
    1703    Zimbabwe
    Name: country, dtype: object


    Selecting a column with dot notation
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object


    Selecting several columns
           country continent  year
    0  Afghanistan      Asia  1952
    1  Afghanistan      Asia  1957
    2  Afghanistan      Asia  1962
    3  Afghanistan      Asia  1967
    4  Afghanistan      Asia  1972
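Dot notation is convenient, but it only works when the column name does not collide with an existing DataFrame attribute. The gapminder data has a column named pop, which is also the name of the DataFrame.pop method. The sketch below uses a tiny made-up DataFrame with the same column names to show the difference.

```python
import pandas as pd

# Hypothetical toy data that reuses two gapminder column names.
df = pd.DataFrame({
    "country": ["A", "B"],
    "pop": [100, 200],
})

country_by_dot = df.country   # works: no attribute named "country" exists
pop_by_bracket = df["pop"]    # works: bracket access always means the column
pop_by_dot = df.pop           # the DataFrame.pop method, not the column

print(type(country_by_dot))
print(type(pop_by_bracket))
print(callable(pop_by_dot))
```

Bracket access is therefore the safer default in scripts.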

row.py

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # row.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Row operations. Note that df.loc[-1] is invalid here: loc looks up
    # index labels, and -1 is not a label in this DataFrame.
    print("\n\nFirst row")
    print(df.loc[0])
    print("\n\nNumber of rows")
    number_of_rows = df.shape[0]
    print(number_of_rows)
    last_row_index = number_of_rows - 1
    print("\n\nLast row")
    print(df.loc[last_row_index])
    print("\n\nLast row via tail")
    print(df.tail(n=1))

    subset_loc = df.loc[0]
    subset_head = df.head(n=1)
    print("\n\nloc with a single label returns a Series")
    print(type(subset_loc))
    print("\n\nhead returns a DataFrame")
    print(type(subset_head))

    print("\n\nloc selecting three rows returns a DataFrame")
    print(df.loc[[0, 99, 999]])
    print(type(df.loc[[0, 99, 999]]))
    print("\n\niloc selecting the first row")
    print(df.iloc[0])
    print("\n\niloc selecting three rows")
    print(df.iloc[[0, 99, 999]])

Output

    $ ./row.py


    First row
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object


    Number of rows
    1704


    Last row
    country      Zimbabwe
    continent      Africa
    year             2007
    lifeExp        43.487
    pop          12311143
    gdpPercap     469.709
    Name: 1703, dtype: object


    Last row via tail
           country continent  year  lifeExp       pop   gdpPercap
    1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298


    loc with a single label returns a Series
    <class 'pandas.core.series.Series'>


    head returns a DataFrame
    <class 'pandas.core.frame.DataFrame'>


    loc selecting three rows returns a DataFrame
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
    <class 'pandas.core.frame.DataFrame'>


    iloc selecting the first row
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object


    iloc selecting three rows
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
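The results of loc and iloc coincide in row.py only because the default index labels happen to equal the row positions. A minimal sketch with made-up data shows where they diverge: loc looks up labels, iloc looks up positions, so -1 is valid for iloc but not for loc.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2.
df = pd.DataFrame({"value": [10, 20, 30]})

# iloc is positional, so -1 means "last row".
last_by_position = df.iloc[-1]
print(last_by_position["value"])

# loc is label-based; there is no label -1, so a KeyError is raised.
try:
    df.loc[-1]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)

# With custom labels, the same row has a different label and position.
relabeled = pd.DataFrame({"value": [10, 20, 30]}, index=[100, 200, 300])
first_by_label = relabeled.loc[100, "value"]
first_by_position = relabeled.iloc[0, 0]
print(first_by_label == first_by_position)
```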

mix.py

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # Author:    xurongzhong#126.com wechat:pythontesting qq:37391319
    # QQ groups: 144081101 591302926  567351477
    # CreateDate: 2018-06-07
    # mix.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    # Selecting rows and columns together
    print("\n\nloc selecting a single cell")
    print(df.loc[42, 'country'])
    print("\n\niloc selecting a single cell")
    print(df.iloc[42, 0])
    print("\n\nloc selecting a subset")
    print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])

Output

    $ ./mix.py


    loc selecting a single cell
    Angola


    iloc selecting a single cell
    Angola


    loc selecting a subset
             country  lifeExp    gdpPercap
    0    Afghanistan   28.801   779.445314
    99    Bangladesh   43.453   721.186086
    999     Mongolia   51.253  1226.041130
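Both indexers also accept slices for rows and columns, with one difference worth knowing: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below uses made-up values under gapminder-style column names.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "lifeExp": [50.0, 60.0, 70.0, 80.0],
    "gdpPercap": [1.0, 2.0, 3.0, 4.0],
})

# Label slices are inclusive: rows 0, 1, 2 and columns country..lifeExp.
by_label = df.loc[0:2, "country":"lifeExp"]

# Position slices are half-open: rows 0, 1 and columns 0, 1.
by_position = df.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```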

Grouping and aggregation

group.py

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # group.py
    import pandas as pd

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    print("\n\nMean life expectancy by year")
    print(df.groupby('year')['lifeExp'].mean())

    # The same computation, one step at a time
    print("\n\nGroup by year")
    grouped_year_df = df.groupby('year')
    print(type(grouped_year_df))
    print(grouped_year_df)
    print("\n\nlifeExp")
    grouped_year_df_lifeExp = grouped_year_df['lifeExp']
    print(type(grouped_year_df_lifeExp))
    print(grouped_year_df_lifeExp)
    print("\n\nMean life expectancy by year")
    mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
    print(mean_lifeExp_by_year)

    print("\n\nGroup by year and continent")
    print(df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean())
    print("\n\nNumber of countries per continent")
    print(df.groupby('continent')['country'].nunique())

Output

    $ ./group.py


    Mean life expectancy by year
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64


    Group by year
    <class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
    <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f0e2b0c89e8>


    lifeExp
    <class 'pandas.core.groupby.groupby.SeriesGroupBy'>
    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f0e151e2f28>


    Mean life expectancy by year
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64


    Group by year and continent
                      lifeExp     gdpPercap
    year continent
    1952 Africa     39.135500   1252.572466
         Americas   53.279840   4079.062552
         Asia       46.314394   5195.484004
         Europe     64.408500   5661.057435
         Oceania    69.255000  10298.085650
    1957 Africa     41.266346   1385.236062
         Americas   55.960280   4616.043733
         Asia       49.318544   5787.732940
         Europe     66.703067   6963.012816
         Oceania    70.295000  11598.522455
    1962 Africa     43.319442   1598.078825
         Americas   58.398760   4901.541870
         Asia       51.563223   5729.369625
         Europe     68.539233   8365.486814
         Oceania    71.085000  12696.452430
    1967 Africa     45.334538   2050.363801
         Americas   60.410920   5668.253496
         Asia       54.663640   5971.173374
         Europe     69.737600  10143.823757
         Oceania    71.310000  14495.021790
    1972 Africa     47.450942   2339.615674
         Americas   62.394920   6491.334139
         Asia       57.319269   8187.468699
         Europe     70.775033  12479.575246
         Oceania    71.910000  16417.333380
    1977 Africa     49.580423   2585.938508
         Americas   64.391560   7352.007126
         Asia       59.610556   7791.314020
         Europe     71.937767  14283.979110
         Oceania    72.855000  17283.957605
    1982 Africa     51.592865   2481.592960
         Americas   66.228840   7506.737088
         Asia       62.617939   7434.135157
         Europe     72.806400  15617.896551
         Oceania    74.290000  18554.709840
    1987 Africa     53.344788   2282.668991
         Americas   68.090720   7793.400261
         Asia       64.851182   7608.226508
         Europe     73.642167  17214.310727
         Oceania    75.320000  20448.040160
    1992 Africa     53.629577   2281.810333
         Americas   69.568360   8044.934406
         Asia       66.537212   8639.690248
         Europe     74.440100  17061.568084
         Oceania    76.945000  20894.045885
    1997 Africa     53.598269   2378.759555
         Americas   71.150480   8889.300863
         Asia       68.020515   9834.093295
         Europe     75.505167  19076.781802
         Oceania    78.190000  24024.175170
    2002 Africa     53.325231   2599.385159
         Americas   72.422040   9287.677107
         Asia       69.233879  10174.090397
         Europe     76.700600  21711.732422
         Oceania    79.740000  26938.778040
    2007 Africa     54.806038   3089.032605
         Americas   73.608120  11003.031625
         Asia       70.728485  12473.026870
         Europe     77.648600  25054.481636
         Oceania    80.719500  29810.188275


    Number of countries per continent
    continent
    Africa      52
    Americas    25
    Asia        33
    Europe      30
    Oceania      2
    Name: country, dtype: int64
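A groupby followed by mean() can be read as split, apply, combine: split the rows by key, compute the mean within each piece, and combine the results. The sketch below, on made-up data with gapminder-style column names, compares groupby with a hand-written equivalent and also shows reset_index, which turns the grouped index back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "continent": ["X", "Y", "X", "Y"],
    "lifeExp": [50.0, 70.0, 52.0, 74.0],
})

# One-step version: mean lifeExp per year.
mean_by_year = df.groupby("year")["lifeExp"].mean()

# Hand-written equivalent: filter the rows of each year, then average them.
manual = {
    int(year): df.loc[df["year"] == year, "lifeExp"].mean()
    for year in df["year"].unique()
}

# Grouping by two keys gives a two-level index; reset_index flattens it.
flat = df.groupby(["year", "continent"])["lifeExp"].mean().reset_index()

print(mean_by_year)
print(manual)
print(flat)
```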

Basic plotting

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t')

    global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
    print(global_yearly_life_expectancy)
    global_yearly_life_expectancy.plot()
    plt.show()  # needed to display the figure when run as a script
Article link: https://www.sbkko.com/ganhuo-331.html