第5章 缺失数据
介绍
很少没有任何缺失值的数据集。 有许多缺失数据的表示。 在数据库中是NULL值,一些编程语言使用NA。缺失值可以是空字符串:”或者甚至是数值88或99等。Pandas显示缺失值为NaN。
本章将涵盖:
- 什么是缺失值
- 如何创建缺失值
- 如何重新编码并使用缺失值进行计算
什么是缺失值
可以从numpy中获得NaN值,在Python中看到缺失值使用几种方式显示:NaN,NAN或nan,他们都是相等的。
NaN不等于0或空字符串”。
In [1]: from numpy import NaN, NAN, nanIn [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')False|False|False|FalseIn [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')False|False|False|FalseIn [4]: import pandas as pdIn [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')True|True|TrueIn [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')False|True|True
缺失值的来源
来自加载数据或数据处理
- 加载数据
当我们加载数据时,pandas会自动找到该缺少数据的单元格,并填充NaN值。在read_csv函数中,参数na_values, keep_default_na, na_filter用于处理缺失值。比如:na_values=[99]。na_filter设置为False,在读大文件时会提升性能。
5-1.py
import pandas as pdvisited_file = 'data/survey_visited.csv'print(pd.read_csv(visited_file))print(pd.read_csv(visited_file, keep_default_na=False))print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))
执行结果
$ python3 5-1.py ident site dated0 619 DR-1 1927-02-081 622 DR-1 1927-02-102 734 DR-3 1939-01-073 735 DR-3 1930-01-124 751 DR-3 1930-02-265 752 DR-3 NaN6 837 MSK-4 1932-01-147 844 DR-1 1932-03-22 ident site dated0 619 DR-1 1927-02-081 622 DR-1 1927-02-102 734 DR-3 1939-01-073 735 DR-3 1930-01-124 751 DR-3 1930-02-265 752 DR-3 6 837 MSK-4 1932-01-147 844 DR-1 1932-03-22 ident site dated0 619 DR-1 1927-02-081 622 DR-1 1927-02-102 734 DR-3 1939-01-073 735 DR-3 1930-01-124 751 DR-3 1930-02-265 752 DR-3 NaN6 837 MSK-4 1932-01-147 844 DR-1 1932-03-22
- 合并数据
import pandas as pdvisited = pd.read_csv('data/survey_visited.csv')survey = pd.read_csv('data/survey_survey.csv')print(visited)print(survey)vs = visited.merge(survey, left_on='ident', right_on='taken')print(vs)
执行结果
$ python3 5-2.py ident site dated0 619 DR-1 1927-02-081 622 DR-1 1927-02-102 734 DR-3 1939-01-073 735 DR-3 1930-01-124 751 DR-3 1930-02-265 752 DR-3 NaN6 837 MSK-4 1932-01-147 844 DR-1 1932-03-22 taken person quant reading0 619 dyer rad 9.821 619 dyer sal 0.132 622 dyer rad 7.803 622 dyer sal 0.094 734 pb rad 8.415 734 lake sal 0.056 734 pb temp -21.507 735 pb rad 7.228 735 NaN sal 0.069 735 NaN temp -26.0010 751 pb rad 4.3511 751 pb temp -18.5012 751 lake sal 0.1013 752 lake rad 2.1914 752 lake sal 0.0915 752 lake temp -16.0016 752 roe sal 41.6017 837 lake rad 1.4618 837 lake sal 0.2119 837 roe sal 22.5020 844 roe rad 11.25 ident site dated taken person quant reading0 619 DR-1 1927-02-08 619 dyer rad 9.821 619 DR-1 1927-02-08 619 dyer sal 0.132 622 DR-1 1927-02-10 622 dyer rad 7.803 622 DR-1 1927-02-10 622 dyer sal 0.094 734 DR-3 1939-01-07 734 pb rad 8.415 734 DR-3 1939-01-07 734 lake sal 0.056 734 DR-3 1939-01-07 734 pb temp -21.507 735 DR-3 1930-01-12 735 pb rad 7.228 735 DR-3 1930-01-12 735 NaN sal 0.069 735 DR-3 1930-01-12 735 NaN temp -26.0010 751 DR-3 1930-02-26 751 pb rad 4.3511 751 DR-3 1930-02-26 751 pb temp -18.5012 751 DR-3 1930-02-26 751 lake sal 0.1013 752 DR-3 NaN 752 lake rad 2.1914 752 DR-3 NaN 752 lake sal 0.0915 752 DR-3 NaN 752 lake temp -16.0016 752 DR-3 NaN 752 roe sal 41.6017 837 MSK-4 1932-01-14 837 lake rad 1.4618 837 MSK-4 1932-01-14 837 lake sal 0.2119 837 MSK-4 1932-01-14 837 roe sal 22.5020 844 DR-1 1932-03-22 844 roe rad 11.25
- 用户输入
import pandas as pdfrom numpy import NaN, NAN, nannum_legs = pd.Series({'goat': 4, 'amoeba': nan})print(num_legs)scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'], 'Occupation': ['Chemist', 'Statistician'], 'Born': ['1920-07-25', '1876-06-13'], 'Died': ['1958-04-16', '1937-10-16'], 'missing': [NaN, nan]})print(scientists)scientists['missing'] = nanprint(scientists)
执行结果
$ python3 5-3.py amoeba NaNgoat 4.0dtype: float64 Born Died Name Occupation missing0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN1 1876-06-13 1937-10-16 William Gosset Statistician NaN Born Died Name Occupation missing0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN1 1876-06-13 1937-10-16 William Gosset Statistician NaN
- 重新索引
5-4.py
import pandas as pdfrom numpy import NaN, NAN, nangapminder = pd.read_csv('data/gapminder.tsv', sep='\t')life_exp = gapminder.groupby(['year'])['lifeExp'].mean()print(life_exp)print(life_exp.reindex(range(2000, 2010)))
执行结果
year1952 49.0576201957 51.5074011962 53.6092491967 55.6782901972 57.6473861977 59.5701571982 61.5331971987 63.2126131992 64.1603381997 65.0146762002 65.6949232007 67.007423Name: lifeExp, dtype: float64year2000 NaN2001 NaN2002 65.6949232003 NaN2004 NaN2005 NaN2006 NaN2007 67.0074232008 NaN2009 NaNName: lifeExp, dtype: float64
处理缺失数据
- 统计缺失数据
5-5.py
import pandas as pdfrom numpy import NaN, NAN, nanimport numpy as npebola = pd.read_csv('data/country_timeseries.csv')print(ebola.head())print(ebola.count())num_rows = ebola.shape[0]print("num_rows")print(num_rows)num_missing = num_rows - ebola.count()print("num_missing:")print(num_missing)print(np.count_nonzero(ebola.isnull()))print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))print(ebola.Cases_Guinea.value_counts(dropna=False).head())
执行结果
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 NaN 8157.0 NaN 4 12/31/2014 284 2730.0 8115.0 9633.0 Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \0 1786.0 NaN 2977.0 NaN 1 1781.0 NaN 2943.0 NaN 2 1767.0 3496.0 2915.0 NaN 3 NaN 3496.0 NaN NaN 4 1739.0 3471.0 2827.0 NaN Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN Date 122Day 122Cases_Guinea 93Cases_Liberia 83Cases_SierraLeone 87Cases_Nigeria 38Cases_Senegal 25Cases_UnitedStates 18Cases_Spain 16Cases_Mali 12Deaths_Guinea 92Deaths_Liberia 81Deaths_SierraLeone 87Deaths_Nigeria 38Deaths_Senegal 22Deaths_UnitedStates 18Deaths_Spain 16Deaths_Mali 12dtype: int64num_rows122num_missing:Date 0Day 0Cases_Guinea 29Cases_Liberia 39Cases_SierraLeone 35Cases_Nigeria 84Cases_Senegal 97Cases_UnitedStates 104Cases_Spain 106Cases_Mali 110Deaths_Guinea 30Deaths_Liberia 41Deaths_SierraLeone 35Deaths_Nigeria 84Deaths_Senegal 100Deaths_UnitedStates 104Deaths_Spain 106Deaths_Mali 110dtype: int64121429NaN 29 86.0 3 495.0 2 112.0 2 390.0 2Name: Cases_Guinea, dtype: int64
- 处理缺失数据
5-6.py
import pandas as pdfrom numpy import NaN, NAN, nanimport numpy as npebola = pd.read_csv('data/country_timeseries.csv')print(ebola.iloc[0:10, 0:5])print(ebola.fillna(0).iloc[0:10, 0:5])# 前向填充print(ebola.fillna(method='ffill').iloc[0:10, 0:5])# 后向填充print(ebola.fillna(method='bfill').iloc[0:10, 0:5])print(ebola.interpolate().iloc[0:10, 0:5])print(ebola.shape)ebola_dropna = ebola.dropna()print(ebola_dropna.shape)print(ebola_dropna)ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \ebola['Cases_SierraLeone']ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_multiple']]print(ebola_subset.head(n=10))print(ebola.Cases_Guinea.sum(skipna = True))print(ebola.Cases_Guinea.sum(skipna = False))
执行结果
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone0 1/5/2015 289 2776.0 NaN 10030.01 1/4/2015 288 2775.0 NaN 9780.02 1/3/2015 287 2769.0 8166.0 9722.03 1/2/2015 286 NaN 8157.0 NaN4 12/31/2014 284 2730.0 8115.0 9633.05 12/28/2014 281 2706.0 8018.0 9446.06 12/27/2014 280 2695.0 NaN 9409.07 12/24/2014 277 2630.0 7977.0 9203.08 12/21/2014 273 2597.0 NaN 9004.09 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone0 1/5/2015 289 2776.0 0.0 10030.01 1/4/2015 288 2775.0 0.0 9780.02 1/3/2015 287 2769.0 8166.0 9722.03 1/2/2015 286 0.0 8157.0 0.04 12/31/2014 284 2730.0 8115.0 9633.05 12/28/2014 281 2706.0 8018.0 9446.06 12/27/2014 280 2695.0 0.0 9409.07 12/24/2014 277 2630.0 7977.0 9203.08 12/21/2014 273 2597.0 0.0 9004.09 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone0 1/5/2015 289 2776.0 NaN 10030.01 1/4/2015 288 2775.0 NaN 9780.02 1/3/2015 287 2769.0 8166.0 9722.03 1/2/2015 286 2769.0 8157.0 9722.04 12/31/2014 284 2730.0 8115.0 9633.05 12/28/2014 281 2706.0 8018.0 9446.06 12/27/2014 280 2695.0 8018.0 9409.07 12/24/2014 277 2630.0 7977.0 9203.08 12/21/2014 273 2597.0 7977.0 9004.09 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone0 1/5/2015 289 2776.0 8166.0 10030.01 1/4/2015 288 2775.0 8166.0 9780.02 1/3/2015 287 2769.0 8166.0 9722.03 1/2/2015 286 2730.0 8157.0 9633.04 12/31/2014 284 2730.0 8115.0 9633.05 12/28/2014 281 2706.0 8018.0 9446.06 12/27/2014 280 2695.0 7977.0 9409.07 12/24/2014 277 2630.0 7977.0 9203.08 12/21/2014 273 2597.0 7862.0 9004.09 12/20/2014 272 2571.0 7862.0 8939.0 Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone0 1/5/2015 289 2776.0 NaN 10030.01 1/4/2015 288 2775.0 NaN 9780.02 1/3/2015 287 2769.0 8166.0 9722.03 1/2/2015 286 2749.5 8157.0 9677.54 12/31/2014 284 2730.0 8115.0 9633.05 12/28/2014 281 2706.0 8018.0 9446.06 12/27/2014 280 2695.0 7997.5 9409.07 12/24/2014 277 2630.0 7977.0 9203.08 12/21/2014 273 2597.0 7919.5 9004.09 12/20/2014 272 2571.0 7862.0 8939.0(122, 18)(1, 18) Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \19 11/18/2014 241 2047.0 7082.0 6190.0 Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \19 20.0 1.0 4.0 1.0 6.0 Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \19 1214.0 2963.0 1267.0 8.0 Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali 19 0.0 1.0 0.0 6.0 Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_multiple0 2776.0 NaN 10030.0 NaN1 2775.0 NaN 9780.0 NaN2 2769.0 8166.0 9722.0 20657.03 NaN 8157.0 NaN NaN4 2730.0 8115.0 9633.0 20478.05 2706.0 8018.0 9446.0 20170.06 2695.0 NaN 9409.0 NaN7 2630.0 7977.0 9203.0 19810.08 2597.0 NaN 9004.0 NaN9 2571.0 7862.0 8939.0 19372.084729.0nan
参考资料
- 讨论qq群144081101 591302926 567351477 钉钉免费群21745728
- 本文最新版本地址
- 本文涉及的python测试开发库 谢谢点赞!
- 本文相关海量书籍下载
- 源码下载
- 本文英文版书籍下载
可惜我python还没学完!