How to use pandas (1) - Remrinのpython攻略日記

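This post covers how to use the pandas library.

pandas has two main data structures: Series (1-dimensional) and DataFrame (2-dimensional).

This time we start with how to use Series.

○ Creating a Series (a one-dimensional array).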

import pandas as pd
l = [3, 4, 5, 6, 7]
s1 = pd.Series(l)   # named s1 because the later snippets refer to s1
print(s1)
# 0    3
# 1    4
# 2    5
# 3    6
# 4    7
# dtype: int64

　
An index (label) is assigned automatically.
Get only the index: Series.index
Get only the values: Series.values

print(s1.index)         # RangeIndex(start=0, stop=5, step=1)
print(type(s1.index))   # <class 'pandas.core.indexes.range.RangeIndex'>
print(s1.values)        # [3 4 5 6 7]
print(type(s1.values))  # <class 'numpy.ndarray'>

The values of a Series are held as a NumPy ndarray object.
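As a small sketch of the point above, the following pulls the index and the values apart and checks their types. The variable names (s, idx, vals, arr) are only illustrative. to_numpy() is shown alongside .values on the assumption that a reasonably recent pandas is installed.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 4, 5, 6, 7])

idx = s.index        # RangeIndex when the index is auto-generated
vals = s.values      # NumPy array holding the data
arr = s.to_numpy()   # the same data through the newer accessor

print(list(idx))                      # [0, 1, 2, 3, 4]
print(isinstance(vals, np.ndarray))   # True
print(arr.sum())                      # 25
```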
　
To create a Series with your own index, pass an array to the index option.

s2 = pd.Series(l, index=["a", "b", "c", "d", "e"])
print(s2)
# a    3
# b    4
# c    5
# d    6
# e    7
# dtype: int64

print(s2.index) # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

　
To replace the index, assign to the index attribute.

d = pd.date_range("20170501", periods=5)
s1.index = d     # replace the index with d
print(s1)
# 2017-05-01    3
# 2017-05-02    4
# 2017-05-03    5
# 2017-05-04    6
# 2017-05-05    7
# Freq: D, dtype: int64
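Once the index holds dates, values can be looked up by date. The sketch below rebuilds the same data under an illustrative name (s) and shows a single-day lookup and a date-range slice. It is an example of what a date index allows, not something the steps above require.

```python
import pandas as pd

dates = pd.date_range("20170501", periods=5)
s = pd.Series([3, 4, 5, 6, 7], index=dates)

# Look up a single day with a Timestamp
one_day = s.loc[pd.Timestamp("2017-05-03")]

# Date strings can be used as slice bounds; both ends are included
window = s.loc["2017-05-02":"2017-05-04"]

print(one_day)        # 5
print(list(window))   # [4, 5, 6]
```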

　
・Values are read and assigned by specifying the index label.

print(s2["a"]) #  3
s2["a"] = 10

print(s2[["a", "b", "c"]])
# a    10
# b     4
# c     5
# dtype: int64

# print(s2["a", "b", "c"])  # KeyError: the tuple ('a', 'b', 'c') is treated as a single key
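Square-bracket access as above works by label. As a hedged supplement, the explicit .loc (label-based) and .iloc (position-based) accessors make the intent unambiguous. The Series below is a fresh copy under the illustrative name s.

```python
import pandas as pd

s = pd.Series([3, 4, 5, 6, 7], index=["a", "b", "c", "d", "e"])

by_label = s.loc["b"]          # label-based lookup
by_position = s.iloc[1]        # position-based lookup (0-origin)
subset = s.loc[["a", "c"]]     # a list of labels returns a Series
label_slice = s.loc["b":"d"]   # label slices include the end label

print(by_label, by_position)     # 4 4
print(list(subset))              # [3, 5]
print(list(label_slice.index))   # ['b', 'c', 'd']
```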

　
Filtering, NumPy-style arithmetic, and function application are supported.

# Filtering: select the elements greater than 5
print(s2[s2 > 5])
# a    10
# d     6
# e     7
# dtype: int64
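The filter above uses a single condition. A minimal sketch of combining conditions follows. The element-wise operators & (and), | (or) and ~ (not) are used, and each comparison needs its own parentheses. The names s and mask are illustrative.

```python
import pandas as pd

s = pd.Series([10, 4, 5, 6, 7], index=["a", "b", "c", "d", "e"])

# A boolean Series: True where 4 < value < 10
mask = (s > 4) & (s < 10)

selected = s[mask]    # elements where the mask is True
outside = s[~mask]    # elements where the mask is False

print(list(selected.index))   # ['c', 'd', 'e']
print(list(outside.index))    # ['a', 'b']
```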

# Arithmetic is broadcast over the elements, as in NumPy
print(s2 * 2)
# a    20
# b     8
# c    10
# d    12
# e    14
# dtype: int64

# Function application is also broadcast (NumPy universal functions)
import numpy as np
print(np.power(2, s2))
# a    1024
# b      16
# c      32
# d      64
# e     128
# dtype: int64
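As a further sketch of function application, a NumPy universal function returns a new Series that keeps the original index. For an ordinary Python function, Series.map applies it element by element. The data and the "big"/"small" labels here are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9], index=["a", "b", "c"])

# NumPy ufunc: applied to every element, index is preserved
roots = np.sqrt(s)

# Plain Python function applied element by element
labels = s.map(lambda v: "big" if v > 3 else "small")

print(roots.tolist())       # [1.0, 2.0, 3.0]
print(list(roots.index))    # ['a', 'b', 'c']
print(labels.tolist())      # ['small', 'big', 'big']
```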

　
・Membership can be tested as with a dict (the test looks at the index labels).

print("a" in s2) # True
print("A" in s2) # False
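Note that, as with a dict, the in operator looks at the labels and not at the values. The sketch below contrasts the two. Checking against .values and using isin() are shown as common ways to test the values instead. The Series s is illustrative.

```python
import pandas as pd

s = pd.Series([10, 4, 5], index=["a", "b", "c"])

label_hit = "a" in s                 # True: "a" is an index label
value_as_label = 10 in s             # False: 10 is a value, not a label
value_hit = 10 in s.values           # True: search the underlying array
flags = s.isin([4, 5]).tolist()      # element-wise membership test

print(label_hit, value_as_label, value_hit)   # True False True
print(flags)                                  # [False, True, True]
```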

　
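・A Series can be created from a dict.
  Older pandas versions sorted the index by key at this point. pandas 0.23 and
  later (on Python 3.6+) keep the dict's insertion order, so sort_index() is
  called explicitly here to get the sorted result.

roman = {"I":1, "V":5, "X":10, "L":50, "C":100}
s3 = pd.Series(roman).sort_index()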
print(s3)
# C    100
# I      1
# L     50
# V      5
# X     10
# dtype: int64

　
・When a dict is combined with an explicit index array, labels with no data become NaN.
  NaN: not a number (missing value, NA)

roman = {"I":1, "V":5, "X":10, "L":50, "C":100}
s4 = pd.Series(roman, index=["V", "W", "X", "Y", "Z"])
print(s4)
# V     5.0
# W     NaN
# X    10.0
# Y     NaN
# Z     NaN
# dtype: float64
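The output above is float64 even though the dict holds integers, because NaN is a floating-point value. The following sketch confirms the dtype and, assuming a pandas version that provides the nullable "Int64" extension dtype (1.0 or later), converts back to integers while keeping the missing entry. The names s and as_int are illustrative.

```python
import pandas as pd

s = pd.Series({"V": 5, "X": 10}, index=["V", "W", "X"])

# "W" has no data, so it becomes NaN and the dtype becomes float64
print(s.dtype)   # float64

# Nullable integer dtype: integers plus a missing-value marker
as_int = s.astype("Int64")

print(as_int.dtype)            # Int64
print(as_int["V"])             # 5
print(pd.isna(as_int["W"]))    # True
```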

　
・Missing values are detected with the pd.isnull() and pd.notnull() functions,
  or with the Series.isnull() and Series.notnull() methods.

# isnull() function
print(pd.isnull(s4["W"]))   # True
print(pd.notnull(s4["W"]))  # False
print(pd.isnull(s4))
# V    False
# W     True
# X    False
# Y     True
# Z     True
# dtype: bool

# isnull() method
print(s4.isnull())
# V    False
# W     True
# X    False
# Y     True
# Z     True
# dtype: bool
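The boolean result of isnull() / notnull() is typically used as a mask. A minimal sketch follows, with made-up data under the name s: it counts the missing values, keeps only the present ones, and shows dropna() and fillna() as the usual shortcuts.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan, 10.0, np.nan], index=["V", "W", "X", "Y"])

n_missing = int(s.isnull().sum())   # True counts as 1
present = s[s.notnull()]            # filter with the boolean mask
dropped = s.dropna()                # same result as the line above
filled = s.fillna(0)                # replace NaN with 0 instead

print(n_missing)               # 2
print(list(present.index))     # ['V', 'X']
print(filled.tolist())         # [5.0, 0.0, 10.0, 0.0]
```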

　
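・Arithmetic between Series.
  Values are aligned by index label. If either side is NaN, or the label
  exists on only one side, the result is also NaN.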

print(s3 + s4)
# C     NaN
# I     NaN
# L     NaN
# V    10.0
# W     NaN
# X    20.0
# Y     NaN
# Z     NaN
# dtype: float64
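If a label that exists on only one side should count as 0 rather than produce NaN, the add() method with fill_value is one option. This is a sketch with two small illustrative Series (a and b), not the s3 and s4 used above.

```python
import pandas as pd

a = pd.Series({"I": 1, "V": 5, "X": 10})
b = pd.Series({"V": 5, "X": 10, "L": 50})

# The + operator: labels found on only one side give NaN
plain = a + b

# add() with fill_value: the missing side is treated as 0
filled = a.add(b, fill_value=0)

print(pd.isnull(plain["I"]))       # True
print(plain["V"], plain["X"])      # 10.0 20.0
print(filled["I"], filled["L"])    # 1.0 50.0
```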

　
・A name attribute can be set on both the Series and its index.

s3.name = "roman"
s3.index.name = "fig."
print(s3)
# fig.
# C    100
# I      1
# L     50
# V      5
# X     10
# Name: roman, dtype: int64
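One place where these names become useful is conversion to a DataFrame, where they are carried over as column and index names. The sketch below uses a shortened, illustrative copy of the data. DataFrame itself is outside the scope of this post.

```python
import pandas as pd

s = pd.Series({"I": 1, "V": 5, "X": 10}, name="roman")
s.index.name = "fig."

# to_frame(): the Series name becomes the column name
frame = s.to_frame()

# reset_index(): the index becomes a column named after index.name
table = s.reset_index()

print(list(frame.columns))   # ['roman']
print(frame.index.name)      # fig.
print(list(table.columns))   # ['fig.', 'roman']
```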