2017-05-02

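The collections library

The standard library's collections module provides several handy container data types.

namedtuple()	tuple-like	Factory function for creating tuples with named fields
deque	list-like	List-like container with fast appends and pops at both ends
ChainMap	dict-like	Dict-like class that creates a single view of multiple mappings
Counter	dict	Dict subclass for counting hashable objects
OrderedDict	dict	Dict subclass that remembers the order in which entries were added
defaultdict	dict	Dict subclass that calls a factory function to supply missing values
UserDict	dict	Wrapper around dict objects for easier dict subclassing
UserList	list	Wrapper around list objects for easier list subclassing
UserString	str	Wrapper around string objects for easier string subclassing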
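As a quick illustration of three entries from the table, here is a minimal sketch. The type name Point and all sample data are made up for this example.

```python
from collections import namedtuple, defaultdict, ChainMap

# namedtuple: a tuple whose fields can also be read by name
Point = namedtuple("Point", ["x", "y"])   # "Point" is a made-up example type
p = Point(3, 4)
print(p.x, p[1])            # 3 4

# defaultdict: a missing key is created by calling the factory (here list)
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)
print(dict(groups))         # {'a': ['apple', 'avocado'], 'b': ['banana']}

# ChainMap: keys are looked up in each mapping in turn, first hit wins
settings = ChainMap({"color": "red"}, {"color": "blue", "size": 10})
print(settings["color"], settings["size"])   # red 10
```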

　
○Counter (a subclass of dict)
Counts how often each element occurs and stores the result in a dict-like object.
　
Example

# -*- coding: utf-8 -*-
from collections import Counter

text = "あかねさす　むらさきのゆき　しめのゆき　のもりはみずや　きみがそでふる"
text = text.replace("　", "")

c = Counter(text)        # Counter object that counts each character
print(c)                 # Counter({'き': 4, 'の': 3, 'ゆ': 2,...
print(c.most_common(1))  # [('き', 4)]  the most frequent element

　
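Because Counter is a subclass of dict, it can be handled like an ordinary dictionary,
and almost all dict methods are available.
Looking up a missing element returns 0 instead of raising KeyError.

print(c["き"])  # 4
print(c["の"])  # 3
print(c["い"])  # 0  a missing key gives zero
print(len(c))   # 23  number of distinct characters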

　
・Creating a Counter

c = Counter()                   # a new, empty counter
c = Counter('akasatana')        # a new counter from an iterable
c = Counter({'あ': 4, 'い': 2}) # a new counter from a mapping
c = Counter(あ=4, い=8)         # a new counter from keyword args

print(c)                        #Counter({'い': 8, 'あ': 4})

　
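Counter adds three methods to dict: most_common(), elements(), subtract() (Python 3.10 later added total()).
・Counter.most_common(n)
　Without n, returns a list of all (key, count) tuples, from most to least frequent.
　With n, returns a list of only the n most frequent tuples.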

print(c.most_common(2))      #[('き', 4), ('の', 3)]
print(c.most_common()[0])    #('き', 4)
print(c.most_common()[0][0]) #き
print(c.most_common()[0][1]) #4

　
・Counter.elements()
Returns an iterator that yields each element as many times as its count.

c = Counter(a=4, b=2, c=0, d=-2)
print(sorted(c.elements()))   # ['a', 'a', 'a', 'a', 'b', 'b']

Elements whose count is zero or negative are skipped.
　
・Counter.subtract(iterable or mapping)
Subtracts counts in place. Unlike the - operator, zero and negative counts are kept.

c = Counter(a=4, b=2, c=0)
d = Counter(a=1, b=2, c=3)
c.subtract(d)
print(c)      #Counter({'a': 3, 'b': 0, 'c': -3})

　
・update(iterable or mapping) adds to the existing counts instead of replacing them.
　The argument is either a mapping such as {"a":1, "b":1}
　or an iterable of plain elements (not a sequence of (key, value) pairs).

c = Counter(a=4, b=2, c=0)
c.update({"a":1, "b":1})
print(c)                   #Counter({'a': 5, 'b': 3, 'c': 0})

c.update(["a", "c", "c"])
print(c)                   #Counter({'a': 6, 'b': 3, 'c': 2})

　
・Arithmetic and set operations

c = Counter(a=3, b=1)
d = Counter(a=1, b=2)
c1 = c + d               # add two counters together:  c[x] + d[x]
print(c1)                # Counter({'a': 4, 'b': 3})
c2 = c - d               # subtract (keeping only positive counts)
print(c2)                # Counter({'a': 2})
c3 = c & d               # intersection:  min(c[x], d[x]) 
print(c3)                # Counter({'b': 1, 'a': 1})
c4 = c | d               # union:  max(c[x], d[x])
print(c4)                # Counter({'a': 3, 'b': 2})

　
・Unary plus and minus

c = Counter(a=2, b=-4, c=0)
print(c)   # Counter({'a': 2, 'c': 0, 'b': -4})

c1 = +c    # keep only elements with a positive count
print(c1)  # Counter({'a': 2})

c2 = -c    # negate every count, then keep the positive ones
print(c2)  # Counter({'b': 4})

　
・Removing counts

c = Counter(a=1, b=2, c=3)
c["c"] = 0    # only sets the count to 0; the key is not removed
print(c)      # Counter({'b': 2, 'a': 1, 'c': 0})

c = +c        # drop elements whose count is not positive
print(c)      # Counter({'b': 2, 'a': 1})

del c["b"]    # del removes an element
print(c)      # Counter({'a': 1})

del c         # unbinds the name c; the object is freed once nothing else refers to it
#print(c)     # NameError: name 'c' is not defined

　
・Common patterns

sum(c.values())                 # total of all counts
c.clear()                       # reset all counts
list(c)                         # list unique elements
set(c)                          # convert to a set
dict(c)                         # convert to a regular dictionary
c.items()                       # view of (elem, cnt) pairs; wrap in list() to get a list
Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
c.most_common()[:-n-1:-1]       # n least common elements
+c                              # remove zero and negative counts
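The reversed slice in the "n least common" pattern is easy to misread, so here is a small sketch of it. The string "abracadabra" and n = 2 are arbitrary sample values.

```python
from collections import Counter

c = Counter("abracadabra")      # sample data chosen for illustration
n = 2

ranked = c.most_common()        # [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]
least = ranked[:-n-1:-1]        # walk backwards from the end, taking n items
print(least)                    # [('d', 1), ('c', 1)]
print(sum(c.values()))          # 11
```

The slice starts at the last entry and steps backwards, so the rarest element comes first.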

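deque

　Using a list as a queue.
　A queue works as FIFO (first in, first out).
　list.pop(0) takes out the first element of a list, but
　adding or removing at the left end of a built-in list is slow because every remaining element has to shift, so collections.deque is the better choice.
　A deque can append and pop quickly at both ends.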

from collections import deque
q = deque([1, 3, 5])
q.append(7)
q.appendleft(-1)
print(q)           # deque([-1, 1, 3, 5, 7])

print(q.pop())     # 7
print(q.popleft()) # -1
print(q)           # deque([1, 3, 5])
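As a rough sketch of the FIFO use described above, the following drains a queue from the left and also shows the maxlen option for keeping only recent items. The task names are placeholders.

```python
from collections import deque

# FIFO: append on the right, take from the left
tasks = deque(["a", "b", "c"])      # placeholder task names
tasks.append("d")
order = []
while tasks:
    order.append(tasks.popleft())
print(order)                        # ['a', 'b', 'c', 'd']

# maxlen keeps only the most recent items; older ones fall off the left
recent = deque(maxlen=3)
for i in range(5):
    recent.append(i)
print(recent)                       # deque([2, 3, 4], maxlen=3)
```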

The other data types will be added later.

Reference:
Python documentation: the collections library

2017-05-01

How to use BeautifulSoup4

python library

A library for parsing HTML.
　
○Preparing the HTML

# -*- coding: utf-8 -*-

import urllib.request
from bs4 import BeautifulSoup

url = "http://python-remrin.hatenadiary.jp/"
f = urllib.request.urlopen(url)
html = f.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")

　
○Displaying the title

title	tag plus text
title.name	tag name only
title.string	text only

print(soup.title)        #<title>Remrinのpython攻略日記</title>
print(soup.title.name)   #title
print(soup.title.string) #Remrinのpython攻略日記

　
○Getting the head and body

h = soup.head    # the <head> element

b = soup.body.b  # the first <b> tag inside <body>

○Getting <a> tags

soup.find_all("a")	list of all <a> tags
soup.find("a")	the first one only
soup.a	the first one only

a_list = soup.find_all("a")
#print(a_list)     # printing this shows about 100 entries

a_list = soup.find_all("a", limit=5)  # return at most 5
print(a_list)

a = soup.find("a") # first match only
print(a)           # <a href="#" id="sp-suggest-link">スマートフォン用の表示で見る</a>

a = soup.a         # first match only
print(a)           # <a href="#" id="sp-suggest-link">スマートフォン用の表示で見る</a>

a = soup.find("A")
print(a)           # None  returned when nothing matches (html.parser stores tag names in lowercase)

　
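○Getting tag attributes

print(soup.a.get("href")) # prints: #
print(soup.a.get("id"))   # prints: sp-suggest-link

　
○Getting text

print(soup.a.string)            # スマートフォン用の表示で見る

strings = soup.strings          # generator over all text in the document

strings = soup.stripped_strings # generator with surrounding whitespace stripped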

　
○Nested tags can be reached as well.

# all <a> tags inside the first <div>
da1 = soup.div.find_all("a")
print(da1)   # [<a href="#" id="sp-suggest-link">スマートフォン用の～

# the first <a> inside the first <div> inside the first <div>
print(soup.div.div.a)

　
○Getting tags that match conditions

soup.find_all("a", class_="link", href="/link") # class is a reserved word in Python, hence class_

soup.find_all("a", attrs={"class": "link", "href": "/link"})

soup.find_all(class_="link", href="/link")

soup.find_all(attrs={"class": "link", "href": "/link"})

　
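○Using regular expressions

import re

# all tags whose name starts with "b"
a = soup.find_all(re.compile("^b"))
#print(a)   # a lot of output

# everything whose href attribute contains "link"
a = soup.find_all(href=re.compile("link"))

# all <a> tags whose text contains "hello"
# (Beautiful Soup 4.4.0 and later call this argument string; text is the older name)
soup.find_all("a", string=re.compile("hello"))

　
References:
PythonとBeautiful Soupでスクレイピング
BeautifulSoup4 documentation (English)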

2017-05-01

How to use the urllib library

python library

urllib is a library for accessing URLs.
In Python 3 the urllib module was split into urllib.request, urllib.parse, and urllib.error, and renamed accordingly.
　
・urllib.request opens and reads URLs
・urllib.error contains the exceptions raised by urllib.request
・urllib.parse parses URLs
・urllib.robotparser parses robots.txt files
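urllib.parse can be tried without any network access. The sketch below splits and builds URLs; the example.com address and its query parameters are hypothetical.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urljoin

url = "http://example.com/entry/2017?tag=python&page=2"   # hypothetical URL

parts = urlparse(url)
print(parts.netloc)            # example.com
print(parts.path)              # /entry/2017
print(parse_qs(parts.query))   # {'tag': ['python'], 'page': ['2']}

# Build a query string, and resolve a relative link against a base URL
query = urlencode({"q": "deque", "n": 5})
print(query)                   # q=deque&n=5
print(urljoin(url, "archive")) # http://example.com/entry/archive
```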
　
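○Opening a URL
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  (signature as of Python 3.5)
Returns an object that works as a context manager.
　Methods:
　　・geturl()  returns the URL of the resource retrieved; useful for checking redirects.
　　・info()  returns the page's meta-information, such as headers, as an email.message_from_string() instance
　　・getcode()  the HTTP status code of the response
　
(Example) Print the first 40 bytes of this blog's top page
　(1) Using the with ... as context manager.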

import urllib.request
with urllib.request.urlopen('http://python-remrin.hatenadiary.jp/') as f:
    print(f.read(40))  # b'<!DOCTYPE html>\n<html\n  lang="ja"\n\ndata-'

　
　(2) Without with ... as.

import urllib.request
f = urllib.request.urlopen('http://python-remrin.hatenadiary.jp/')
print(f.read(40))  # b'<!DOCTYPE html>\n<html\n  lang="ja"\n\ndata-'
f.close()          # without a with block, close the response yourself

In both cases read() returns a bytes object (the printed value starts with b').
　
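○Decoding
If the HTML source declares UTF-8 as its encoding (for example with a tag such as <meta charset="utf-8">), the page is UTF-8 encoded, so decode it with utf-8 as well.
This blog is also encoded in UTF-8, so let's print the first 100 bytes.

f = urllib.request.urlopen('http://python-remrin.hatenadiary.jp/')
print(f.read(100).decode('utf-8'))

This prints the following.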

<!DOCTYPE html>
<html
  lang="ja"

data-admin-domain="//blog.hatena.ne.jp"
data-author="rare_Remrin"

To print the entire HTML, change read(100) to read().
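One thing to watch when decoding a partial read: read(n) counts bytes, not characters, so a cut can land in the middle of a multi-byte character. The sketch below imitates that with a locally built bytes object instead of a network response; the sample string is arbitrary.

```python
sample = "日本語のページ"
raw = sample.encode("utf-8")         # stand-in for the bytes returned by read()

print(len(sample), len(raw))         # 7 21  (each of these characters takes 3 bytes)

# Cutting at a byte count can split a character in two
chunk = raw[:4]
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as e:
    print("cannot decode:", e.reason)

# errors="replace" substitutes U+FFFD for the broken tail instead of raising
safe = chunk.decode("utf-8", errors="replace")
print(safe)
```

The original example avoids the problem because the first 100 bytes of that page are all ASCII.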
　
References:
Python 3.5 documentation
 How to fetch Internet resources using the urllib package
[Python] HTTP通信でGetやPostを行う (http://www.yoheim.net/blog.php?q=20160204)

2017-04-30

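Python's dict type

python

・The dictionary type, also called a mapping type.
・Similar to a sequence, but items are accessed by key rather than by position. Before Python 3.7 the order of items was not guaranteed; since 3.7 a dict preserves insertion order.
・Entries are registered as key:value pairs. The key is an identifier used in place of an index.
・You can iterate over keys, over values, or over key:value pairs.
・Keys must be hashable (immutable values such as numbers, strings, and tuples of immutables); values can be any object, mutable or not.
・On older versions, print() may not show items in insertion order; the sample outputs in this post come from such a version.
・Use len() to get the size.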
　

Creating a dict

d1 = {"a":70, "b":30}
print(d1)             # {'a': 70, 'b': 30}

d2 = {}
print(d2)             # {}   an empty dict

list1 = [("x", 2), ("y", 4)]
d3 = dict(list1)      # dict() builds a dict from a list or tuple of pairs
print(d3)             # {'y': 4, 'x': 2}

Accessing elements

　Specify a key instead of an index

d1 = {"a":70, "b":30}
#print(d1[0])         # KeyError: 0      0 is treated as a key, not a position
print(d1["a"])        # 70
#print(d1["c"])       # KeyError: 'c'    a missing key raises KeyError

Adding elements

d1["c"] = 60          # assigning to a key that does not exist yet adds it
print(d1)             # {'a': 70, 'b': 30, 'c': 60}

d1.setdefault("d", 0) # adds the key only when it is missing
print(d1)             # {'d': 0, 'a': 70, 'b': 30, 'c': 60}

d3.setdefault("x", 0) # an existing key is left unchanged; no error is raised either
print(d3)             # {'y': 4, 'x': 2}

Merging dicts

d5 = {"blue":"0x0000ff", "red":"0xff0000"} # when both have the same key, the value is overwritten
d5.update({"yellow":"0xffff00", "red":"red"})
print(d5)  # {'yellow': '0xffff00', 'blue': '0x0000ff', 'red': 'red'}

d1 = {"a":1, "b":2}
d2 = {"c":3, "d":4}
d3 = dict(d1, **d2) # merges two dicts; the keys of d2 must be strings
print(d3)           # {'d': 4, 'a': 1, 'b': 2, 'c': 3}

d1 = {"a":1, "b":2}
d2 = {"c":3, "d":4}
d3 = {"e":5, "f":6}
d4 = {**d1, **d2, **d3}   # works for three or more dicts (Python 3.5+)
print(d4) # {'f': 6, 'b': 2, 'c': 3, 'd': 4, 'a': 1, 'e': 5}

Deleting

d1 = {"a":70, "b":30}
del d1["a"]
print(d1)              # {'b': 30}

d1 = {"a":70, "b":30}
temp = d1.pop("a")     # removes the given key:value and returns only the value
print(d1)              # {'b': 30}
print(temp)            # 70

d1 = {"a":70, "b":30}
temp = d1.popitem()    # removes one item and returns it as a (key, value) tuple
                       # Python 3.7+: the last-inserted item; earlier: an arbitrary one
print(d1)              # {'a': 70}   on Python 3.7+
print(temp)            # ('b', 30)   on Python 3.7+

Deleting everything

d1 = {"a":70, "b":30}
d1.clear()             # removes every element
print(d1)              # {}

d1 = {"a":70, "b":30}
del d1                 # unbinds the name d1 itself
#print(d1)             # NameError: name 'd1' is not defined

d1 = {"a":70, "b":30}
while d1:              # until d1 is empty
    temp = d1.popitem()# keep removing items
print(d1)              # {}
print(temp)            # ('a', 70)

d1 = {"a":70, "b":30}
#for i in d1:
#    temp = d1.popitem() #RuntimeError: dictionary changed size during iteration

Searching

in operator	True if the key is present
not in operator	True if the key is absent
get() method	d1.get("...")  lets you set the value returned for a missing key

has_key() method  →  removed in Python 3
　

d1 = {"a":70, "b":30}
print("a" in d1)     # True
print("c" in d1)     # False
print("a" not in d1) # False
print("c" not in d1) # True

#d1.has_key("a")     # AttributeError: 'dict' object has no attribute 'has_key'

x = d1.get("a")
print(x)             # 70

x = d1.get("c")
print(x)             # None

x = d1.get("c", "doesn't exist") # the value to return when the key is missing
print(x)             # doesn't exist

Iteration (listing)

.keys()	returns a view of the keys
.values()	a view of the values
.items()	a view of (key, value) tuples

iterkeys()  →  removed in Python 3
iteritems()  →  removed in Python 3
　

Iterate over keys only	for k in dict1:  or  for k in dict1.keys():
Iterate over values only	for v in dict1.values():
Iterate over keys and values	for k, v in dict1.items():

d1 = {"a":70, "b":30}
print(d1.keys())         #dict_keys(['a', 'b'])
print(d1.values())       #dict_values([70, 30])
print(d1.items())        #dict_items([('a', 70), ('b', 30)])

print(list(d1.keys()))   #['a', 'b']
print(list(d1.values())) #[70, 30]
print(list(d1.items()))  #[('a', 70), ('b', 30)]
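A short sketch of the three iteration styles in use, assuming Python 3.7+ so that insertion order is preserved. The dict scores is sample data.

```python
scores = {"a": 70, "b": 30, "c": 50}     # sample data

for k, v in scores.items():
    print(k, v)

# Views are live: they reflect later changes to the dict
keys = scores.keys()
scores["d"] = 90
print(list(keys))                         # ['a', 'b', 'c', 'd']

# Sort the pairs by value, highest first
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)                            # [('d', 90), ('a', 70), ('c', 50), ('b', 30)]
```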

　
○A dict can be unpacked into keyword arguments with **dict.

def distance(x, y, z):                 # distance from the origin
    return (x**2 + y**2 + z**2) ** 0.5
d1 = {"x":1, "y":2, "z":2}             # coordinates written as a dict
a = distance(**d1)                     # unpack, then pass as keyword arguments
print(a)                               # 3.0

Maximum and minimum of a dict

d1 = {1:70, 2:100, 3:50, 4:80}

# largest / smallest key
print(min(d1))    # 1
print(max(d1))    # 4

# largest / smallest value
print(min(d1.values()))    # 50
print(max(d1.values()))    # 100

# find the min / max by value and get its key
print(min(d1, key=d1.get))  # 3
print(max(d1, key=d1.get))  # 2

# find the smallest / largest key and get its value
print(d1[min(d1)])   # 70
print(d1[max(d1)])   # 80

Swapping the keys and values of a dict

dic1 = {"睦月":"1月", "如月":"2月", "弥生":"3月"}
dic2 = {v:k for k, v in dic1.items()}

print(dic1) # {'如月': '2月', '睦月': '1月', '弥生': '3月'}
print(dic2) # {'1月': '睦月', '2月': '如月', '3月': '弥生'}
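The comprehension above works cleanly because every value is unique. When values repeat, later keys silently overwrite earlier ones. The sketch below shows that effect and one way around it; the names and grades are made-up sample data, and the order shown assumes Python 3.7+.

```python
from collections import defaultdict

grades = {"Ann": "A", "Bob": "B", "Cid": "A"}   # sample data with a repeated value

# Plain inversion: a later key overwrites an earlier one with the same value
flipped = {v: k for k, v in grades.items()}
print(flipped)          # {'A': 'Cid', 'B': 'Bob'}

# Grouping keeps every original key
grouped = defaultdict(list)
for name, grade in grades.items():
    grouped[grade].append(name)
print(dict(grouped))    # {'A': ['Ann', 'Cid'], 'B': ['Bob']}
```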

　
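○Using a dict, example (1)
Counting the hiragana used in a string
　
A caveat when reading from a dict
　Assignment works whether or not the key exists.
　Reading a missing key raises KeyError, so check for the key first, for example with the in operator.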

# -*- coding: utf-8 -*-

text = "あかねさす　むらさきのゆき　しめのゆき　のもりはみずや　きみがそでふる"
text = text.replace("　", "")

d = {}
for ch in text:
    if ch in d:
        d[ch] += 1
    else:
        d[ch] = 1
print(d)  #{'き': 4, 'そ': 1, 'や': 1, 'る': 1, 'ず': 1, 'ふ': 1, 'も': 1, 'り': 1, 'さ': 2, 'ら': 1, 'し': 1, 'の': 3, 'み': 2, 'は': 1, 'で': 1, 'が': 1, 'す': 1, 'む': 1, 'め': 1, 'ね': 1, 'ゆ': 2, 'か': 1, 'あ': 1}

　
Using the default value of dict.get() makes this simpler, with no conditional branch.

# -*- coding: utf-8 -*-

text = "あかねさす　むらさきのゆき　しめのゆき　のもりはみずや　きみがそでふる"
text = text.replace("　", "")

d = {}
for ch in text:
    d[ch] = d.get(ch, 0) + 1

# Printing everything is too much, so show only the most frequent character this time
m = max(d, key=lambda x:d[x])
print(m, d[m])  #き 4

　
With Counter from collections, finding the most frequent value is even simpler.

from collections import Counter
c = Counter(text)
print(c.most_common(1))   #[('き', 4)]

　
Building a dict for the amino-acid codon table.

bases = "UCAG"
codons = [b1+b2+b3 for b1 in bases for b2 in bases for b3 in bases]
aminoacids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_dict = dict(zip(codons, aminoacids))
print(codon_dict)

# The same table written out as a literal
codon_dict={
'UUU':'F',  'UUC':'F',  'UUA':'L',  'UUG':'L',  
'UCU':'S',  'UCC':'S',  'UCA':'S',  'UCG':'S',  
'UAU':'Y',  'UAC':'Y',  'UAA':'*',  'UAG':'*',  
'UGU':'C',  'UGC':'C',  'UGA':'*',  'UGG':'W',  
'CUU':'L',  'CUC':'L',  'CUA':'L',  'CUG':'L',  
'CCU':'P',  'CCC':'P',  'CCA':'P',  'CCG':'P',  
'CAU':'H',  'CAC':'H',  'CAA':'Q',  'CAG':'Q',  
'CGU':'R',  'CGC':'R',  'CGA':'R',  'CGG':'R',  
'AUU':'I',  'AUC':'I',  'AUA':'I',  'AUG':'M',  
'ACU':'T',  'ACC':'T',  'ACA':'T',  'ACG':'T',  
'AAU':'N',  'AAC':'N',  'AAA':'K',  'AAG':'K',  
'AGU':'S',  'AGC':'S',  'AGA':'R',  'AGG':'R',  
'GUU':'V',  'GUC':'V',  'GUA':'V',  'GUG':'V',  
'GCU':'A',  'GCC':'A',  'GCA':'A',  'GCG':'A',  
'GAU':'D',  'GAC':'D',  'GAA':'E',  'GAG':'E',  
'GGU':'G',  'GGC':'G',  'GGA':'G',  'GGG':'G'}

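Subclassing dict

In the interactive shell, help(dict) lists the available methods.
Overriding just the methods you want to change gives you your own class.
Note that dict's constructor and update() do not go through an overridden __setitem__, so the check below only covers d[key] = value.

# Subclass dict to make a dictionary that accepts only str keys
class StrDict(dict):
    # Override only the method to be changed.
    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise ValueError("Key must be str.")
        super().__setitem__(key, value)

d1 = StrDict()
d1["red"] = "0xFF0000"
print(d1["red"])

#d1[0] = 0  # ValueError: Key must be str.

For more on class inheritance, see the related post.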
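Subclassing collections.UserDict (listed in the collections table earlier) is one way to have update() and the constructor respect the key check too, because UserDict routes them through __setitem__. This is a sketch, not the author's code: the class name StrKeyDict is hypothetical, and it raises TypeError rather than ValueError as a matter of convention.

```python
from collections import UserDict

class StrKeyDict(UserDict):          # hypothetical name for this sketch
    """Dictionary that accepts only str keys."""
    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("Key must be str.")
        super().__setitem__(key, value)

d = StrKeyDict()
d["red"] = "0xFF0000"
d.update({"blue": "0x0000FF"})       # also routed through __setitem__

try:
    d.update({0: "zero"})
except TypeError as e:
    print(e)                         # Key must be str.
print(d)                             # {'red': '0xFF0000', 'blue': '0x0000FF'}
```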
　

2017-04-30

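Python's tuples

python

About tuples in Python

・One of the sequence types.
・Similar to a list, but a tuple has no methods other than count and index.
・Lists are written with [], tuples with () or with no brackets at all.
・An immutable type: its contents cannot be changed.
・Elements cannot be replaced, and a tuple cannot be sorted in place.
・Indexing and slicing are available.
・When the fields hold different kinds of data, namedtuple from the collections library is also an option.
・Often used for values that do not change or must not change, and for passing arguments to functions.
・With def fun(*args), a function receives any number of positional arguments as a tuple.
・There is no tuple comprehension; (x for x in ...) creates a generator, so wrap it in tuple().
・Number of elements: len()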
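Two of the points above, *args and the missing tuple comprehension, are easier to see in code. The function name total is made up for this sketch.

```python
def total(*args):
    # args is always a tuple, however many arguments were passed
    print(type(args).__name__, args)
    return sum(args)

result = total(1, 2, 3)          # prints: tuple (1, 2, 3)
print(result)                    # 6

squares = tuple(n * n for n in range(4))   # generator expression fed to tuple()
print(squares)                   # (0, 1, 4, 9)

# sorted() accepts a tuple but returns a new list
print(sorted((3, 1, 2)))         # [1, 2, 3]
```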

○Creating tuples

t1 = (1, 2) # tuple literals use ()
print(t1)   # (1, 2)

t2 = 3, 4   # also a tuple without parentheses
print(t2)   # (3, 4)

t3 = 0,    # a one-element tuple needs a trailing comma
print(t3)  # (0,)

t4 = (2017, ),  # the extra trailing comma makes this a nested tuple
print(t4)       # ((2017,),)

t5 = tuple([7, 8])  # the tuple(iterable) constructor
print(t5)           # (7, 8)

t6 = tuple("abc")
print(t6)           # ('a', 'b', 'c')

　
・Elements cannot be replaced or reassigned. Reading by slice or index works.

t = (3, 4, 5)
print(t[0])       # 3
print(t[0:2])     # (3, 4)
#t[0] = 10        # TypeError: 'tuple' object does not support item assignment

t = (10,) + t[1:] # to change an element, build a new tuple
print(t)          # (10, 4, 5)

list1 = list(t)   # converting to a list, assigning, and converting back also works
list1[0] = 20
t = tuple(list1)
print(t)          # (20, 4, 5)

　
・Simultaneous assignment and unpacking

x = 0
y = 0
r = 5
print(x, y, r)    # 0 0 5
x, y, r = 3, 4, 5 # assign to x, y, r at once
print(x, y, r)    # 3 4 5
x, y = y, x       # swap x and y
print(x, y, r)    # 4 3 5

data = ("math", 100)
sub, point = data   # unpack data into sub and point
print(point)        # 100

　
・Ordering: element [0] is compared first; if equal, element [1] is compared, and so on.

print((3, 5) > (2, 8))  # True
print((3, 2) > (1, 6))  # True
print((1, 2) > (1, 2))  # False
print((1, 2) > (1, 0))  # True
#print((1, 2) > ("", 1))# TypeError: '>' not supported between instances of 'int' and 'str'

　
・Concatenation +
・Repetition *

t1 = (1, 2)
t2 = (10, 20)
print(t1 + t2)    #(1, 2, 10, 20)
print(t1 * 3)     #(1, 2, 1, 2, 1, 2)

　
・Picking characters out of a tuple of strings

t = ("abc", "def")
print(t[1])        #def
print(t[1][0])     #d

　
・zip() builds tuples from several sequences, stopping at the shortest one.
　It returns a zip object, which is an iterator; use list() or tuple() to materialize it,
　or consume it directly in a for loop.

list1 = ("a", "b")
list2 = (1, 2, 3, 4, 5)
z = zip(list1, list2)
print(z)                       # <zip object at 0x～>
for i in z:
    print(i)                   # ('a', 1)    ('b', 2)   (z is exhausted after this loop)
t =tuple(zip(list1, list2))
print(t)                       # (('a', 1), ('b', 2))
l = list(zip(list1, list2))
print(l)                       # [('a', 1), ('b', 2)]
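Two related idioms, sketched with sample data: itertools.zip_longest pads to the longest input instead of stopping at the shortest, and zip(*pairs) undoes a zip.

```python
from itertools import zip_longest

names = ("a", "b")
nums = (1, 2, 3)

short = list(zip(names, nums))
print(short)                     # [('a', 1), ('b', 2)]

padded = list(zip_longest(names, nums, fillvalue="-"))
print(padded)                    # [('a', 1), ('b', 2), ('-', 3)]

# zip(*pairs) reverses the operation ("unzip")
pairs = [("a", 1), ("b", 2)]
letters, digits = zip(*pairs)
print(letters, digits)           # ('a', 'b') (1, 2)
```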

2017-04-28

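Bar charts with matplotlib

matplotlib python library

How to draw bar charts with matplotlib
　
Definition :

bar(left,        # x positions of the bars (named x in newer Matplotlib versions)
    height,      # array of heights
    width=0.8,   # bar thickness
    bottom=None, # baseline of each bar, used when stacking bar charts
    hold=None,   # (dropped in newer Matplotlib versions)
    data=None,   #
    **kwargs)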

　
○Basics

import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [2, 3, 4]
plt.bar(x, y)

[Figure: bar chart produced by the code above]
　
The bars are a little too thick.
The default width is 0.8, so let's try half of that.
　

import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [2, 3, 4]
plt.bar(x, y, width=0.4)

[Figure: the same chart with width=0.4]
Now the bars are slimmer.
　
○Horizontal bar chart

import matplotlib.pyplot as plt
y = [1, 2, 3]
x = [2, 3, 4]
plt.barh(y, x, height=0.4)

[Figure: horizontal bar chart produced by the code above]
　
Some more examples

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
import numpy as np

# To use Japanese text, prepare a font with the next two lines
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\msgothic.ttc', size=14)  # raw string keeps the backslashes literal

w = 0.3  # bar width
y1 = np.array([80, 90, 60])
y2 = np.array([75, 95, 70])
x = np.arange(len(y1))       # x positions, one per data point

plt.bar(x, y1, width=w, label='第1回', align="center")
plt.bar(x + w, y2, width=w, label='第2回', align="center")
plt.legend(loc="best", prop=fp)    # show the legend; pass prop=fp for Japanese text

# Use subject names as x tick labels; pass fontproperties=fp for Japanese text
plt.xticks(x + w/2, ['英語','数学','国語'], fontproperties=fp)
plt.show()

[Figure: grouped bar chart produced by the code above]
　
Adding the option bottom=y1 to the second (orange) bar series gives:

plt.bar(x, y1, width=w, label='第1回', align="center")
plt.bar(x + w, y2, width=w, bottom=y1, label='第2回', align="center")
plt.legend(loc="best", prop=fp)    # show the legend; pass prop=fp for Japanese text

# Use subject names as x tick labels; pass fontproperties=fp for Japanese text
plt.xticks(x + w/2, ['英語','数学','国語'], fontproperties=fp)
plt.show()

[Figure: chart produced by the code above, with the orange bars floating beside the blue ones]
　
Something is floating.
Give the orange bars the same x positions as the blue ones so that they stack, and shift the blue bars right by half a bar width as well.

plt.bar(x + w /2, y1, width=w, label='第1回', align="center")
plt.bar(x + w /2, y2, width=w, bottom=y1, label='第2回', align="center")
plt.legend(loc="best", prop=fp)    # show the legend; pass prop=fp for Japanese text

# Use subject names as x tick labels; pass fontproperties=fp for Japanese text
plt.xticks(x + w/2, ['英語','数学','国語'], fontproperties=fp)
plt.show()

[Figure: stacked bar chart produced by the code above]
　
Another example

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\msgothic.ttc', size=14)

#     prefecture  area  population
data = [("栃木", 6408, 197),
        ("茨城", 6096, 292),
        ("群馬", 6362, 197),
        ("埼玉", 3797, 727),
        ("千葉", 5157, 622),
        ("東京", 2190, 1352),
        ("神奈川", 2415, 913)]

# split data by kind
x, xlabel, y1, y2 = [], [], [] ,[]
for i in range(len(data)):
    x.append(i)
    xlabel.append(data[i][0])
    y1.append(data[i][1])
    y2.append(data[i][2])
# convert to an array so the x positions of the second series (population) can be shifted by broadcasting
x = np.array(x)

w = 0.4
# first bar series (area)
plt.bar(x, y1, width=w, label='面積(km2)', align="center")
plt.xticks(x + w/2, xlabel, fontproperties=fp)  # prefecture names on the x axis

# second bar series (population)
plt.bar(x + w, y2, width=w, color="orange", label='人口(万人)', align="center")
plt.legend(loc="upper right", prop=fp)
plt.show()

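[Figure: area and population bars sharing one vertical axis]
　
To put the area scale on the left and the population scale on the right, use subplots.
With subplots, however:
　・the bar colors are reset  →  set the colors yourself.
　・the legends overlap  →  position each legend so that they do not collide,
　　　using the form bbox_to_anchor=(x, y).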
　

# use subplots to show a vertical axis on both the left and the right
_, ax1 = plt.subplots()
# first bar series (area)
ax1.bar(x, y1, width=w, label='面積(km2)', align="center")
# bbox_to_anchor=(x, y) places the legend outside the plot area
plt.legend(bbox_to_anchor=(0, 1.2), loc="upper left", prop=fp)
plt.xticks(x + w/2, xlabel, fontproperties=fp)

# second bar series (population)
ax2 = ax1.twinx()
ax2.bar(x + w, y2, width=w, color="orange", label='人口(万人)', align="center")
plt.legend(bbox_to_anchor=(1, 1.2), prop=fp)

plt.show()

　
[Figure: area on the left axis, population on the right axis]
　
○Counting the hiragana in the Hyakunin Isshu

# -*- coding: utf-8 -*-

from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\msgothic.ttc', size=14)

# Load the Hyakunin Isshu data
url = 'http://python-remrin.hatenadiary.jp/entry/2017/04/23/000000'
fetched = pd.read_html(url)         # list of DataFrames

d = fetched[0]                      # take the first DataFrame
l = list(d[1])                      # extract column 1 as a list
text = "".join(l).replace(" ", "")  # remove the spaces between phrases
c = Counter(text)                   # count hiragana frequencies

# Chart of the top 20
x = range(20)
mc = c.most_common(20)              # the 20 most frequent
ch = [pair[0] for pair in mc]           # list of hiragana
f = np.array([pair[1] for pair in mc])  # their frequencies

plt.bar(x, f)
plt.xticks(x, ch, fontproperties=fp)  # fontproperties=fp for Japanese text
plt.title("百人一首のひらがな頻度 TOP20", fontproperties=fp)
plt.show()

# Chart of the last 20
# Note: [-21:-1] takes 20 entries but skips the very last one; [-20:] would include it
least = c.most_common()[-21:-1]
ch = [pair[0] for pair in least]
f = np.array([pair[1] for pair in least])

plt.bar(x, f)
plt.xticks(x, ch, fontproperties=fp)  # fontproperties=fp for Japanese text
plt.title("百人一首のひらがな頻度 Last20", fontproperties=fp)
plt.show()

　
[Figure: TOP20 chart]
[Figure: Last20 chart]
　
The vertical scales differ, so the Last20 hiragana look as if they were the more frequent ones.
Adding plt.ylim(0, 220) so that both charts use the same vertical range gives:

# Chart of the last 20
# Note: [-21:-1] takes 20 entries but skips the very last one; [-20:] would include it
least = c.most_common()[-21:-1]
ch = [pair[0] for pair in least]
f = np.array([pair[1] for pair in least])

plt.bar(x, f)
plt.xticks(x, ch, fontproperties=fp)  # fontproperties=fp for Japanese text
plt.title("百人一首のひらがな頻度 Last20", fontproperties=fp)
plt.ylim(0, 220)  # set the vertical range
plt.show()

　
[Figure: Last20 chart with the vertical range set to 0-220]
　
・Showing the top and the bottom together

# -*- coding: utf-8 -*-

from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\msgothic.ttc', size=14)

url = 'http://python-remrin.hatenadiary.jp/entry/2017/04/23/000000'
fetched = pd.read_html(url)         # list of DataFrames

d = fetched[0]                      # take the first DataFrame
l = list(d[1])                      # extract column 1 as a list
text = "".join(l).replace(" ", "")  # remove the spaces between phrases
c = Counter(text)                   # count hiragana frequencies
Top, Last = 20, 6                   # how many to show from the top and the bottom

# Chart of the top entries
x = range(Top)
mc = c.most_common(Top)             # the Top most frequent
ch = [pair[0] for pair in mc]           # list of hiragana
f = np.array([pair[1] for pair in mc])  # their frequencies
plt.bar(x, f, color="green")

# Chart of the bottom entries
# Note: this slice skips the very last entry; [-Last:] would include it
least = c.most_common()[-(Last+1):-1]
ch2 = [pair[0] for pair in least]
f2 = np.array([pair[1] for pair in least])
x2 = range(Top + 2, Top + 2 + Last)
plt.bar(x2, f2, color="gray")

# x tick labels, with two blanks as a gap between the groups
xtick = ch + [" ", " "] + ch2
plt.xticks(range(Top + 2 + Last), xtick, fontproperties=fp)

plt.title("百人一首のひらがな頻度", fontproperties=fp)
plt.show()

　
[Figure: combined chart of the top 20 and bottom 6]
　
Reference sites:
Symfoware
サイエンティストとマーケターのはざま

2017-04-28

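Numbers in Python

python

Python 3 has three numeric types: integers (int), floating-point numbers (float), and complex numbers (complex).
The boolean type (bool) can also be regarded as numeric (True=1, False=0).
　
○Integer type	no limit on the number of digits
Hexadecimal	prefix 0x
Octal	prefix 0o
Binary	prefix 0b (b: binary)
・int()  converts a string or a float to an integer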
　

0b1001             # 9
0o12               # 10
0xAA               # 170
1000**10           # 1000000000000000000000000000000
int("10")          # 10
int("10", 2)       # 2    parsed as binary
int("1A", 16)      # 26   hexadecimal
int("1A", base=16) # 26   hexadecimal, using the optional base argument
int("AB", 32)      # 331  base 32
int("20", 5)       # 10   base 5
int(1.234)         # 1    truncates toward zero
#int("1.234")      # ValueError: invalid literal for int() with base 10
int(float("1.234"))# 1
int(1+2)           # 3
#int("1+2")        # ValueError: invalid literal for int() with base 10
True               # True
True + 1           # 2    True is treated as 1 in arithmetic

　
Type conversion
Besides int(): bin() converts to binary, oct() to octal, hex() to hexadecimal (each returns a string)

bin(9)       #'0b1001'
#bin("9")    #TypeError: 'str' object cannot be interpreted as an integer
oct(9)       #'0o11'
hex(255)     #'0xff'
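When the prefix is not wanted, or the digits need zero padding, format() and f-strings are an alternative to bin(), oct() and hex(). A small sketch with 255 as the sample value:

```python
n = 255

# format() gives the digits without the 0b / 0o / 0x prefix
digits = (format(n, "b"), format(n, "o"), format(n, "x"))
print(digits)                            # ('11111111', '377', 'ff')

# f-strings accept the same format codes; # adds the prefix, 0N pads with zeros
padded = f"{n:#06x}"
print(padded)                            # 0x00ff

# int() reverses the conversion; base 0 reads the prefix to pick the base
back = (int(hex(n), 16), int(bin(n), 2), int("0o377", 0))
print(back)                              # (255, 255, 255)
```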

　
○Floating-point type
　For calculations that need exact decimal precision, use the decimal library.

2.5e5              #250000.0
0.1 + 0.1          #0.2
0.2 + 0.1          #0.30000000000000004
0.1 * 3            #0.30000000000000004
.1+.1+.1+.1+.1+.1+.1+.1+.1+.1   #0.9999999999999999
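The rounding errors above are why floats are usually compared with a tolerance, and why decimal helps when exact decimal arithmetic matters. A minimal sketch:

```python
from decimal import Decimal
import math

print(0.1 * 3 == 0.3)                          # False
print(math.isclose(0.1 * 3, 0.3))              # True

# Build Decimals from strings, not floats, to keep the exact value
exact = Decimal("0.1") * 3
print(exact == Decimal("0.3"))                 # True
total = sum([Decimal("0.1")] * 10)
print(total)                                   # 1.0
```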

　
○Fractions
　1/2  →  0.5
　To keep a value as a fraction, with its numerator and denominator intact, use the fractions library.
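A short sketch of the fractions module in use, with arbitrary sample values:

```python
from fractions import Fraction

a = Fraction(1, 2)
b = Fraction(1, 3)
s = a + b
print(s)                       # 5/6
print(a * b)                   # 1/6
print(Fraction(6, 8))          # 3/4  (reduced automatically)
print(s.numerator, s.denominator)   # 5 6
print(float(s))                # converts back to a float when needed
```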
　
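○Complex type      suffix j or J
　complex(re, im)  re: real part  im: imaginary part
　・Literal: z = 2 + 3j  The imaginary unit is written j, not i.
　・z = complex(2, 3)  or complex("2+3j")  ← in the string form there must be no spaces around the +.
　　　j is a suffix of the numeric literal, not a name on its own; write 1j for the imaginary unit.
　・z.real # 2.0
　・z.imag # 3.0
　・z.conjugate()  complex conjugate
　・abs(z) # (z.real**2 + z.imag**2)**0.5
　・Complex numbers cannot be compared with < or >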
　

z = 3 + 2j         # (3+2j)
z = complex(3, 2)  # (3+2j)
#z = 3 + 2 * j     # NameError: name 'j' is not defined
z.real             # 3.0
z.imag             # 2.0
abs(z)             # 3.6055512754639896
#z < 3             # TypeError: '<' not supported between instances of 'complex' and 'int'

　The cmath library provides functions for complex arithmetic, coordinate conversion, and so on.
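As an example of the coordinate conversion mentioned above, cmath.polar() and cmath.rect() convert between rectangular and polar form. The value 3 + 4j is just a convenient sample.

```python
import cmath
import math

z = 3 + 4j
r, phi = cmath.polar(z)              # modulus and argument (radians)
print(r)                             # 5.0
print(math.isclose(phi, math.atan2(4, 3)))   # True

# rect() converts polar coordinates back to a complex number
w = cmath.rect(r, phi)
print(cmath.isclose(w, z))           # True

print(cmath.sqrt(-1))                # 1j
```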

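○Operators and related functions
・In "1 + 2", + is the operator and 1 and 2 are the operands
・print() converts its arguments to strings automatically
・sum +  difference -  product *  quotient (float) /  quotient (floored integer) //  remainder %  power **
・absolute value: abs()
・sign inversion: -x
・power: 2 to the 8th is pow(2,8) or 2**8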
　
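○Rounding

//	rounds down, toward negative infinity
int()	truncates toward zero
round()	rounds to the nearest value; exact halves go to the nearest even number; the number of digits can be given
math.ceil()	rounds up, toward positive infinity
math.floor()	rounds down, toward negative infinity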

import math
3 // 2           #1
int(1.5)         #1
round(1.5)       #2
math.ceil(1.5)   #2
math.floor(1.5)  #1

-3 // 2           #-2
int(-1.5)         #-1
round(-1.5)       #-2
math.ceil(-1.5)   #-1
math.floor(-1.5)  #-2

round(1.2345)     #1
round(1.2345, 2)  #1.23
round(1.3579, 2)  #1.36
round(1.2, 5)     #1.2
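The examples above happen to match schoolbook rounding, but Python 3's round() sends exact halves to the nearest even number. The sketch below shows that, plus one way to get half-up rounding with decimal. The helper name round_half_up is made up for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Built-in round(): ties go to the even neighbour
ties = (round(0.5), round(1.5), round(2.5), round(3.5))
print(ties)                            # (0, 2, 2, 4)

# Binary floats add a twist: 2.675 is stored slightly below 2.675
print(round(2.675, 2))                 # 2.67

# Schoolbook rounding (ties away from zero) via decimal
def round_half_up(text, places="1"):
    """Round the decimal string `text` to the precision of `places`."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_HALF_UP)

print(round_half_up("2.5"))            # 3
print(round_half_up("2.675", "0.01"))  # 2.68
print(round_half_up("-1.5"))           # -2
```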

　
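・Bitwise operators

Operator	Meaning
&	bitwise and
|	bitwise or
^	bitwise xor
~	bitwise not; ~a is the same as -(a+1)
<<	left shift (×2)
>>	right shift (÷2, rounding down); bits pushed off the right end are discarded and the sign is preserved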
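Each operator from the table applied to two small sample values:

```python
a, b = 0b1100, 0b1010          # 12 and 10

print(bin(a & b))              # 0b1000
print(bin(a | b))              # 0b1110
print(bin(a ^ b))              # 0b110
print(~a)                      # -13
print(a << 2, a >> 2)          # 48 3
print(-5 >> 1)                 # -3  (floor of -2.5; the sign is kept)
```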

　
"12".zfill(4) gives '0012'. For format, see the post on strings.
　
・character → code point: ord()
・code point → character: chr()

ord("a")   # 97
ord("あ")  # 12354
chr(97)    # 'a'