Notes on NLP 100 Exercise 2020 (Rev 2), using Python 3.11
- Chapter 1: Warm-up
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Parsing
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN and CNN
- Chapter 10: Machine Translation
Chapter 7: Word Vectors
Notes
Install gensim with
pip install gensim
Importing it then failed with
ImportError: cannot import name 'triu' from 'scipy.linalg'
so, following a reference, I downgraded SciPy to v1.10.1 (e.g. pip install scipy==1.10.1).
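A quick way to confirm the environment after the downgrade (a minimal sketch; any recent gensim 4.x should be fine here):

import gensim
import scipy

# Expect scipy 1.10.1 after the downgrade, and a gensim 4.x release.
print(scipy.__version__, gensim.__version__)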
60. Loading and displaying word vectors
Notes
Load the pre-trained vectors with gensim's KeyedVectors; the GoogleNews file is a word2vec-format binary (not fastText, as I first guessed). I followed the reference below.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model["United_States"])
[-3.61328125e-02 -4.83398438e-02 2.35351562e-01 1.74804688e-01 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02 1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02 1.74804688e-01 -7.71484375e-02 2.58789062e-02 -7.66601562e-02 -3.80859375e-02 1.35742188e-01 3.75976562e-02 -4.19921875e-02 -3.56445312e-02 5.34667969e-02 3.68118286e-04 -1.66992188e-01 -1.17187500e-01 1.41601562e-01 -1.69921875e-01 -6.49414062e-02 -1.66992188e-01 1.00585938e-01 1.15722656e-01 -2.18750000e-01 -9.86328125e-02 -2.56347656e-02 1.23046875e-01 -3.54003906e-02 -1.58203125e-01 -1.60156250e-01 2.94189453e-02 8.15429688e-02 6.88476562e-02 1.87500000e-01 6.49414062e-02 1.15234375e-01 -2.27050781e-02 3.32031250e-01 -3.27148438e-02 1.77734375e-01 -2.08007812e-01 4.54101562e-02 -1.23901367e-02 1.19628906e-01 7.44628906e-03 -9.03320312e-03 1.14257812e-01 1.69921875e-01 -2.38281250e-01 -2.79541016e-02 -1.21093750e-01 2.47802734e-02 7.71484375e-02 -2.81982422e-02 -4.71191406e-02 1.78222656e-02 -1.23046875e-01 -5.32226562e-02 2.68554688e-02 -3.11279297e-02 -5.59082031e-02 -5.00488281e-02 -3.73535156e-02 1.25976562e-01 5.61523438e-02 1.51367188e-01 4.29687500e-02 -2.08007812e-01 -4.78515625e-02 2.78320312e-02 1.81640625e-01 2.20703125e-01 -3.61328125e-02 -8.39843750e-02 -3.69548798e-05 -9.52148438e-02 -1.25000000e-01 -1.95312500e-01 -1.50390625e-01 -4.15039062e-02 1.31835938e-01 1.17675781e-01 1.91650391e-02 5.51757812e-02 -9.42382812e-02 -1.08886719e-01 7.32421875e-02 -1.15234375e-01 8.93554688e-02 -1.40625000e-01 1.45507812e-01 4.49218750e-02 -1.10473633e-02 -1.62353516e-02 4.05883789e-03 3.75976562e-02 -6.98242188e-02 -5.46875000e-02 2.17285156e-02 -9.47265625e-02 4.24804688e-02 1.81884766e-02 -1.73339844e-02 4.63867188e-02 -1.42578125e-01 1.99218750e-01 1.10839844e-01 2.58789062e-02 -7.08007812e-02 -5.54199219e-02 3.45703125e-01 1.61132812e-01 -2.44140625e-01 -2.59765625e-01 -9.71679688e-02 8.00781250e-02 -8.78906250e-02 -7.22656250e-02 1.42578125e-01 -8.54492188e-02 -3.18359375e-01 8.30078125e-02 6.34765625e-02 1.64062500e-01 -1.92382812e-01 -1.17675781e-01 -5.41992188e-02 -1.56250000e-01 -1.21582031e-01 -4.95605469e-02 1.20117188e-01 -3.83300781e-02 5.51757812e-02 -8.97216797e-03 4.32128906e-02 6.93359375e-02 8.93554688e-02 2.53906250e-01 1.65039062e-01 1.64062500e-01 -1.41601562e-01 4.58984375e-02 1.97265625e-01 -8.98437500e-02 3.90625000e-02 -1.51367188e-01 -8.60595703e-03 -1.17675781e-01 -1.97265625e-01 -1.12792969e-01 1.29882812e-01 1.96289062e-01 1.56402588e-03 3.93066406e-02 2.17773438e-01 -1.43554688e-01 6.03027344e-02 -1.35742188e-01 1.16210938e-01 -1.59912109e-02 2.79296875e-01 1.46484375e-01 -1.19628906e-01 1.76757812e-01 1.28906250e-01 -1.49414062e-01 6.93359375e-02 -1.72851562e-01 9.22851562e-02 1.33056641e-02 -2.00195312e-01 -9.76562500e-02 -1.65039062e-01 -2.46093750e-01 -2.35595703e-02 -2.11914062e-01 1.84570312e-01 -1.85546875e-02 2.16796875e-01 5.05371094e-02 2.02636719e-02 4.25781250e-01 1.28906250e-01 -2.77099609e-02 1.29882812e-01 -1.15722656e-01 -2.05078125e-02 1.49414062e-01 7.81250000e-03 -2.05078125e-01 -8.05664062e-02 -2.67578125e-01 -2.29492188e-02 -8.20312500e-02 8.64257812e-02 7.61718750e-02 -3.66210938e-02 5.22460938e-02 -1.22070312e-01 -1.44042969e-02 -2.69531250e-01 8.44726562e-02 -2.52685547e-02 -2.96630859e-02 -1.68945312e-01 1.93359375e-01 -1.08398438e-01 1.94091797e-02 -1.80664062e-01 1.93359375e-01 -7.08007812e-02 5.85937500e-02 -1.01562500e-01 -1.31835938e-01 7.51953125e-02 -7.66601562e-02 3.37219238e-03 -8.59375000e-02 1.25000000e-01 2.92968750e-02 
1.70898438e-01 -9.37500000e-02 -1.09375000e-01 -2.50244141e-02 2.11914062e-01 -4.44335938e-02 6.12792969e-02 2.62451172e-02 -1.77734375e-01 1.23046875e-01 -7.42187500e-02 -1.67968750e-01 -1.08886719e-01 -9.04083252e-04 -7.37304688e-02 5.49316406e-02 6.03027344e-02 8.39843750e-02 9.17968750e-02 -1.32812500e-01 1.22070312e-01 -8.78906250e-03 1.19140625e-01 -1.94335938e-01 -6.64062500e-02 -2.07031250e-01 7.37304688e-02 8.93554688e-02 1.81884766e-02 -1.20605469e-01 -2.61230469e-02 2.67333984e-02 7.76367188e-02 -8.30078125e-02 6.78710938e-02 -3.54003906e-02 3.10546875e-01 -2.42919922e-02 -1.41601562e-01 -2.08007812e-01 -4.57763672e-03 -6.54296875e-02 -4.95605469e-02 2.22656250e-01 1.53320312e-01 -1.38671875e-01 -5.24902344e-02 4.24804688e-02 -2.38281250e-01 1.56250000e-01 5.83648682e-04 -1.20605469e-01 -9.22851562e-02 -4.44335938e-02 3.61328125e-02 -1.86767578e-02 -8.25195312e-02 -8.25195312e-02 -4.05273438e-02 1.19018555e-02 1.69921875e-01 -2.80761719e-02 3.03649902e-03 9.32617188e-02 -8.49609375e-02 1.57470703e-02 7.03125000e-02 1.62353516e-02 -2.27050781e-02 3.51562500e-02 2.47070312e-01 -2.67333984e-02]
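As a quick sanity check (not part of the exercise), the loaded model should be 300-dimensional with a vocabulary of about 3 million entries:

# Minimal check of the loaded KeyedVectors: dimensionality and vocabulary size.
print(model.vector_size)        # 300
print(len(model.key_to_index))  # 3,000,000 entries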
61. Word similarity
Notes
The similarity method computes the cosine similarity $ C $. Computing it directly from the definition,
$ C = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} $,
gives the same result.
print(model.similarity("United_States", "U.S."))
0.73107743
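For reference, a minimal sketch of the "compute it directly from the definition" route, which should reproduce the value above:

import numpy as np

# Cosine similarity straight from the definition C = u.v / (|u||v|).
u = model["United_States"]
v = model["U.S."]
print(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))  # ~0.731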
62. Top 10 most similar words
Notes
most_similar returns words in descending order of cosine similarity.
The argument can be either a vector or a word (key).
When a key is given, the input word itself (here "United_States") is excluded from the output.
Misspellings such as Unites_States and Untied_States come out as highly similar words. That makes some sense, since they appear in the same contexts, but they even score higher than U.S.
import numpy as np

v = model["United_States"]
v = v/np.linalg.norm(v)  # normalization
print("----- input:vector -----")
for item, value in model.most_similar(v, topn=10):
    print(f"{item} {value}")
print("----- input:key -----")
for item, value in model.most_similar('United_States', topn=10):
    print(f"{item} {value}")
----- input:vector -----
United_States 1.0
Unites_States 0.7877248525619507
Untied_States 0.7541370987892151
United_Sates 0.7400724291801453
U.S. 0.7310774326324463
theUnited_States 0.6404393911361694
America 0.6178410053253174
UnitedStates 0.6167312264442444
Europe 0.6132988929748535
countries 0.6044804453849792
----- input:key -----
Unites_States 0.7877248525619507
Untied_States 0.7541370987892151
United_Sates 0.7400724291801453
U.S. 0.7310774326324463
theUnited_States 0.6404393911361694
America 0.6178410053253174
UnitedStates 0.6167312264442444
Europe 0.6132988929748535
countries 0.6044804453849792
Canada 0.601906955242157
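As an aside, with gensim 4.x the manual normalization above is not strictly necessary; get_vector can return the unit-normalized vector directly (a small sketch, not from the original):

# Equivalent to v / np.linalg.norm(v) above (gensim 4.x API).
v_norm = model.get_vector("United_States", norm=True)
print(model.most_similar(v_norm, topn=3))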
63. Analogies by additive composition
Notes
Adding/subtracting the vectors before normalization and after normalization give slightly different results.
When most_similar is called with positive and negative word lists, the result matches the normalize-then-add/subtract case.
Note that the words given in the positive and negative lists (here "Spain", "Madrid", "Athens") are excluded from the output.
Aristeidis Grigoriadis, who shows up near the top, is a Greek swimmer.
v1 = model["Spain"]
v2 = model["Madrid"]
v3 = model["Athens"]
pos = ["Spain", "Athens",]
neg = ["Madrid",]
print("----- input:added/subtracted vector(un-normalized) -----")
for item, value in model.most_similar(v1 - v2 + v3, topn=10):
    print(f"{item} {value}")
print("----- input:added/subtracted vector(normalized) -----")
# normalization
v1 = v1/np.linalg.norm(v1)
v2 = v2/np.linalg.norm(v2)
v3 = v3/np.linalg.norm(v3)
for item, value in model.most_similar(v1 - v2 + v3, topn=10):
    print(f"{item} {value}")
print("----- keys -----")
for item, value in model.most_similar(positive=pos, negative=neg, topn=10):
    print(f"{item} {value}")
----- input:added/subtracted vector(un-normalized) -----
Athens 0.7528455853462219
Greece 0.6685472130775452
Aristeidis_Grigoriadis 0.5495778322219849
Ioannis_Drymonakos 0.5361457467079163
Greeks 0.5351786017417908
Ioannis_Christou 0.5330225825309753
Hrysopiyi_Devetzi 0.5088489055633545
Iraklion 0.5059264302253723
Greek 0.5040615797042847
Athens_Greece 0.5034108757972717
----- input:added/subtracted vector(normalized) -----
Athens 0.7548093199729919
Greece 0.6898480653762817
Aristeidis_Grigoriadis 0.560684859752655
Ioannis_Drymonakos 0.555290937423706
Greeks 0.545068621635437
Ioannis_Christou 0.5400862097740173
Hrysopiyi_Devetzi 0.5248445272445679
Heraklio 0.5207759737968445
Athens_Greece 0.516880989074707
Lithuania 0.5166865587234497
----- keys -----
Greece 0.6898480653762817
Aristeidis_Grigoriadis 0.560684859752655
Ioannis_Drymonakos 0.5552908778190613
Greeks 0.545068621635437
Ioannis_Christou 0.5400862097740173
Hrysopiyi_Devetzi 0.5248445272445679
Heraklio 0.5207759737968445
Athens_Greece 0.516880989074707
Lithuania 0.5166865587234497
Iraklion 0.5146791338920593
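The match between the normalized case and the positive/negative case makes sense: gensim's most_similar appears to average the unit-normalized input vectors with +1/-1 weights. A rough, illustrative sketch of that combination (not the original code):

import numpy as np

def analogy_vector(positive, negative):
    # Average the unit-normalized word vectors with +1/-1 weights,
    # which is (roughly) what most_similar does with positive/negative lists.
    vecs = [model[w] / np.linalg.norm(model[w]) for w in positive]
    vecs += [-model[w] / np.linalg.norm(model[w]) for w in negative]
    return np.mean(vecs, axis=0)

v = analogy_vector(["Spain", "Athens"], ["Madrid"])
# Input words are not excluded when a raw vector is passed.
print(model.most_similar(v, topn=3))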
64. Experiments with analogy data
Notes
The added/subtracted vector is also computed, although it is omitted from the output below.
Lines starting with : are header lines indicating the analogy category: capital-country analogies, family-relation analogies, and so on.
Grammatical categories such as adjective-adverb analogies start with gram (short for grammar?).
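For reference, a quick peek at the beginning of the file; the standard questions-words.txt should start roughly like this:

# Print the first few lines of the analogy dataset to confirm the format.
with open("questions-words.txt") as f:
    for line in f.readlines()[:3]:
        print(line.rstrip())
# Expected output (roughly):
# : capital-common-countries
# Athens Greece Baghdad Iraq
# Athens Greece Bangkok Thailand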
analogies = dict()
buff = []
key = None
with open("questions-words.txt") as f:
    for line in f.readlines():
        ws = line.strip().split()
        if ws[0] == ":":
            if key:
                analogies[key] = buff
            buff = []
            key = ws[1]
        else:
            v1 = model[ws[0]]
            v2 = model[ws[1]]
            v3 = model[ws[2]]
            # normalization
            v1 = v1/np.linalg.norm(v1)
            v2 = v2/np.linalg.norm(v2)
            v3 = v3/np.linalg.norm(v3)
            v0 = v2 - v1 + v3
            rst = model.most_similar(positive=[ws[1], ws[2]], negative=[ws[0]], topn=1)
            pred, sim = rst[0]
            buff.append([*ws, v0, pred, sim])
analogies[key] = buff

# output
for key, vals in analogies.items():
    print(f"* {key}")
    vals_sorted = sorted(vals, key=lambda v: v[6], reverse=True)  # sort by similarity
    for v in vals_sorted[:5]:
        print(f"{v[1]} - {v[0]} + {v[2]} -> {v[5]} (sim={v[6]:.3f}, correct: {v[3]})")
    print("...")
    for v in vals_sorted[-5:]:
        print(f"{v[1]} - {v[0]} + {v[2]} -> {v[5]} (sim={v[6]:.3f}, correct: {v[3]})")
* capital-common-countries Japan - Tokyo + Moscow -> Russia (sim=0.886, correct: Russia) Russia - Moscow + Tehran -> Iran (sim=0.878, correct: Iran) Russia - Moscow + Tokyo -> Japan (sim=0.877, correct: Japan) Russia - Moscow + Beijing -> China (sim=0.869, correct: China) Cuba - Havana + Tehran -> Iran (sim=0.864, correct: Iran) ... Japan - Tokyo + Bern -> Switzerland (sim=0.483, correct: Switzerland) China - Beijing + Bern -> Bern_NC (sim=0.466, correct: Switzerland) England - London + Bern -> Hanover (sim=0.453, correct: Switzerland) Iraq - Baghdad + Bern -> coach_Bobby_Curlings (sim=0.435, correct: Switzerland) Afghanistan - Kabul + Bern -> Bern_NC (sim=0.423, correct: Switzerland) * capital-world Ukraine - Kiev + Moscow -> Russia (sim=0.888, correct: Russia) Russia - Moscow + Tehran -> Iran (sim=0.878, correct: Iran) Russia - Moscow + Tokyo -> Japan (sim=0.877, correct: Japan) Armenia - Yerevan + Baku -> Azerbaijan (sim=0.867, correct: Azerbaijan) Philippines - Manila + Moscow -> Russia (sim=0.861, correct: Russia) ... Madagascar - Antananarivo + Bern -> coach_Bobby_Curlings (sim=0.420, correct: Switzerland) Jordan - Amman + Antananarivo -> Mohamed_Berte (sim=0.413, correct: Madagascar) Tuvalu - Funafuti + Kingston -> Jamaica (sim=0.407, correct: Jamaica) Libya - Tripoli + Bern -> Switzerland (sim=0.402, correct: Switzerland) Greenland - Nuuk + Algiers -> Gulf (sim=0.390, correct: Algeria) * currency baht - Thailand + Malaysia -> RM## (sim=0.822, correct: ringgit) ringgit - Malaysia + Thailand -> baht (sim=0.820, correct: baht) baht - Thailand + Bulgaria -> leva (sim=0.814, correct: lev) baht - Thailand + Nigeria -> naira (sim=0.811, correct: naira) ringgit - Malaysia + Nigeria -> naira (sim=0.806, correct: naira) ... dong - Vietnam + USA -> lifts_Squaw_Valley (sim=0.376, correct: dollar) lev - Bulgaria + USA -> lifts_Squaw_Valley (sim=0.376, correct: dollar) riel - Cambodia + USA -> lifts_Squaw_Valley (sim=0.374, correct: dollar) dram - Armenia + USA -> Tennent_lager (sim=0.373, correct: dollar) rial - Iran + USA -> V_Singh_Fij (sim=0.346, correct: dollar) * city-in-state Nebraska - Omaha + Wichita -> Kansas (sim=0.825, correct: Kansas) Michigan - Detroit + Milwaukee -> Wisconsin (sim=0.824, correct: Wisconsin) Illinois - Chicago + Louisville -> Kentucky (sim=0.819, correct: Kentucky) Florida - Tampa + Honolulu -> Hawaii (sim=0.817, correct: Hawaii) Hawaii - Honolulu + Anchorage -> Alaska (sim=0.809, correct: Alaska) ... Washington - Spokane + Mesa -> Piñon (sim=0.424, correct: Arizona) Georgia - Atlanta + Irving -> Marilyn_Flax_whose (sim=0.420, correct: Texas) Washington - Tacoma + Mesa -> Arizona (sim=0.420, correct: Arizona) Washington - Tacoma + Fontana -> WUSM_SL (sim=0.392, correct: California) Washington - Spokane + Fontana -> Gus_Guida (sim=0.385, correct: California) * family her - his + he -> she (sim=0.944, correct: she) she - he + his -> her (sim=0.941, correct: her) granddaughter - grandson + son -> daughter (sim=0.937, correct: daughter) daughter - son + grandson -> granddaughter (sim=0.937, correct: granddaughter) daughters - sons + son -> daughter (sim=0.932, correct: daughter) ... 
mom - dad + stepbrother -> mother (sim=0.548, correct: stepsister) policewoman - policeman + stepbrother -> stepfather (sim=0.546, correct: stepsister) stepsister - stepbrother + groom -> bride (sim=0.544, correct: bride) stepdaughter - stepson + stepbrother -> niece (sim=0.542, correct: stepsister) stepmother - stepfather + stepbrother -> sister (sim=0.521, correct: stepsister) * gram1-adjective-to-adverb quickly - quick + swift -> swiftly (sim=0.770, correct: swiftly) fortunately - fortunate + lucky -> luckily (sim=0.767, correct: luckily) quickly - quick + rapid -> rapidly (sim=0.762, correct: rapidly) luckily - lucky + fortunate -> fortunately (sim=0.760, correct: fortunately) usually - usual + typical -> typically (sim=0.753, correct: typically) ... happily - happy + possible -> humanly_possible (sim=0.390, correct: possibly) freely - free + possible -> deliberatively (sim=0.385, correct: possibly) mostly - most + possible -> mainly (sim=0.376, correct: possibly) freely - free + immediate -> immediately (sim=0.373, correct: immediately) furiously - furious + immediate -> frantically (sim=0.368, correct: immediately) * gram2-opposite inefficient - efficient + productive -> unproductive (sim=0.738, correct: unproductive) uncomfortable - comfortable + pleasant -> unpleasant (sim=0.737, correct: unpleasant) illogical - logical + rational -> irrational (sim=0.734, correct: irrational) unreasonable - reasonable + rational -> irrational (sim=0.722, correct: irrational) unsure - sure + aware -> unaware (sim=0.714, correct: unaware) ... impossible - possible + informed -> informing (sim=0.390, correct: uninformed) impossibly - possibly + responsible -> solely_responsible (sim=0.389, correct: irresponsible) undecided - decided + possible -> persuadable (sim=0.387, correct: impossible) uninformative - informative + certain -> predetermined (sim=0.387, correct: uncertain) uninformative - informative + possible -> humanly_possible (sim=0.387, correct: impossible) * gram3-comparative larger - large + big -> bigger (sim=0.848, correct: bigger) stronger - strong + hard -> harder (sim=0.845, correct: harder) tighter - tight + tough -> tougher (sim=0.840, correct: tougher) harder - hard + tough -> tougher (sim=0.834, correct: tougher) bigger - big + large -> larger (sim=0.834, correct: larger) ... worse - bad + short -> shorter (sim=0.458, correct: shorter) newer - new + low -> high (sim=0.453, correct: lower) worse - bad + new -> thenew (sim=0.452, correct: newer) greater - great + old -> yearold (sim=0.427, correct: older) newer - new + wide -> scatter_bomblets (sim=0.397, correct: wider) * gram4-superlative strangest - strange + weird -> weirdest (sim=0.827, correct: weirdest) weirdest - weird + strange -> strangest (sim=0.824, correct: strangest) strongest - strong + sharp -> sharpest (sim=0.818, correct: sharpest) lowest - low + high -> highest (sim=0.814, correct: highest) highest - high + low -> lowest (sim=0.805, correct: lowest) ... 
tallest - tall + short -> shortest (sim=0.455, correct: shortest) oldest - old + fast -> fastest (sim=0.449, correct: fastest) warmest - warm + quick -> quickest (sim=0.444, correct: quickest) youngest - young + quick -> quickest (sim=0.444, correct: quickest) oldest - old + quick -> easiest (sim=0.426, correct: quickest) * gram5-present-participle decreasing - decrease + increase -> increasing (sim=0.885, correct: increasing) implementing - implement + enhance -> enhancing (sim=0.878, correct: enhancing) increasing - increase + decrease -> decreasing (sim=0.874, correct: decreasing) singing - sing + swim -> swimming (sim=0.869, correct: swimming) enhancing - enhance + generate -> generating (sim=0.860, correct: generating) ... thinking - think + go -> Going (sim=0.471, correct: going) coding - code + look -> looking (sim=0.461, correct: looking) thinking - think + say -> Say (sim=0.447, correct: saying) coding - code + go -> goes (sim=0.443, correct: going) saying - say + look -> looks (sim=0.440, correct: looking) * gram6-nationality-adjective Japanese - Japan + China -> Chinese (sim=0.928, correct: Chinese) Chinese - China + Japan -> Japanese (sim=0.927, correct: Japanese) German - Germany + Italy -> Italian (sim=0.926, correct: Italian) Russian - Russia + Bulgaria -> Bulgarian (sim=0.919, correct: Bulgarian) Italian - Italy + Germany -> German (sim=0.917, correct: German) ... English - England + Albania -> Macedonian (sim=0.534, correct: Albanian) English - England + Austria -> Austrian (sim=0.533, correct: Austrian) Albanian - Albania + England -> stock_symbol_BNK (sim=0.530, correct: English) Argentinean - Argentina + England -> ticker_symbol_BNK (sim=0.514, correct: English) Belorussian - Belarus + England -> stock_symbol_BNK (sim=0.514, correct: English) * gram7-past-tense increased - increasing + decreasing -> decreased (sim=0.900, correct: decreased) decreased - decreasing + increasing -> increased (sim=0.886, correct: increased) danced - dancing + singing -> sang (sim=0.866, correct: sang) saw - seeing + taking -> took (sim=0.848, correct: took) sang - singing + screaming -> screamed (sim=0.847, correct: screamed) ... fed - feeding + saying -> implying (sim=0.459, correct: said) went - going + knowing -> hid (sim=0.458, correct: knew) decreased - decreasing + saying -> stating (sim=0.457, correct: said) implemented - implementing + saying -> stating (sim=0.455, correct: said) spent - spending + looking -> look (sim=0.443, correct: looked) * gram8-plural horses - horse + dog -> dogs (sim=0.931, correct: dogs) dogs - dog + horse -> horses (sim=0.926, correct: horses) cats - cat + dog -> dogs (sim=0.909, correct: dogs) dogs - dog + cat -> cats (sim=0.898, correct: cats) cats - cat + horse -> horses (sim=0.883, correct: horses) ... monkeys - monkey + hand -> hands (sim=0.449, correct: hands) dollars - dollar + lion -> lions (sim=0.445, correct: lions) pineapples - pineapple + hand -> hands (sim=0.432, correct: hands) dollars - dollar + mouse -> Logitech_MX_Revolution (sim=0.432, correct: mice) mice - mouse + hand -> hands (sim=0.429, correct: hands) * gram9-plural-verbs increases - increase + decrease -> decreases (sim=0.878, correct: decreases) decreases - decrease + increase -> increases (sim=0.878, correct: increases) provides - provide + generate -> generates (sim=0.862, correct: generates) generates - generate + provide -> provides (sim=0.852, correct: provides) provides - provide + enhance -> enhances (sim=0.842, correct: enhances) ... 
implements - implement + say -> argue (sim=0.472, correct: says) talks - talk + see -> negotiations (sim=0.471, correct: sees) listens - listen + say -> believe (sim=0.464, correct: says) works - work + say -> argue (sim=0.458, correct: says) talks - talk + describe -> describing (sim=0.448, correct: describes)
65. Accuracy on the analogy task
Notes
Semantic analogies are the categories that do not start with gram; syntactic analogies are the ones that do.
Both come out at a little over 70% accuracy. I don't know whether the difference is significant, but the syntactic analogies score slightly higher.
cor_sem = 0  # number of correct answers in semantic analogies
tot_sem = 0  # total number of semantic analogies
cor_syn = 0  # number of correct answers in syntactic analogies
tot_syn = 0  # total number of syntactic analogies
for key, vals in analogies.items():
    if key[:4] == "gram":
        tot_syn += len(vals)
        cor_syn += len(list(filter(lambda v: v[3]==v[5], vals)))
    else:
        tot_sem += len(vals)
        cor_sem += len(list(filter(lambda v: v[3]==v[5], vals)))
# output
print(f"accuracy for semantic analogies: {cor_sem/tot_sem}")
print(f"accuracy for syntactic analogies: {cor_syn/tot_syn}")
accuracy for semantic analogies: 0.7308602999210734
accuracy for syntactic analogies: 0.7400468384074942
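As a cross-check (not done in the original), gensim has a built-in evaluator for this dataset. Note that it restricts the vocabulary by default (restrict_vocab=300000), so the numbers will not match the manual computation exactly:

# gensim's built-in analogy evaluation; sections holds per-category results.
score, sections = model.evaluate_word_analogies("questions-words.txt")
print(score)  # overall accuracy
for sec in sections:
    print(sec["section"], len(sec["correct"]), len(sec["incorrect"]))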
66. Evaluation on WordSimilarity-353
Notes
The Spearman correlation coefficient is a (non-parametric) measure of correlation computed from rank data.
With $ N $ pairs of $ X, Y $ and $ D $ the rank difference between corresponding $ X $ and $ Y $, the Spearman correlation $ \rho _ {XY} $ can be computed (assuming no tied ranks) as
$ \rho _ {XY} = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $.
It can be computed with scipy.stats.spearmanr.
Unpacking the zip gives combined.tab, a tab-separated file of word 1, word 2, and (human-rated) similarity; the first line is a header.
from scipy.stats import spearmanr

scores_human = []
scores_vector = []
with open("wordsim353/combined.tab") as f:
    for line in f.readlines()[1:]:
        w1, w2, v = line.strip().split()
        scores_human.append(float(v))
        scores_vector.append(model.similarity(w1, w2))
rho, _ = spearmanr(scores_human, scores_vector)
print(rho)
0.7000166486272194
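To tie the formula above back to the code, a small sketch that computes $ \rho $ from the rank differences and compares it with spearmanr (the two agree up to how ties are handled):

import numpy as np
from scipy.stats import rankdata, spearmanr

# Spearman's rho from the rank-difference formula: rho = 1 - 6*sum(D^2) / (N*(N^2-1)).
d = rankdata(scores_human) - rankdata(scores_vector)
n = len(scores_human)
print(1 - 6 * np.sum(d**2) / (n * (n**2 - 1)))

rho, _ = spearmanr(scores_human, scores_vector)
print(rho)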
67. k-means clustering
Notes
From the list of UN member states, I replaced spaces with underscores and saved the result as countries.txt. "United States of America" is not "United_States", but for now I follow the UN notation. Parenthesized parts such as in "Bolivia (Plurinational State of)" are ignored.
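A hypothetical sketch of the preprocessing described above (the input file name member_states_raw.txt is made up; it is assumed to hold one UN member-state name per line):

# Build countries.txt: drop parenthesized parts, replace spaces with underscores.
with open("member_states_raw.txt") as fin, open("countries.txt", "w") as fout:
    for line in fin:
        name = line.strip().split("(")[0].strip()  # ignore e.g. "(Plurinational State of)"
        if name:
            fout.write(name.replace(" ", "_") + "\n")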
from sklearn.cluster import KMeans
from collections import defaultdict

countries = []
vectors = []
with open("countries.txt") as f:
    for line in f.readlines():
        name = line.strip()
        if name in model:
            countries.append(name)
            vectors.append(model[name])
kmeans = KMeans(n_clusters=5).fit(vectors)

# output
clusters = defaultdict(list)
for c, l in zip(countries, kmeans.labels_):
    clusters[l].append(c)
for i, v in enumerate(clusters.values()):
    print(f"** cluster-{i}: {v}")
** cluster-0: ['Afghanistan', 'Australia', 'Bahrain', 'Bangladesh', 'Bhutan', 'Brunei_Darussalam', 'Cambodia', 'China', 'Egypt', 'India', 'Indonesia', 'Iran', 'Iraq', 'Israel', 'Japan', 'Jordan', 'Kazakhstan', 'Kuwait', 'Kyrgyzstan', 'Lebanon', 'Libya', 'Malaysia', 'Mongolia', 'Morocco', 'Myanmar', 'Nepal', 'Oman', 'Pakistan', 'Qatar', 'Saudi_Arabia', 'Singapore', 'Sri_Lanka', 'Tajikistan', 'Thailand', 'Turkmenistan', 'United_Arab_Emirates', 'Uzbekistan', 'Viet_Nam', 'Yemen']
** cluster-1: ['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bulgaria', 'Canada', 'Croatia', 'Cyprus', 'Czech_Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'San_Marino', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'Ukraine']
** cluster-2: ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina_Faso', 'Burundi', 'Cameroon', 'Chad', 'Comoros', 'Congo', 'Djibouti', 'Equatorial_Guinea', 'Eritrea', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Guinea', 'Guinea_Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mozambique', 'Namibia', 'Niger', 'Nigeria', 'Rwanda', 'Senegal', 'Somalia', 'South_Africa', 'Sudan', 'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe']
** cluster-3: ['Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia', 'Brazil', 'Cabo_Verde', 'Chile', 'Colombia', 'Costa_Rica', 'Cuba', 'Dominica', 'Dominican_Republic', 'Ecuador', 'El_Salvador', 'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Suriname', 'Uruguay', 'Venezuela']
** cluster-4: ['Fiji', 'Kiribati', 'Maldives', 'Marshall_Islands', 'Mauritius', 'Micronesia', 'Nauru', 'New_Zealand', 'Palau', 'Saint_Lucia', 'Samoa', 'Seychelles', 'Solomon_Islands', 'Tonga', 'Tuvalu', 'Vanuatu']
68. Clustering with Ward's method
Notes
In scikit-learn's AgglomerativeClustering (agglomerative, i.e. bottom-up clustering), the default linkage criterion between clusters is Ward.
I followed, or rather pretty much copy-pasted, scikit-learn's dendrogram plotting example.
For the argument of scipy's dendrogram (the linkage matrix Z), see linkage.
It is a list with one 4-component row per merge (non-leaf cluster); the 1st and 2nd components are the cluster indices of the two child nodes.
Cluster indices are assigned with the leaf nodes first, followed by the nodes that have children.
The 3rd component is the distance between the child nodes, and the 4th is the number of leaf nodes (here, countries) in the cluster.
AgglomerativeClustering.children_ only holds the 1st and 2nd components, so the plot_dendrogram function builds the linkage matrix using AgglomerativeClustering.distances_ as well.
Japan and China end up close together (the green zone) and are then merged with Mongolia and the Southeast Asian countries, which feels about right*1. South Korea, which I would expect near Japan, does not appear, presumably because its formal name Republic_of_Korea is not in the model. The US (United_States_of_America) is missing for what I assume is the same reason, since I did not rename it to United_States. A proper analysis would need this kind of name reconciliation, but I'll stop here.
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

# compute_distances=True makes distances_ available for the linkage matrix.
clustering = AgglomerativeClustering(linkage="ward", compute_distances=True).fit(vectors)
plt.figure(figsize=(18, 28))
plt.title("Hierarchical Clustering Dendrogram")
# plot the full dendrogram with countries as leaf labels
plot_dendrogram(clustering, orientation="left", labels=countries, leaf_font_size=10)
plt.show()
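Alternatively (not what the original does), scipy can compute the Ward linkage matrix directly from the vectors, skipping the conversion from AgglomerativeClustering:

from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage matrix straight from the observation vectors; same Z format as above.
Z = linkage(np.array(vectors), method="ward")
plt.figure(figsize=(18, 28))
dendrogram(Z, orientation="left", labels=countries, leaf_font_size=10)
plt.show()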
69. Visualization with t-SNE
Notes
SNE stands for Stochastic Neighbor Embedding; t-SNE is the variant that uses a t-distribution. It is a dimensionality-reduction method that embeds points into a low-dimensional space so that similar points end up close to each other.
The colors follow the k-means cluster assignments. The labels were hard to read, so I outlined the text, following the reference below.
import matplotlib.patheffects as patheffects
from sklearn.manifold import TSNE

colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple',]

X_embedded = TSNE(n_components=2).fit_transform(np.array(vectors))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=3, marker='x', color="gray")
plt.axis("off")
for i, country in enumerate(countries):
    plt.annotate(country, (X_embedded[i, 0], X_embedded[i, 1]),
                 ha="center", size=6, color=colors[kmeans.labels_[i]],
                 path_effects=[patheffects.withStroke(linewidth=2,
                                                      foreground='whitesmoke',
                                                      capstyle="round")])
plt.show()
*1: Not visible in the figure.