
Notes on 100 Language Processing Knock 2020 (Rev 2), in Python 3.11

Chapter 7: Word Vectors

Write programs that perform the following operations on word vectors (word embeddings), which represent word meanings as real-valued vectors.

Notes

Install gensim with `pip install gensim`.

I got `ImportError: cannot import name 'triu' from 'scipy.linalg'`, so, following a post on that error, I pinned SciPy to v1.10.1.

60. Loading and displaying word vectors

Download the [pre-trained word vectors](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) (3 million words and phrases, 300 dimensions) trained on the Google News dataset (about 100 billion words), and display the word vector for "United States". Note that "United States" is internally represented as "United_States".

Notes

Load the vectors with gensim's `KeyedVectors` (word2vec binary format), as in the code below.

import gensim

# Load the pre-trained Google News vectors (word2vec binary format).
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(model["United_States"])
[-3.61328125e-02 -4.83398438e-02  2.35351562e-01  1.74804688e-01
 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02
  1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02
  1.74804688e-01 -7.71484375e-02  2.58789062e-02 -7.66601562e-02
 -3.80859375e-02  1.35742188e-01  3.75976562e-02 -4.19921875e-02
 -3.56445312e-02  5.34667969e-02  3.68118286e-04 -1.66992188e-01
 -1.17187500e-01  1.41601562e-01 -1.69921875e-01 -6.49414062e-02
 -1.66992188e-01  1.00585938e-01  1.15722656e-01 -2.18750000e-01
 -9.86328125e-02 -2.56347656e-02  1.23046875e-01 -3.54003906e-02
 -1.58203125e-01 -1.60156250e-01  2.94189453e-02  8.15429688e-02
  6.88476562e-02  1.87500000e-01  6.49414062e-02  1.15234375e-01
 -2.27050781e-02  3.32031250e-01 -3.27148438e-02  1.77734375e-01
 -2.08007812e-01  4.54101562e-02 -1.23901367e-02  1.19628906e-01
  7.44628906e-03 -9.03320312e-03  1.14257812e-01  1.69921875e-01
 -2.38281250e-01 -2.79541016e-02 -1.21093750e-01  2.47802734e-02
  7.71484375e-02 -2.81982422e-02 -4.71191406e-02  1.78222656e-02
 -1.23046875e-01 -5.32226562e-02  2.68554688e-02 -3.11279297e-02
 -5.59082031e-02 -5.00488281e-02 -3.73535156e-02  1.25976562e-01
  5.61523438e-02  1.51367188e-01  4.29687500e-02 -2.08007812e-01
 -4.78515625e-02  2.78320312e-02  1.81640625e-01  2.20703125e-01
 -3.61328125e-02 -8.39843750e-02 -3.69548798e-05 -9.52148438e-02
 -1.25000000e-01 -1.95312500e-01 -1.50390625e-01 -4.15039062e-02
  1.31835938e-01  1.17675781e-01  1.91650391e-02  5.51757812e-02
 -9.42382812e-02 -1.08886719e-01  7.32421875e-02 -1.15234375e-01
  8.93554688e-02 -1.40625000e-01  1.45507812e-01  4.49218750e-02
 -1.10473633e-02 -1.62353516e-02  4.05883789e-03  3.75976562e-02
 -6.98242188e-02 -5.46875000e-02  2.17285156e-02 -9.47265625e-02
  4.24804688e-02  1.81884766e-02 -1.73339844e-02  4.63867188e-02
 -1.42578125e-01  1.99218750e-01  1.10839844e-01  2.58789062e-02
 -7.08007812e-02 -5.54199219e-02  3.45703125e-01  1.61132812e-01
 -2.44140625e-01 -2.59765625e-01 -9.71679688e-02  8.00781250e-02
 -8.78906250e-02 -7.22656250e-02  1.42578125e-01 -8.54492188e-02
 -3.18359375e-01  8.30078125e-02  6.34765625e-02  1.64062500e-01
 -1.92382812e-01 -1.17675781e-01 -5.41992188e-02 -1.56250000e-01
 -1.21582031e-01 -4.95605469e-02  1.20117188e-01 -3.83300781e-02
  5.51757812e-02 -8.97216797e-03  4.32128906e-02  6.93359375e-02
  8.93554688e-02  2.53906250e-01  1.65039062e-01  1.64062500e-01
 -1.41601562e-01  4.58984375e-02  1.97265625e-01 -8.98437500e-02
  3.90625000e-02 -1.51367188e-01 -8.60595703e-03 -1.17675781e-01
 -1.97265625e-01 -1.12792969e-01  1.29882812e-01  1.96289062e-01
  1.56402588e-03  3.93066406e-02  2.17773438e-01 -1.43554688e-01
  6.03027344e-02 -1.35742188e-01  1.16210938e-01 -1.59912109e-02
  2.79296875e-01  1.46484375e-01 -1.19628906e-01  1.76757812e-01
  1.28906250e-01 -1.49414062e-01  6.93359375e-02 -1.72851562e-01
  9.22851562e-02  1.33056641e-02 -2.00195312e-01 -9.76562500e-02
 -1.65039062e-01 -2.46093750e-01 -2.35595703e-02 -2.11914062e-01
  1.84570312e-01 -1.85546875e-02  2.16796875e-01  5.05371094e-02
  2.02636719e-02  4.25781250e-01  1.28906250e-01 -2.77099609e-02
  1.29882812e-01 -1.15722656e-01 -2.05078125e-02  1.49414062e-01
  7.81250000e-03 -2.05078125e-01 -8.05664062e-02 -2.67578125e-01
 -2.29492188e-02 -8.20312500e-02  8.64257812e-02  7.61718750e-02
 -3.66210938e-02  5.22460938e-02 -1.22070312e-01 -1.44042969e-02
 -2.69531250e-01  8.44726562e-02 -2.52685547e-02 -2.96630859e-02
 -1.68945312e-01  1.93359375e-01 -1.08398438e-01  1.94091797e-02
 -1.80664062e-01  1.93359375e-01 -7.08007812e-02  5.85937500e-02
 -1.01562500e-01 -1.31835938e-01  7.51953125e-02 -7.66601562e-02
  3.37219238e-03 -8.59375000e-02  1.25000000e-01  2.92968750e-02
  1.70898438e-01 -9.37500000e-02 -1.09375000e-01 -2.50244141e-02
  2.11914062e-01 -4.44335938e-02  6.12792969e-02  2.62451172e-02
 -1.77734375e-01  1.23046875e-01 -7.42187500e-02 -1.67968750e-01
 -1.08886719e-01 -9.04083252e-04 -7.37304688e-02  5.49316406e-02
  6.03027344e-02  8.39843750e-02  9.17968750e-02 -1.32812500e-01
  1.22070312e-01 -8.78906250e-03  1.19140625e-01 -1.94335938e-01
 -6.64062500e-02 -2.07031250e-01  7.37304688e-02  8.93554688e-02
  1.81884766e-02 -1.20605469e-01 -2.61230469e-02  2.67333984e-02
  7.76367188e-02 -8.30078125e-02  6.78710938e-02 -3.54003906e-02
  3.10546875e-01 -2.42919922e-02 -1.41601562e-01 -2.08007812e-01
 -4.57763672e-03 -6.54296875e-02 -4.95605469e-02  2.22656250e-01
  1.53320312e-01 -1.38671875e-01 -5.24902344e-02  4.24804688e-02
 -2.38281250e-01  1.56250000e-01  5.83648682e-04 -1.20605469e-01
 -9.22851562e-02 -4.44335938e-02  3.61328125e-02 -1.86767578e-02
 -8.25195312e-02 -8.25195312e-02 -4.05273438e-02  1.19018555e-02
  1.69921875e-01 -2.80761719e-02  3.03649902e-03  9.32617188e-02
 -8.49609375e-02  1.57470703e-02  7.03125000e-02  1.62353516e-02
 -2.27050781e-02  3.51562500e-02  2.47070312e-01 -2.67333984e-02]
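A quick sanity check (a sketch assuming the model loaded above): the vector is 300-dimensional, and only the underscore form is registered as a key.

print(model["United_States"].shape)  # (300,)
print("United_States" in model)      # True
print("United States" in model)      # False: the space-separated form is not a key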

61. Word similarity

Compute the cosine similarity between "United States" and "U.S.".

Notes

Compute the cosine similarity $ C $ with `similarity`. Computing the definition below directly gives the same result.

$$ \begin{equation*} C = \frac{v_1 \cdot v_2}{\| v_1 \| \, \| v_2 \|} \end{equation*} $$
print(model.similarity("United_States", "U.S."))
0.73107743
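As a check, a minimal sketch that evaluates the definition above directly (assuming `model` is already loaded); it should reproduce the value printed by `similarity` up to floating-point precision.

import numpy as np

# Cosine similarity from the definition: dot product divided by the product of the norms.
v1 = model["United_States"]
v2 = model["U.S."]
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # ~0.7310774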

62. The 10 most similar words

Output the 10 words with the highest cosine similarity to "United States", together with their similarities.

Notes

`most_similar` returns words in descending order of cosine similarity. The argument can be either a vector or a word (key). When a key is given, the input word itself ("United_States" here) is excluded from the output.

Misspellings such as Unites_States and Untied_States are returned as highly similar words. Since they appear in the same contexts that is perhaps to be expected, but they score even higher than U.S...

import numpy as np

v = model["United_States"]
v = v/np.linalg.norm(v) # normalization

print("----- input:vector -----")
for item, value in model.most_similar(v, topn=10):
    print(f"{item}  {value}")

print("----- input:key -----")
for item, value in model.most_similar('United_States', topn=10):
    print(f"{item}  {value}")
----- input:vector -----
United_States  1.0
Unites_States  0.7877248525619507
Untied_States  0.7541370987892151
United_Sates  0.7400724291801453
U.S.  0.7310774326324463
theUnited_States  0.6404393911361694
America  0.6178410053253174
UnitedStates  0.6167312264442444
Europe  0.6132988929748535
countries  0.6044804453849792
----- input:key -----
Unites_States  0.7877248525619507
Untied_States  0.7541370987892151
United_Sates  0.7400724291801453
U.S.  0.7310774326324463
theUnited_States  0.6404393911361694
America  0.6178410053253174
UnitedStates  0.6167312264442444
Europe  0.6132988929748535
countries  0.6044804453849792
Canada  0.601906955242157

63. Analogy by additive compositionality

Compute the vector obtained by subtracting the "Madrid" vector from the "Spain" word vector and adding the "Athens" vector, then output the 10 words most similar to that vector, together with their similarities.

Notes

Adding and subtracting the vectors before normalization gives slightly different results from doing so after normalization. Passing `positive` and `negative` key lists to `most_similar` reproduces the case where the vectors are normalized before the arithmetic. Note that the words given in the `positive`/`negative` lists ("Spain", "Madrid", and "Athens" here) are excluded from the output.

Aristeidis Grigoriadis, who appears near the top, is a Greek swimmer.

v1 = model["Spain"]
v2 = model["Madrid"]
v3 = model["Athens"]
pos = ["Spain", "Athens",]
neg = ["Madrid",]

print("----- input:added/subtracted vector(un-normalized) -----")
for item, value in model.most_similar(v1 - v2 + v3, topn=10):
    print(f"{item} {value}")

print("----- input:added/subtracted vector(normalized) -----")
# normalization
v1 = v1/np.linalg.norm(v1)
v2 = v2/np.linalg.norm(v2)
v3 = v3/np.linalg.norm(v3)
for item, value in model.most_similar(v1 - v2 + v3, topn=10):
    print(f"{item} {value}")

print("----- keys -----")
for item, value in model.most_similar(positive=pos, negative=neg, topn=10):
    print(f"{item} {value}")
----- input:added/subtracted vector(un-normalized) -----
Athens 0.7528455853462219
Greece 0.6685472130775452
Aristeidis_Grigoriadis 0.5495778322219849
Ioannis_Drymonakos 0.5361457467079163
Greeks 0.5351786017417908
Ioannis_Christou 0.5330225825309753
Hrysopiyi_Devetzi 0.5088489055633545
Iraklion 0.5059264302253723
Greek 0.5040615797042847
Athens_Greece 0.5034108757972717
----- input:added/subtracted vector(normalized) -----
Athens 0.7548093199729919
Greece 0.6898480653762817
Aristeidis_Grigoriadis 0.560684859752655
Ioannis_Drymonakos 0.555290937423706
Greeks 0.545068621635437
Ioannis_Christou 0.5400862097740173
Hrysopiyi_Devetzi 0.5248445272445679
Heraklio 0.5207759737968445
Athens_Greece 0.516880989074707
Lithuania 0.5166865587234497
----- keys -----
Greece 0.6898480653762817
Aristeidis_Grigoriadis 0.560684859752655
Ioannis_Drymonakos 0.5552908778190613
Greeks 0.545068621635437
Ioannis_Christou 0.5400862097740173
Hrysopiyi_Devetzi 0.5248445272445679
Heraklio 0.5207759737968445
Athens_Greece 0.516880989074707
Lithuania 0.5166865587234497
Iraklion 0.5146791338920593

64. Experiments with analogy data

Download the [word analogy evaluation data](https://download.tensorflow.org/data/questions-words.txt), compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector along with its similarity. Append the obtained word and similarity to the end of each example.

Notes

The output below does not show it, but the added/subtracted vector is also computed.

Lines starting with `:` are header lines indicating the analogy category: capital-country analogies, family-relation analogies, and so on. Grammatical categories such as adjective-adverb analogies start with gram (for grammar).

analogies = dict()
buff = []
key = None

with open("questions-words.txt") as f:
    for line in f.readlines():
        ws = line.strip().split()
        if ws[0] == ":":
            if key:
                analogies[key] = buff
                buff = []
            key = ws[1]
        else:
            v1 = model[ws[0]]
            v2 = model[ws[1]]
            v3 = model[ws[2]]

            # normalization
            v1 = v1/np.linalg.norm(v1)
            v2 = v2/np.linalg.norm(v2)
            v3 = v3/np.linalg.norm(v3)

            v0 = v2 - v1 + v3

            rst = model.most_similar(positive=[ws[1], ws[2]], negative=[ws[0]], topn=1)
            pred, sim = rst[0]
            buff.append([*ws, v0, pred, sim])

    analogies[key] = buff

# output
for key, vals in analogies.items():
    print(f"* {key}")
    vals_sorted = sorted(vals, key=lambda v: v[6], reverse=True) # sort by similarity

    for v in vals_sorted[:5]:
        print(f"{v[1]} - {v[0]} + {v[2]} -> {v[5]} (sim={v[6]:.3f}, correct: {v[3]})")
    print("...")
    for v in vals_sorted[-5:]:
        print(f"{v[1]} - {v[0]} + {v[2]} -> {v[5]} (sim={v[6]:.3f}, correct: {v[3]})")
* capital-common-countries
Japan - Tokyo + Moscow -> Russia (sim=0.886, correct: Russia)
Russia - Moscow + Tehran -> Iran (sim=0.878, correct: Iran)
Russia - Moscow + Tokyo -> Japan (sim=0.877, correct: Japan)
Russia - Moscow + Beijing -> China (sim=0.869, correct: China)
Cuba - Havana + Tehran -> Iran (sim=0.864, correct: Iran)
...
Japan - Tokyo + Bern -> Switzerland (sim=0.483, correct: Switzerland)
China - Beijing + Bern -> Bern_NC (sim=0.466, correct: Switzerland)
England - London + Bern -> Hanover (sim=0.453, correct: Switzerland)
Iraq - Baghdad + Bern -> coach_Bobby_Curlings (sim=0.435, correct: Switzerland)
Afghanistan - Kabul + Bern -> Bern_NC (sim=0.423, correct: Switzerland)
* capital-world
Ukraine - Kiev + Moscow -> Russia (sim=0.888, correct: Russia)
Russia - Moscow + Tehran -> Iran (sim=0.878, correct: Iran)
Russia - Moscow + Tokyo -> Japan (sim=0.877, correct: Japan)
Armenia - Yerevan + Baku -> Azerbaijan (sim=0.867, correct: Azerbaijan)
Philippines - Manila + Moscow -> Russia (sim=0.861, correct: Russia)
...
Madagascar - Antananarivo + Bern -> coach_Bobby_Curlings (sim=0.420, correct: Switzerland)
Jordan - Amman + Antananarivo -> Mohamed_Berte (sim=0.413, correct: Madagascar)
Tuvalu - Funafuti + Kingston -> Jamaica (sim=0.407, correct: Jamaica)
Libya - Tripoli + Bern -> Switzerland (sim=0.402, correct: Switzerland)
Greenland - Nuuk + Algiers -> Gulf (sim=0.390, correct: Algeria)
* currency
baht - Thailand + Malaysia -> RM## (sim=0.822, correct: ringgit)
ringgit - Malaysia + Thailand -> baht (sim=0.820, correct: baht)
baht - Thailand + Bulgaria -> leva (sim=0.814, correct: lev)
baht - Thailand + Nigeria -> naira (sim=0.811, correct: naira)
ringgit - Malaysia + Nigeria -> naira (sim=0.806, correct: naira)
...
dong - Vietnam + USA -> lifts_Squaw_Valley (sim=0.376, correct: dollar)
lev - Bulgaria + USA -> lifts_Squaw_Valley (sim=0.376, correct: dollar)
riel - Cambodia + USA -> lifts_Squaw_Valley (sim=0.374, correct: dollar)
dram - Armenia + USA -> Tennent_lager (sim=0.373, correct: dollar)
rial - Iran + USA -> V_Singh_Fij (sim=0.346, correct: dollar)
* city-in-state
Nebraska - Omaha + Wichita -> Kansas (sim=0.825, correct: Kansas)
Michigan - Detroit + Milwaukee -> Wisconsin (sim=0.824, correct: Wisconsin)
Illinois - Chicago + Louisville -> Kentucky (sim=0.819, correct: Kentucky)
Florida - Tampa + Honolulu -> Hawaii (sim=0.817, correct: Hawaii)
Hawaii - Honolulu + Anchorage -> Alaska (sim=0.809, correct: Alaska)
...
Washington - Spokane + Mesa -> Piñon (sim=0.424, correct: Arizona)
Georgia - Atlanta + Irving -> Marilyn_Flax_whose (sim=0.420, correct: Texas)
Washington - Tacoma + Mesa -> Arizona (sim=0.420, correct: Arizona)
Washington - Tacoma + Fontana -> WUSM_SL (sim=0.392, correct: California)
Washington - Spokane + Fontana -> Gus_Guida (sim=0.385, correct: California)
* family
her - his + he -> she (sim=0.944, correct: she)
she - he + his -> her (sim=0.941, correct: her)
granddaughter - grandson + son -> daughter (sim=0.937, correct: daughter)
daughter - son + grandson -> granddaughter (sim=0.937, correct: granddaughter)
daughters - sons + son -> daughter (sim=0.932, correct: daughter)
...
mom - dad + stepbrother -> mother (sim=0.548, correct: stepsister)
policewoman - policeman + stepbrother -> stepfather (sim=0.546, correct: stepsister)
stepsister - stepbrother + groom -> bride (sim=0.544, correct: bride)
stepdaughter - stepson + stepbrother -> niece (sim=0.542, correct: stepsister)
stepmother - stepfather + stepbrother -> sister (sim=0.521, correct: stepsister)
* gram1-adjective-to-adverb
quickly - quick + swift -> swiftly (sim=0.770, correct: swiftly)
fortunately - fortunate + lucky -> luckily (sim=0.767, correct: luckily)
quickly - quick + rapid -> rapidly (sim=0.762, correct: rapidly)
luckily - lucky + fortunate -> fortunately (sim=0.760, correct: fortunately)
usually - usual + typical -> typically (sim=0.753, correct: typically)
...
happily - happy + possible -> humanly_possible (sim=0.390, correct: possibly)
freely - free + possible -> deliberatively (sim=0.385, correct: possibly)
mostly - most + possible -> mainly (sim=0.376, correct: possibly)
freely - free + immediate -> immediately (sim=0.373, correct: immediately)
furiously - furious + immediate -> frantically (sim=0.368, correct: immediately)
* gram2-opposite
inefficient - efficient + productive -> unproductive (sim=0.738, correct: unproductive)
uncomfortable - comfortable + pleasant -> unpleasant (sim=0.737, correct: unpleasant)
illogical - logical + rational -> irrational (sim=0.734, correct: irrational)
unreasonable - reasonable + rational -> irrational (sim=0.722, correct: irrational)
unsure - sure + aware -> unaware (sim=0.714, correct: unaware)
...
impossible - possible + informed -> informing (sim=0.390, correct: uninformed)
impossibly - possibly + responsible -> solely_responsible (sim=0.389, correct: irresponsible)
undecided - decided + possible -> persuadable (sim=0.387, correct: impossible)
uninformative - informative + certain -> predetermined (sim=0.387, correct: uncertain)
uninformative - informative + possible -> humanly_possible (sim=0.387, correct: impossible)
* gram3-comparative
larger - large + big -> bigger (sim=0.848, correct: bigger)
stronger - strong + hard -> harder (sim=0.845, correct: harder)
tighter - tight + tough -> tougher (sim=0.840, correct: tougher)
harder - hard + tough -> tougher (sim=0.834, correct: tougher)
bigger - big + large -> larger (sim=0.834, correct: larger)
...
worse - bad + short -> shorter (sim=0.458, correct: shorter)
newer - new + low -> high (sim=0.453, correct: lower)
worse - bad + new -> thenew (sim=0.452, correct: newer)
greater - great + old -> yearold (sim=0.427, correct: older)
newer - new + wide -> scatter_bomblets (sim=0.397, correct: wider)
* gram4-superlative
strangest - strange + weird -> weirdest (sim=0.827, correct: weirdest)
weirdest - weird + strange -> strangest (sim=0.824, correct: strangest)
strongest - strong + sharp -> sharpest (sim=0.818, correct: sharpest)
lowest - low + high -> highest (sim=0.814, correct: highest)
highest - high + low -> lowest (sim=0.805, correct: lowest)
...
tallest - tall + short -> shortest (sim=0.455, correct: shortest)
oldest - old + fast -> fastest (sim=0.449, correct: fastest)
warmest - warm + quick -> quickest (sim=0.444, correct: quickest)
youngest - young + quick -> quickest (sim=0.444, correct: quickest)
oldest - old + quick -> easiest (sim=0.426, correct: quickest)
* gram5-present-participle
decreasing - decrease + increase -> increasing (sim=0.885, correct: increasing)
implementing - implement + enhance -> enhancing (sim=0.878, correct: enhancing)
increasing - increase + decrease -> decreasing (sim=0.874, correct: decreasing)
singing - sing + swim -> swimming (sim=0.869, correct: swimming)
enhancing - enhance + generate -> generating (sim=0.860, correct: generating)
...
thinking - think + go -> Going (sim=0.471, correct: going)
coding - code + look -> looking (sim=0.461, correct: looking)
thinking - think + say -> Say (sim=0.447, correct: saying)
coding - code + go -> goes (sim=0.443, correct: going)
saying - say + look -> looks (sim=0.440, correct: looking)
* gram6-nationality-adjective
Japanese - Japan + China -> Chinese (sim=0.928, correct: Chinese)
Chinese - China + Japan -> Japanese (sim=0.927, correct: Japanese)
German - Germany + Italy -> Italian (sim=0.926, correct: Italian)
Russian - Russia + Bulgaria -> Bulgarian (sim=0.919, correct: Bulgarian)
Italian - Italy + Germany -> German (sim=0.917, correct: German)
...
English - England + Albania -> Macedonian (sim=0.534, correct: Albanian)
English - England + Austria -> Austrian (sim=0.533, correct: Austrian)
Albanian - Albania + England -> stock_symbol_BNK (sim=0.530, correct: English)
Argentinean - Argentina + England -> ticker_symbol_BNK (sim=0.514, correct: English)
Belorussian - Belarus + England -> stock_symbol_BNK (sim=0.514, correct: English)
* gram7-past-tense
increased - increasing + decreasing -> decreased (sim=0.900, correct: decreased)
decreased - decreasing + increasing -> increased (sim=0.886, correct: increased)
danced - dancing + singing -> sang (sim=0.866, correct: sang)
saw - seeing + taking -> took (sim=0.848, correct: took)
sang - singing + screaming -> screamed (sim=0.847, correct: screamed)
...
fed - feeding + saying -> implying (sim=0.459, correct: said)
went - going + knowing -> hid (sim=0.458, correct: knew)
decreased - decreasing + saying -> stating (sim=0.457, correct: said)
implemented - implementing + saying -> stating (sim=0.455, correct: said)
spent - spending + looking -> look (sim=0.443, correct: looked)
* gram8-plural
horses - horse + dog -> dogs (sim=0.931, correct: dogs)
dogs - dog + horse -> horses (sim=0.926, correct: horses)
cats - cat + dog -> dogs (sim=0.909, correct: dogs)
dogs - dog + cat -> cats (sim=0.898, correct: cats)
cats - cat + horse -> horses (sim=0.883, correct: horses)
...
monkeys - monkey + hand -> hands (sim=0.449, correct: hands)
dollars - dollar + lion -> lions (sim=0.445, correct: lions)
pineapples - pineapple + hand -> hands (sim=0.432, correct: hands)
dollars - dollar + mouse -> Logitech_MX_Revolution (sim=0.432, correct: mice)
mice - mouse + hand -> hands (sim=0.429, correct: hands)
* gram9-plural-verbs
increases - increase + decrease -> decreases (sim=0.878, correct: decreases)
decreases - decrease + increase -> increases (sim=0.878, correct: increases)
provides - provide + generate -> generates (sim=0.862, correct: generates)
generates - generate + provide -> provides (sim=0.852, correct: provides)
provides - provide + enhance -> enhances (sim=0.842, correct: enhances)
...
implements - implement + say -> argue (sim=0.472, correct: says)
talks - talk + see -> negotiations (sim=0.471, correct: sees)
listens - listen + say -> believe (sim=0.464, correct: says)
works - work + say -> argue (sim=0.458, correct: says)
talks - talk + describe -> describing (sim=0.448, correct: describes)

65. Accuracy on the analogy task

Using the results of problem 64, measure the accuracy of the semantic analogies and of the syntactic analogies.

Notes

Semantic analogies are the categories whose names do not start with gram; syntactic analogies are the ones that do.

Both come out at a bit over 70% accuracy. I don't know whether the difference is significant, but the syntactic analogies score slightly higher.

cor_sem = 0 # number of correct answers for semantic analogies
tot_sem = 0 # total number of semantic analogies
cor_syn = 0 # number of correct answers for syntactic analogies
tot_syn = 0 # total number of syntactic analogies

for key, vals in analogies.items():
    if key[:4] == "gram":
        tot_syn += len(vals)
        cor_syn += len(list(filter(lambda v: v[3]==v[5], vals)))
    else:
        tot_sem += len(vals)
        cor_sem += len(list(filter(lambda v: v[3]==v[5], vals)))

# output
print(f"accuracy for semantic analogies: {cor_sem/tot_sem}")
print(f"accuracy for syntactic analogies: {cor_syn/tot_syn}")
accuracy for semantic analogies: 0.7308602999210734
accuracy for syntactic analogies: 0.7400468384074942
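As a cross-check (a sketch, not the computation used above), gensim's KeyedVectors also has a built-in evaluator for this file format, `evaluate_word_analogies`. Its defaults (restrict_vocab=300000, case-insensitive matching) differ from the calculation above, so the numbers will not match exactly.

# Cross-check with gensim's built-in analogy evaluator.
score, sections = model.evaluate_word_analogies("questions-words.txt")
print(f"overall accuracy: {score:.4f}")
for sec in sections:
    total = len(sec["correct"]) + len(sec["incorrect"])
    if total > 0:
        print(f"{sec['section']}: {len(sec['correct']) / total:.4f}")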

66. Evaluation on WordSimilarity-353

Download the evaluation data of [The WordSimilarity-353 Test Collection](https://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html), and compute the Spearman correlation coefficient between the ranking given by word-vector similarity and the ranking given by human similarity judgments.

Notes

The Spearman correlation coefficient is a (non-parametric) measure of correlation computed from rank data.

With $ N $ the number of $ X, Y $ pairs and $ D $ the rank difference between corresponding $ X $ and $ Y $, the Spearman correlation coefficient $ \rho _ {XY} $ can be computed as:

$$ \begin{equation*} \rho_{XY} = 1 - \frac{6\sum D^2}{N^3-N} \end{equation*} $$

It can be computed with scipy.stats.spearmanr.

combined.tab, obtained by extracting the zip archive, contains word 1, word 2, and the (human-judged) similarity, separated by tabs. The first line is a header.

from scipy.stats import spearmanr

scores_human = []
scores_vector = []

with open("wordsim353/combined.tab") as f:
    for line in f.readlines()[1:]:
        w1, w2, v = line.strip().split()
        scores_human.append(float(v))
        scores_vector.append(model.similarity(w1, w2))

rho, _ = spearmanr(scores_human, scores_vector)
print(rho)        
0.7000166486272194
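For reference, a sketch that evaluates the formula above directly on the ranks (assuming scores_human and scores_vector from the code above). The formula is exact only when there are no tied ranks, so the result may differ slightly from scipy's value.

import numpy as np
from scipy.stats import rankdata

# Spearman's rho from the rank-difference formula.
D = rankdata(scores_human) - rankdata(scores_vector)
N = len(D)
print(1 - 6 * np.sum(D**2) / (N**3 - N))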

67. k-means clustering

Extract the word vectors for country names and run k-means clustering with the number of clusters k = 5.

Notes

From the list of UN member states, I made countries.txt, one country per line with spaces replaced by underscores. "United States of America" is not the same key as "United_States", but for now I follow the UN notation. Parenthesized parts, as in "Bolivia (Plurinational State of)", are dropped (a sketch of this conversion follows below).
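The conversion itself is simple; a minimal sketch, assuming a hypothetical file un_member_states.txt containing one UN member state name per line as it appears on the UN list:

import re

# Hypothetical input: un_member_states.txt, one country name per line.
# Drop parenthesized parts and replace spaces with underscores.
with open("un_member_states.txt") as fin, open("countries.txt", "w") as fout:
    for line in fin:
        name = re.sub(r"\s*\(.*?\)", "", line.strip())  # "Bolivia (Plurinational State of)" -> "Bolivia"
        if name:
            fout.write(name.replace(" ", "_") + "\n")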

from sklearn.cluster import KMeans
from collections import defaultdict

countries = []
vectors = []
with open("countries.txt") as f:
    for line in f.readlines():
        name = line.strip()
        if name in model:
            countries.append(name)
            vectors.append(model[name])

kmeans = KMeans(n_clusters=5).fit(vectors)

# output
clusters = defaultdict(list)
for c, l in zip(countries, kmeans.labels_):
    clusters[l].append(c) 
for i, v in enumerate(clusters.values()):
    print(f"** cluster-{i}: {v}")
** cluster-0: ['Afghanistan', 'Australia', 'Bahrain', 'Bangladesh', 'Bhutan', 'Brunei_Darussalam', 'Cambodia', 'China', 'Egypt', 'India', 'Indonesia', 'Iran', 'Iraq', 'Israel', 'Japan', 'Jordan', 'Kazakhstan', 'Kuwait', 'Kyrgyzstan', 'Lebanon', 'Libya', 'Malaysia', 'Mongolia', 'Morocco', 'Myanmar', 'Nepal', 'Oman', 'Pakistan', 'Qatar', 'Saudi_Arabia', 'Singapore', 'Sri_Lanka', 'Tajikistan', 'Thailand', 'Turkmenistan', 'United_Arab_Emirates', 'Uzbekistan', 'Viet_Nam', 'Yemen']
** cluster-1: ['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bulgaria', 'Canada', 'Croatia', 'Cyprus', 'Czech_Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'San_Marino', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'Ukraine']
** cluster-2: ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina_Faso', 'Burundi', 'Cameroon', 'Chad', 'Comoros', 'Congo', 'Djibouti', 'Equatorial_Guinea', 'Eritrea', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Guinea', 'Guinea_Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mozambique', 'Namibia', 'Niger', 'Nigeria', 'Rwanda', 'Senegal', 'Somalia', 'South_Africa', 'Sudan', 'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe']
** cluster-3: ['Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia', 'Brazil', 'Cabo_Verde', 'Chile', 'Colombia', 'Costa_Rica', 'Cuba', 'Dominica', 'Dominican_Republic', 'Ecuador', 'El_Salvador', 'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Suriname', 'Uruguay', 'Venezuela']
** cluster-4: ['Fiji', 'Kiribati', 'Maldives', 'Marshall_Islands', 'Mauritius', 'Micronesia', 'Nauru', 'New_Zealand', 'Palau', 'Saint_Lucia', 'Samoa', 'Seychelles', 'Solomon_Islands', 'Tonga', 'Tuvalu', 'Vanuatu']

68. Clustering with Ward's method

Run hierarchical clustering with Ward's method on the word vectors for country names. In addition, visualize the clustering result as a dendrogram.

Notes

The default inter-cluster distance (linkage) of scikit-learn's AgglomerativeClustering (agglomerative, i.e. bottom-up, clustering) is Ward.

The dendrogram plotting follows (is practically copy-pasted from) the scikit-learn example.

For the linkage matrix Z passed to scipy's dendrogram, see linkage. It has one row per merge, each with four components: the first two are the indices of the child clusters (leaf nodes are numbered first, and nodes with children come after them), the third is the distance between the child clusters, and the fourth is the number of leaf nodes (here, countries) under the merged cluster. AgglomerativeClustering.children_ only provides the first two components, so the plot_dendrogram function builds the linkage matrix from it together with AgglomerativeClustering.distances_ and the leaf counts.

Japan and China end up close together (the green zone) and are then merged with Mongolia and the Southeast Asian countries, which feels about right*1. South Korea, which should be close to Japan, does not appear, presumably because its formal name Republic_of_Korea is not a key in the model. The United States (United_States_of_America) is missing for the same reason, since I did not convert it to United_States. A proper analysis would need that kind of name normalization, but I'll leave it here.

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


# compute_distances=True makes distances_ available, which plot_dendrogram needs
# to build the linkage matrix
clustering = AgglomerativeClustering(linkage="ward", compute_distances=True).fit(vectors)

plt.figure(figsize=(18, 28))
plt.title("Hierarchical Clustering Dendrogram")
# plot the dendrogram with country names as leaf labels
plot_dendrogram(clustering, orientation="left", labels=countries, leaf_font_size=10)
plt.show()
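For comparison, the same Ward dendrogram can be drawn more directly with scipy's linkage, skipping the conversion from AgglomerativeClustering attributes; a minimal sketch, reusing vectors and countries from problem 67 (the merge structure should agree up to floating-point differences):

import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the Ward linkage matrix directly from the country vectors and plot it.
Z = linkage(np.array(vectors), method="ward")
plt.figure(figsize=(18, 28))
plt.title("Hierarchical Clustering Dendrogram (scipy linkage)")
dendrogram(Z, orientation="left", labels=countries, leaf_font_size=10)
plt.show()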

69. Visualization with t-SNE

Visualize the word vectors for country names with t-SNE.

Notes

SNE stands for Stochastic Neighbor Embedding; t-SNE is the variant that uses the t-distribution. It is a dimensionality reduction method that embeds points into a low-dimensional space so that similar points are placed close together.

The colors follow the k-means cluster assignments from problem 67. Since the labels were hard to read, I outlined the text, as in the code below.

import matplotlib.patheffects as patheffects
colors=['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple',]
 
from sklearn.manifold import TSNE

X__embedded = TSNE(n_components=2).fit_transform(np.array(vectors))

plt.scatter(X__embedded[:, 0], X__embedded[:, 1],
            s=3, marker='x', color="gray")
plt.axis("off")
for i, country in enumerate(countries):
    plt.annotate(country, (X__embedded[i, 0], X__embedded[i, 1]),
                 ha="center",
                 size=6,
                 color=colors[kmeans.labels_[i]],
                 path_effects=[patheffects.withStroke(linewidth=2, foreground='whitesmoke', capstyle="round")]
                 )
plt.show()

*1: Not visible in the figure.