Analogy Prediction with Embeddings

Author

Pedro Teles

Published

November 1, 2023

Introduction

This project predicts word analogies using word embeddings trained on the Text8 corpus, a cleaned extract of the English Wikipedia dump of March 3, 2006. The corpus contains 17,005,207 words and a vocabulary of 253,854 unique terms.

For evaluation, we use the questions-words dataset, which comprises 19,544 analogies across 14 categories (such as capital-common-countries, family, and currency) and 14,064 distinct words.

Because both datasets are already preprocessed, our only preprocessing step was removing stopwords from the Text8 corpus. Had the datasets been raw text, additional steps such as tokenization, conversion to lowercase, and removal of punctuation, numbers, and special characters would have been necessary.
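For raw text, the steps listed above could look like the following minimal sketch. It uses only the standard library; the `preprocess` function name and the tiny `STOP_WORDS` set are illustrative assumptions standing in for the NLTK stopword list used later in this project.

```python
import re

# Tiny illustrative stopword set (an assumption; the project itself uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip non-letter characters, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Replace digits, punctuation, and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("The capital of France is Paris (since 987)!"))
# → ['capital', 'france', 'paris', 'since']
```

Note that the character filter here keeps only ASCII letters, which suits an English corpus but would discard accented characters in other languages.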

We evaluate each analogy by the cosine similarity between the embeddings of the predicted and actual words, calculated as:

\[ \text{cosine similarity} = \frac{\mathbf{w}_1 \cdot \mathbf{w}_2}{\|\mathbf{w}_1\| \times \|\mathbf{w}_2\|} \]

Here, \(\mathbf{w}_1\) and \(\mathbf{w}_2\) are the embeddings of the predicted and actual words, respectively.
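As a quick numeric check of the formula, the sketch below evaluates it on hand-picked toy vectors using only the standard library. The `cosine` helper and the vectors are illustrative assumptions, not part of the project code: parallel vectors should score close to 1, perpendicular vectors 0, and opposite vectors close to -1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as plain sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite vectors
print(same_direction, orthogonal, opposite)
```

Because the norms cancel out vector length, the score depends only on direction, which is why scaling `[1.0, 2.0]` to `[2.0, 4.0]` leaves the similarity unchanged.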

To summarize performance, we compute the mean and standard deviation of the cosine similarity across all analogies. The final score is the mean divided by the standard deviation, so it rewards models whose predictions are both accurate (high mean similarity) and consistent (low spread).
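A minimal sketch of this score, assuming it is computed as the mean similarity divided by the standard deviation (as the tuning objective later in this document does). The `consistency_score` name and the sample values are made up for illustration; NaN stands for an analogy that could not be scored, and `pstdev` is the population standard deviation, matching NumPy's default.

```python
import math
from statistics import fmean, pstdev

def consistency_score(similarities):
    """Mean divided by population standard deviation, skipping NaN entries."""
    valid = [s for s in similarities if not math.isnan(s)]
    return fmean(valid) / pstdev(valid)

# Two made-up sets of similarities with the same mean (0.5) but different spread.
steady = [0.50, 0.52, 0.48, float("nan")]
erratic = [0.10, 0.90, 0.50, float("nan")]
print(consistency_score(steady), consistency_score(erratic))
```

Both lists average 0.5, but the steady one scores far higher because its spread is smaller. This is the trade-off the score encodes: a model can raise it either by increasing the mean or by reducing the variance.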

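We tune hyperparameters with the Optuna library, which explores combinations of values to maximize the score. The search below covers vector_size, sg, alpha, window, and epochs; the remaining parameters keep their Gensim defaults. Key Word2Vec hyperparameters, as described in the Gensim documentation, include: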

vector_size: Dimensionality of the word vectors.
sg: Training algorithm: 1 for skip-gram, 0 for CBOW.
alpha: The initial learning rate.
window: Maximum distance between the current and predicted word within a sentence.
hs: If 1, hierarchical softmax is used; if 0 and negative is non-zero, negative sampling is used.
min_count: Words with a total frequency below this threshold are ignored.
negative: Number of “noise words” drawn in negative sampling.
ns_exponent: Exponent that shapes the negative sampling distribution.
min_alpha: The learning rate drops linearly to this value as training progresses.
epochs: Number of passes over the corpus.
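To make ns_exponent more concrete: per the Gensim documentation, a value of 1.0 samples noise words in proportion to their frequency, while 0.0 samples all words equally. The sketch below illustrates that shaping on made-up counts. The `noise_distribution` helper and the toy counts are assumptions for illustration, not Gensim's internal implementation.

```python
def noise_distribution(counts, ns_exponent):
    """Normalize count ** ns_exponent into sampling probabilities."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

toy_counts = {"the": 900, "paris": 90, "zeugma": 10}  # made-up frequencies

for exponent in (1.0, 0.75, 0.0):
    probs = noise_distribution(toy_counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

Lowering the exponent from 1.0 toward 0.0 shifts probability mass from frequent words to rare ones, so rare words are drawn as negative examples more often.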

Import Libraries

import pandas as pd
import numpy as np
import multiprocessing
from nltk.corpus import stopwords

# Hyperparameter Tuning
import optuna
from optuna.visualization import (
    plot_contour,
    plot_optimization_history,
    plot_parallel_coordinate,
    plot_param_importances,
    plot_slice,
    plot_timeline,
)

# Word2Vec
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

Import and Preprocess Data

Corpus

# Text8Corpus yields the corpus as lists of tokens
text8_corpus = Text8Corpus('text8')

# Remove stopwords (requires the NLTK data: nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

filtered_sentences = [
    [w for w in sentence if w not in stop_words]
    for sentence in text8_corpus
]

Evaluation Dataset (Analogies)

# Each line holds one analogy ("a b c d"); lines starting with ":" are category headers
evaluation_dataset = pd.read_table('questions-words.txt', header=None, names=["analogy"])\
    .assign(
        analogy = lambda df: df["analogy"].str.lower(),
        count_words = lambda df: df["analogy"].apply(lambda s: len(s.split(" ")))
    )\
    .query("count_words == 4")\
    .sample(frac=1, random_state=42)\
    .reset_index(drop=True)

# Split into train and test
train_size = int(0.7 * len(evaluation_dataset))
train = evaluation_dataset[:train_size]["analogy"].to_list()
test = evaluation_dataset[train_size:]["analogy"].to_list()

Word2Vec Hyperparameter Tuning

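def cosine_similarity(x_i, x_j):
    """Cosine of the angle between two vectors."""
    return np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

def analogy_accuracy(model, analogy):
    """Cosine similarity between the predicted and the true fourth word of an analogy."""
    w1, w2, w3, w4 = analogy.split(" ")

    try:
        w1 = model.wv[w1]
        w2 = model.wv[w2]
        w3 = model.wv[w3]
        w4 = model.wv[w4]
    except KeyError:
        # At least one word is out of vocabulary; skip this analogy
        return np.nan

    # "w1 is to w2 as w3 is to w4", so w4 is estimated as w2 - w1 + w3
    w4_hat = w2 - w1 + w3

    return cosine_similarity(w4, w4_hat)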

def objective(trial: optuna.Trial) -> float:
    params = {
        "sentences": filtered_sentences,
        # Leave one core free, but never request fewer than one worker
        "workers": max(1, multiprocessing.cpu_count() - 1),
        "seed": 42,
        "vector_size": trial.suggest_int("vector_size", 50, 150),  # 300 may be better
        "sg": trial.suggest_categorical("sg", [0, 1]),
        "alpha": trial.suggest_float("alpha", 0.0001, 0.01, log=True),
        "window": trial.suggest_int("window", 2, 10),
        "epochs": trial.suggest_int("epochs", 5, 15),
        #"hs": trial.suggest_categorical("hs", [0, 1]),
        #"min_count": trial.suggest_int("min_count", 1, 5),
        #"negative": trial.suggest_int("negative", 5, 20),
        #"ns_exponent": trial.suggest_float("ns_exponent", 0.0001, 1.0, log=True),
        #"min_alpha": trial.suggest_float("min_alpha", 0.0001, 0.01, log=True),
    }

    w2v_model = Word2Vec(**params)

    similarity = [analogy_accuracy(w2v_model, analogy) for analogy in train]

    return np.nanmean(similarity) / np.nanstd(similarity)

study = optuna.create_study(direction="maximize")
# Stop after 300 trials or 8 hours (28,800 s), whichever comes first
study.optimize(objective, n_trials=300, timeout=28800)

[I 2023-11-10 02:06:42,782] A new study created in memory with name: no-name-7d3bbe02-1605-4b0f-951a-75c65b15be1c
[I 2023-11-10 02:10:04,884] Trial 0 finished with value: 1.663729050480949 and parameters: {'vector_size': 134, 'sg': 1, 'alpha': 0.004128997057427607, 'window': 7, 'epochs': 9}. Best is trial 0 with value: 1.663729050480949.
[I 2023-11-10 02:12:59,763] Trial 1 finished with value: 0.9350824913347667 and parameters: {'vector_size': 134, 'sg': 1, 'alpha': 0.001363027648561294, 'window': 5, 'epochs': 10}. Best is trial 0 with value: 1.663729050480949.
[I 2023-11-10 02:15:16,363] Trial 2 finished with value: 0.49285070636415523 and parameters: {'vector_size': 104, 'sg': 1, 'alpha': 0.00019177050090095983, 'window': 4, 'epochs': 9}. Best is trial 0 with value: 1.663729050480949.
[I 2023-11-10 02:18:55,145] Trial 3 finished with value: 0.5657513814724531 and parameters: {'vector_size': 142, 'sg': 1, 'alpha': 0.00023376071998626105, 'window': 3, 'epochs': 15}. Best is trial 0 with value: 1.663729050480949.
[I 2023-11-10 02:21:05,845] Trial 4 finished with value: 1.9349473215547983 and parameters: {'vector_size': 71, 'sg': 1, 'alpha': 0.00797292792114812, 'window': 7, 'epochs': 6}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:22:11,387] Trial 5 finished with value: 0.25835717782530054 and parameters: {'vector_size': 81, 'sg': 0, 'alpha': 0.0001437364003613756, 'window': 6, 'epochs': 11}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:23:14,198] Trial 6 finished with value: 0.26826887446549164 and parameters: {'vector_size': 142, 'sg': 0, 'alpha': 0.0002048186135954998, 'window': 7, 'epochs': 10}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:23:55,698] Trial 7 finished with value: 0.26272924307945733 and parameters: {'vector_size': 51, 'sg': 0, 'alpha': 0.0001406255965404354, 'window': 3, 'epochs': 8}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:25:25,996] Trial 8 finished with value: 0.30217020722828275 and parameters: {'vector_size': 119, 'sg': 0, 'alpha': 0.00025612332825710913, 'window': 4, 'epochs': 15}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:25:54,609] Trial 9 finished with value: 0.260298227608561 and parameters: {'vector_size': 71, 'sg': 0, 'alpha': 0.00041793436029817927, 'window': 9, 'epochs': 5}. Best is trial 4 with value: 1.9349473215547983.
[I 2023-11-10 02:28:46,488] Trial 10 finished with value: 2.2253115230021625 and parameters: {'vector_size': 82, 'sg': 1, 'alpha': 0.009457093472682237, 'window': 10, 'epochs': 5}. Best is trial 10 with value: 2.2253115230021625.
[I 2023-11-10 02:31:38,161] Trial 11 finished with value: 2.2804843113716173 and parameters: {'vector_size': 81, 'sg': 1, 'alpha': 0.009841815787437013, 'window': 10, 'epochs': 5}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:35:29,542] Trial 12 finished with value: 2.2036980901238037 and parameters: {'vector_size': 91, 'sg': 1, 'alpha': 0.008213873619744306, 'window': 10, 'epochs': 6}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:39:23,832] Trial 13 finished with value: 1.7307862540553205 and parameters: {'vector_size': 102, 'sg': 1, 'alpha': 0.0037741338467705823, 'window': 10, 'epochs': 7}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:41:54,193] Trial 14 finished with value: 2.1425528527751503 and parameters: {'vector_size': 52, 'sg': 1, 'alpha': 0.008530963089947751, 'window': 9, 'epochs': 5}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:47:04,577] Trial 15 finished with value: 2.0849035259521425 and parameters: {'vector_size': 69, 'sg': 1, 'alpha': 0.00377317438188043, 'window': 9, 'epochs': 12}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:50:34,788] Trial 16 finished with value: 1.0688441642084539 and parameters: {'vector_size': 87, 'sg': 1, 'alpha': 0.001646181398202101, 'window': 8, 'epochs': 7}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:53:47,700] Trial 17 finished with value: 1.2039912869429406 and parameters: {'vector_size': 116, 'sg': 1, 'alpha': 0.002277884725502711, 'window': 10, 'epochs': 5}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 02:56:22,831] Trial 18 finished with value: 0.7905828601168843 and parameters: {'vector_size': 66, 'sg': 1, 'alpha': 0.0007202234779437889, 'window': 8, 'epochs': 7}. Best is trial 11 with value: 2.2804843113716173.
[I 2023-11-10 03:03:14,186] Trial 19 finished with value: 2.6183334496284645 and parameters: {'vector_size': 92, 'sg': 1, 'alpha': 0.009244531006159412, 'window': 8, 'epochs': 13}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:09:51,702] Trial 20 finished with value: 2.233886686561053 and parameters: {'vector_size': 111, 'sg': 1, 'alpha': 0.005537656179691586, 'window': 8, 'epochs': 13}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:16:43,712] Trial 21 finished with value: 2.180954739959346 and parameters: {'vector_size': 115, 'sg': 1, 'alpha': 0.004920134253854277, 'window': 8, 'epochs': 13}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:20:46,197] Trial 22 finished with value: 2.102535789102967 and parameters: {'vector_size': 96, 'sg': 1, 'alpha': 0.005910119244720405, 'window': 6, 'epochs': 13}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:28:22,540] Trial 23 finished with value: 2.4769320102421903 and parameters: {'vector_size': 109, 'sg': 1, 'alpha': 0.006128116803550544, 'window': 9, 'epochs': 14}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:37:19,879] Trial 24 finished with value: 1.9163664664163287 and parameters: {'vector_size': 123, 'sg': 1, 'alpha': 0.0028764541240429656, 'window': 9, 'epochs': 14}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:43:34,584] Trial 25 finished with value: 2.342861270438214 and parameters: {'vector_size': 106, 'sg': 1, 'alpha': 0.006060848252702381, 'window': 9, 'epochs': 12}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:44:45,482] Trial 26 finished with value: 0.35749815798405105 and parameters: {'vector_size': 107, 'sg': 0, 'alpha': 0.006506621995349536, 'window': 7, 'epochs': 12}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:53:55,317] Trial 27 finished with value: 1.865746054539092 and parameters: {'vector_size': 126, 'sg': 1, 'alpha': 0.0027148089573914504, 'window': 9, 'epochs': 14}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 03:58:37,607] Trial 28 finished with value: 2.2482956541609256 and parameters: {'vector_size': 96, 'sg': 1, 'alpha': 0.005983387139446794, 'window': 8, 'epochs': 12}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 04:02:58,957] Trial 29 finished with value: 1.8755039514105354 and parameters: {'vector_size': 130, 'sg': 1, 'alpha': 0.0046172207771722875, 'window': 7, 'epochs': 11}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 04:10:38,770] Trial 30 finished with value: 2.1546789208226507 and parameters: {'vector_size': 109, 'sg': 1, 'alpha': 0.004084472054449898, 'window': 9, 'epochs': 14}. Best is trial 19 with value: 2.6183334496284645.
[I 2023-11-10 04:16:29,921] Trial 31 finished with value: 2.72252877663004 and parameters: {'vector_size': 80, 'sg': 1, 'alpha': 0.009614574544583181, 'window': 10, 'epochs': 11}. Best is trial 31 with value: 2.72252877663004.
[I 2023-11-10 04:22:06,315] Trial 32 finished with value: 2.5087111398522146 and parameters: {'vector_size': 97, 'sg': 1, 'alpha': 0.007228118580643188, 'window': 10, 'epochs': 11}. Best is trial 31 with value: 2.72252877663004.
[I 2023-11-10 04:28:56,668] Trial 33 finished with value: 2.6865386694672604 and parameters: {'vector_size': 90, 'sg': 1, 'alpha': 0.009821625347381076, 'window': 10, 'epochs': 11}. Best is trial 31 with value: 2.72252877663004.
[I 2023-11-10 04:35:50,596] Trial 34 finished with value: 2.723951371893208 and parameters: {'vector_size': 90, 'sg': 1, 'alpha': 0.009987583682113695, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:41:13,825] Trial 35 finished with value: 2.5927240877810824 and parameters: {'vector_size': 88, 'sg': 1, 'alpha': 0.009810381268375508, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:42:47,932] Trial 36 finished with value: 1.4054864433971648 and parameters: {'vector_size': 76, 'sg': 1, 'alpha': 0.007977508544771266, 'window': 2, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:46:20,527] Trial 37 finished with value: 1.9486868790002105 and parameters: {'vector_size': 91, 'sg': 1, 'alpha': 0.007387487002297432, 'window': 5, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:47:21,036] Trial 38 finished with value: 0.38600386475581555 and parameters: {'vector_size': 59, 'sg': 0, 'alpha': 0.009925015891907581, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:51:59,614] Trial 39 finished with value: 2.0275692356377375 and parameters: {'vector_size': 76, 'sg': 1, 'alpha': 0.004647273892646452, 'window': 8, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:53:02,174] Trial 40 finished with value: 0.36466781321463676 and parameters: {'vector_size': 83, 'sg': 0, 'alpha': 0.007252489043657421, 'window': 6, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 04:58:35,493] Trial 41 finished with value: 2.6046449335299515 and parameters: {'vector_size': 89, 'sg': 1, 'alpha': 0.009903066852546837, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:03:16,156] Trial 42 finished with value: 2.6708701559623673 and parameters: {'vector_size': 76, 'sg': 1, 'alpha': 0.009880580442248559, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:08:27,079] Trial 43 finished with value: 2.490127760101252 and parameters: {'vector_size': 76, 'sg': 1, 'alpha': 0.0074202995900503835, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:12:40,301] Trial 44 finished with value: 2.288865298302825 and parameters: {'vector_size': 84, 'sg': 1, 'alpha': 0.007535585970537558, 'window': 9, 'epochs': 8}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:17:55,235] Trial 45 finished with value: 2.0523235691517345 and parameters: {'vector_size': 94, 'sg': 1, 'alpha': 0.005148026878895813, 'window': 10, 'epochs': 8}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:19:04,579] Trial 46 finished with value: 0.3754888386582906 and parameters: {'vector_size': 63, 'sg': 0, 'alpha': 0.008543611892701346, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:25:02,017] Trial 47 finished with value: 2.006439467106196 and parameters: {'vector_size': 102, 'sg': 1, 'alpha': 0.0037526345416018336, 'window': 9, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:27:43,787] Trial 48 finished with value: 2.1480278139872397 and parameters: {'vector_size': 77, 'sg': 1, 'alpha': 0.009881165053475168, 'window': 5, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:31:38,522] Trial 49 finished with value: 2.2261385563616494 and parameters: {'vector_size': 71, 'sg': 1, 'alpha': 0.006661741825715051, 'window': 7, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:39:28,020] Trial 50 finished with value: 2.4202791931058303 and parameters: {'vector_size': 100, 'sg': 1, 'alpha': 0.005043885874920666, 'window': 10, 'epochs': 15}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:44:52,037] Trial 51 finished with value: 2.51792152108987 and parameters: {'vector_size': 88, 'sg': 1, 'alpha': 0.00866390539406381, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:49:12,270] Trial 52 finished with value: 2.4359112210688036 and parameters: {'vector_size': 80, 'sg': 1, 'alpha': 0.008500872637206237, 'window': 9, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 05:54:17,490] Trial 53 finished with value: 2.3028878745868138 and parameters: {'vector_size': 92, 'sg': 1, 'alpha': 0.006826201485730979, 'window': 10, 'epochs': 8}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:00:08,616] Trial 54 finished with value: 2.7007018481004796 and parameters: {'vector_size': 85, 'sg': 1, 'alpha': 0.009998740847257678, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:05:25,864] Trial 55 finished with value: 2.5347663458148655 and parameters: {'vector_size': 84, 'sg': 1, 'alpha': 0.008555996225681305, 'window': 9, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:10:39,674] Trial 56 finished with value: 2.374046169884489 and parameters: {'vector_size': 70, 'sg': 1, 'alpha': 0.005602710599256451, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:11:52,045] Trial 57 finished with value: 0.44792054127059056 and parameters: {'vector_size': 80, 'sg': 0, 'alpha': 0.009992889940718072, 'window': 8, 'epochs': 13}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:15:47,335] Trial 58 finished with value: 2.374086715194274 and parameters: {'vector_size': 65, 'sg': 1, 'alpha': 0.006611520888716835, 'window': 9, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:19:32,361] Trial 59 finished with value: 1.9328280597089655 and parameters: {'vector_size': 148, 'sg': 1, 'alpha': 0.008314149182630405, 'window': 4, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:25:18,204] Trial 60 finished with value: 2.324370590803694 and parameters: {'vector_size': 99, 'sg': 1, 'alpha': 0.005537857182849513, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:30:50,882] Trial 61 finished with value: 2.5893705507480704 and parameters: {'vector_size': 89, 'sg': 1, 'alpha': 0.009710893782974843, 'window': 10, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:36:50,083] Trial 62 finished with value: 2.510988484029619 and parameters: {'vector_size': 86, 'sg': 1, 'alpha': 0.008067874350531223, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:42:45,402] Trial 63 finished with value: 2.2829653663963616 and parameters: {'vector_size': 93, 'sg': 1, 'alpha': 0.006535210284704072, 'window': 9, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:47:01,131] Trial 64 finished with value: 2.49735899692451 and parameters: {'vector_size': 78, 'sg': 1, 'alpha': 0.008826964560693044, 'window': 10, 'epochs': 8}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:51:51,793] Trial 65 finished with value: 2.340810691623696 and parameters: {'vector_size': 85, 'sg': 1, 'alpha': 0.007299921654803313, 'window': 9, 'epochs': 9}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 06:58:22,423] Trial 66 finished with value: 2.479660493203295 and parameters: {'vector_size': 74, 'sg': 1, 'alpha': 0.005836739640861985, 'window': 10, 'epochs': 13}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:05:12,977] Trial 67 finished with value: 2.676389397607351 and parameters: {'vector_size': 90, 'sg': 1, 'alpha': 0.009894126225176915, 'window': 9, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:12:32,956] Trial 68 finished with value: 2.557665522342926 and parameters: {'vector_size': 95, 'sg': 1, 'alpha': 0.007833498861467688, 'window': 9, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:13:37,676] Trial 69 finished with value: 0.31971401590743537 and parameters: {'vector_size': 73, 'sg': 0, 'alpha': 0.004351782896369535, 'window': 8, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:19:16,363] Trial 70 finished with value: 2.382231800735131 and parameters: {'vector_size': 82, 'sg': 1, 'alpha': 0.006330866216240428, 'window': 9, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:26:05,833] Trial 71 finished with value: 2.6906067868235852 and parameters: {'vector_size': 90, 'sg': 1, 'alpha': 0.009956879155314157, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:32:14,606] Trial 72 finished with value: 2.70209436380254 and parameters: {'vector_size': 98, 'sg': 1, 'alpha': 0.009101006683469522, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:38:13,693] Trial 73 finished with value: 2.604883850443865 and parameters: {'vector_size': 104, 'sg': 1, 'alpha': 0.008571927212076861, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:45:42,098] Trial 74 finished with value: 2.6107270776554556 and parameters: {'vector_size': 90, 'sg': 1, 'alpha': 0.007497369730784524, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:52:16,152] Trial 75 finished with value: 2.623385673838762 and parameters: {'vector_size': 86, 'sg': 1, 'alpha': 0.008999323235015688, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 07:58:21,667] Trial 76 finished with value: 2.540285505588257 and parameters: {'vector_size': 97, 'sg': 1, 'alpha': 0.007025171633204193, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:03:41,700] Trial 77 finished with value: 2.644918153317703 and parameters: {'vector_size': 99, 'sg': 1, 'alpha': 0.009885788580190461, 'window': 9, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:10:49,676] Trial 78 finished with value: 2.437553908786567 and parameters: {'vector_size': 79, 'sg': 1, 'alpha': 0.005362516928904418, 'window': 10, 'epochs': 13}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:17:21,591] Trial 79 finished with value: 2.4652922576437906 and parameters: {'vector_size': 93, 'sg': 1, 'alpha': 0.007854493455925928, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:19:43,180] Trial 80 finished with value: 1.512582792025237 and parameters: {'vector_size': 104, 'sg': 1, 'alpha': 0.006125912929259305, 'window': 3, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:25:01,370] Trial 81 finished with value: 2.621378266798035 and parameters: {'vector_size': 99, 'sg': 1, 'alpha': 0.00967943401861381, 'window': 9, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:31:57,838] Trial 82 finished with value: 2.652092104043864 and parameters: {'vector_size': 91, 'sg': 1, 'alpha': 0.009054192013482641, 'window': 9, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:38:48,706] Trial 83 finished with value: 2.7017254922690026 and parameters: {'vector_size': 82, 'sg': 1, 'alpha': 0.008959380490581063, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:45:36,323] Trial 84 finished with value: 2.635035458586408 and parameters: {'vector_size': 82, 'sg': 1, 'alpha': 0.007701145797472685, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:53:27,461] Trial 85 finished with value: 2.590492914038667 and parameters: {'vector_size': 87, 'sg': 1, 'alpha': 0.0067489161503374966, 'window': 10, 'epochs': 13}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 08:54:24,162] Trial 86 finished with value: 0.3828563128159569 and parameters: {'vector_size': 73, 'sg': 0, 'alpha': 0.008838580622264957, 'window': 10, 'epochs': 10}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 09:01:22,470] Trial 87 finished with value: 2.6716605195566645 and parameters: {'vector_size': 84, 'sg': 1, 'alpha': 0.008080935764242882, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 09:09:25,414] Trial 88 finished with value: 2.6374969216073887 and parameters: {'vector_size': 95, 'sg': 1, 'alpha': 0.008087834458900605, 'window': 10, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 09:13:58,840] Trial 89 finished with value: 1.9568559031641433 and parameters: {'vector_size': 84, 'sg': 1, 'alpha': 0.005014834943247649, 'window': 6, 'epochs': 12}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 09:20:49,282] Trial 90 finished with value: 2.47888558852738 and parameters: {'vector_size': 89, 'sg': 1, 'alpha': 0.006999884373477716, 'window': 10, 'epochs': 11}. Best is trial 34 with value: 2.723951371893208.
[I 2023-11-10 09:26:46,225] Trial 91 finished with value: 2.797097136974645 and parameters: {'vector_size': 68, 'sg': 1, 'alpha': 0.009080514810688597, 'window': 10, 'epochs': 13}. Best is trial 91 with value: 2.797097136974645.
[I 2023-11-10 09:34:14,590] Trial 92 finished with value: 2.8594166372918477 and parameters: {'vector_size': 55, 'sg': 1, 'alpha': 0.008914282422828311, 'window': 10, 'epochs': 13}. Best is trial 92 with value: 2.8594166372918477.
[I 2023-11-10 09:42:57,940] Trial 93 finished with value: 2.8266867123277954 and parameters: {'vector_size': 61, 'sg': 1, 'alpha': 0.00885826767599969, 'window': 10, 'epochs': 14}. Best is trial 92 with value: 2.8594166372918477.
[I 2023-11-10 09:50:47,474] Trial 94 finished with value: 2.6924916606425637 and parameters: {'vector_size': 53, 'sg': 1, 'alpha': 0.006072291231641318, 'window': 10, 'epochs': 14}. Best is trial 92 with value: 2.8594166372918477.
[I 2023-11-10 09:53:08,010] Trial 95 finished with value: 1.7784221701173493 and parameters: {'vector_size': 53, 'sg': 1, 'alpha': 0.008873629699315391, 'window': 2, 'epochs': 14}. Best is trial 92 with value: 2.8594166372918477.
[I 2023-11-10 10:01:12,363] Trial 96 finished with value: 2.6882409617421983 and parameters: {'vector_size': 55, 'sg': 1, 'alpha': 0.006147592085417427, 'window': 10, 'epochs': 14}. Best is trial 92 with value: 2.8594166372918477.
[I 2023-11-10 10:09:40,854] Trial 97 finished with value: 2.7633235293120517 and parameters: {'vector_size': 59, 'sg': 1, 'alpha': 0.007032144557262477, 'window': 10, 'epochs': 14}. Best is trial 92 with value: 2.8594166372918477.

Results

Best Parameters and Test Score

print(f"Best Parameters: {study.best_params}\n")

w2v_model = Word2Vec(
    **study.best_params,
    sentences=filtered_sentences,
    workers=max(1, multiprocessing.cpu_count() - 1),
    seed=42
)

similarity = [analogy_accuracy(w2v_model, analogy) for analogy in test]

print("Results for Test Set:")
print(f"    Mean: {round(np.nanmean(similarity), 3)}, Std: {round(np.nanstd(similarity), 3)}")

Best Parameters: {'vector_size': 55, 'sg': 1, 'alpha': 0.008914282422828311, 'window': 10, 'epochs': 13}

Results for Test Set:
    Mean: 0.487, Std: 0.174

Optimization History

plot_optimization_history(study)

Parallel Coordinate

plot_parallel_coordinate(study)

Hyperparameter Importance

plot_param_importances(study)

Contour

plot_contour(study)

Slice

plot_slice(study)

Timeline

plot_timeline(study)