Computational Semantics

1. Category specificity: Warrington (1975); Warrington & Shallice (1979, 1984)
2. Basic level superiority
3. Super-ordinate category preservation
Analogy by vector space

Figure 1: Mikolov, Yih, & Zweig (2013)
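Example of word2vec

Figure 2: Scatter plot of country and capital word vectors (both axes run from -2 to 2). Countries: China, Russia, Japan, Turkey, Poland, Germany, France, Italy, Spain, Greece, Portugal. Capitals: Beijing, Moscow, Ankara, Tokyo, Warsaw, Berlin, Paris, Rome, Athens, Madrid, Lisbon. Mikolov, Sutskever, Chen, Corrado, & Dean (2013)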
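The regular offset between countries and capitals in Figure 2 is what makes analogy by vector arithmetic possible. The following is a minimal sketch only: the 2-D vectors are invented for illustration (they are not trained embeddings), and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Toy 2-D vectors, invented for illustration only (not trained embeddings).
# Each capital is its country shifted by the same offset (0, 1).
vectors = {
    "Japan": np.array([1.0, 0.0]),
    "Tokyo": np.array([1.0, 1.0]),
    "France": np.array([2.0, 0.1]),
    "Paris": np.array([2.0, 1.1]),
    "Germany": np.array([3.0, -0.1]),
    "Berlin": np.array([3.0, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" by maximising cosine similarity
    to vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

For example, `analogy("Japan", "Tokyo", "France", vectors)` asks which word relates to France as Tokyo relates to Japan.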
word2vec

Figure 3: CBOW. Joulin et al. (2017)
For a word w with N sets of word vectors {c(w)_i} representing the words found in its contexts, and window size W, the empirical variance is:

\[
\Sigma_w = \frac{1}{NW} \sum_{i}^{N} \sum_{j}^{W} \left( c(w)_{ij} - w \right) \left( c(w)_{ij} - w \right)^{\top} \tag{1}
\]

This is an estimator for the covariance of a distribution assuming that the mean is fixed at w. In practice, it is also necessary to add a small ridge term δ > 0 to the diagonal of the matrix to regularize it and avoid numerical problems when inverting.
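Equation (1) with the ridge term can be sketched in NumPy as follows. The function name `empirical_covariance`, the array layout `(N, W, d)` and the default value of `delta` are assumptions made for this illustration, not part of the original slides.

```python
import numpy as np


def empirical_covariance(w, contexts, delta=1e-3):
    """Empirical covariance of context vectors around a fixed mean w.

    w        : array of shape (d,), the vector of the word itself
    contexts : array of shape (N, W, d), N context sets of W vectors each
    delta    : ridge term added to the diagonal (assumed default)
    """
    N, W, d = contexts.shape
    # Deviations of every context vector from the fixed mean w.
    diffs = contexts.reshape(N * W, d) - w
    # Sum of outer products, normalised by N * W as in Eq. (1).
    sigma = diffs.T @ diffs / (N * W)
    # Ridge regularisation so that the matrix can be inverted safely.
    return sigma + delta * np.eye(d)
```

Because the mean is fixed at w rather than estimated from the contexts, the result differs from what `np.cov` would return on the same data.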
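Objective function of word2vec

Skip-gram:

\[
J = \sum_{w \in D} \sum_{c \in C} \log P(c \mid w) \tag{2}
\]

CBOW:

\[
J = \sum_{w \in D} \sum_{c \in C} \log P(w \mid c) \tag{3}
\]

where D is the corpus, C is the context of w (the words within ±h positions of w), and P(c | w) is the probability of the context word c given the word w.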
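The double sum over w in D and c in C runs over (word, context) pairs taken from a sliding window. A minimal sketch of how such pairs could be enumerated is shown below; the function name `context_pairs` and the choice to truncate the window at the sequence boundaries are assumptions for this illustration.

```python
def context_pairs(tokens, h=2):
    """List all (w, c) pairs where c lies within +/- h positions of w."""
    pairs = []
    for i, w in enumerate(tokens):
        # Clip the window at the start and end of the token sequence.
        lo = max(0, i - h)
        hi = min(len(tokens), i + h + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((w, tokens[j]))
    return pairs
```

Skip-gram scores each pair as P(c | w), whereas CBOW predicts w from its context words.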
Figure 4: Relations between NTT-DB and word2vec (2017)
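Negative Sampling

P(c | w) is given by the softmax function:

\[
P(c \mid w) = \frac{\exp\left( v_w^{\top} \tilde{v}_c \right)}{\sum_{w'} \exp\left( v_w^{\top} \tilde{v}_{w'} \right)} \tag{4}
\]

Mikolov et al. (2013) approximate it as:

\[
\log P(c \mid w) \approx \log \sigma\left( v_w^{\top} \tilde{v}_c \right) + \kappa \, \mathbb{E}_{r \sim P_n} \left[ \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \right] \tag{5}
\]

where P_n is the noise distribution from which the κ negative samples r are drawn, and σ(x) = (1 + exp(-x))^{-1}.

Goldberg & Levy (2014); Levy & Goldberg (2014a): word2vec corresponds to shifted PMI, where pointwise mutual information is defined as

\[
\mathrm{pmi}(x, y) \equiv \log \frac{p(x, y)}{p(x)\, p(y)} = \log \frac{p(x \mid y)}{p(x)} = \log \frac{p(y \mid x)}{p(y)}
\]

https://en.wikipedia.org/wiki/pointwise_mutual_information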
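To make the contrast between Eq. (4) and Eq. (5) concrete, here is a small NumPy sketch. It is an illustration under stated assumptions: the function names are hypothetical, context vectors are stored as rows of `V_ctx`, and the expectation over P_n is computed exactly from a given probability vector instead of being estimated from κ drawn samples as in actual training.

```python
import numpy as np


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax_prob(v_w, V_ctx, c):
    """P(c | w) as in Eq. (4); V_ctx holds one context vector per row."""
    scores = V_ctx @ v_w
    scores = scores - scores.max()  # shift for numerical stability
    p = np.exp(scores)
    return float(p[c] / p.sum())


def negative_sampling_term(v_w, V_ctx, c, noise_probs, kappa):
    """Right-hand side of Eq. (5), with the expectation over the noise
    distribution computed exactly (an assumption of this sketch)."""
    scores = V_ctx @ v_w
    expected_neg = float(noise_probs @ np.log(sigmoid(-scores)))
    return float(np.log(sigmoid(scores[c])) + kappa * expected_neg)
```

The softmax needs a sum over the whole vocabulary for every prediction, which is the cost that negative sampling avoids by contrasting the observed pair with a few noise words.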
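Shifted PMI

\[
M_{i,j} = \mathrm{PMI}(w_i, c_j) - \log \kappa \approx w_i^{\top} w_j \tag{6}
\]

Following Levy & Goldberg (2014b), n(w, c) denotes the number of times word w co-occurs with context c, and n(w) the number of occurrences of w.

SGNS (Skip-gram with Negative Sampling)

\begin{align}
J &= \sum_{w \in D} \sum_{c \in C} \left\{ \log \sigma\left( v_w^{\top} \tilde{v}_c \right) + \kappa \, \mathbb{E}_{r \sim P_n} \left[ \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \right] \right\} \tag{7} \\
&= \sum_{w \in D} \sum_{c \in C} n(w, c) \log \sigma\left( v_w^{\top} \tilde{v}_c \right) + \sum_{w \in D} n(w) \, \kappa \, \mathbb{E}_{r \sim p_n} \left[ \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \right] \tag{8}
\end{align}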
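The matrix of Eq. (6) can be computed directly from a co-occurrence count table. The sketch below is a hedged illustration: the function name `shifted_pmi` is hypothetical, D is taken to be the total number of observed pairs, and every row and column of `counts` is assumed to contain at least one non-zero entry.

```python
import numpy as np


def shifted_pmi(counts, kappa):
    """Return M[i, j] = PMI(w_i, c_j) - log(kappa).

    counts[i, j] is n(w_i, c_j). Pairs that never co-occur get -inf,
    since log(0) is left unclipped in this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    D = counts.sum()                          # total number of pairs
    n_w = counts.sum(axis=1, keepdims=True)   # n(w): row marginals
    n_c = counts.sum(axis=0, keepdims=True)   # n(c): column marginals
    with np.errstate(divide="ignore"):        # silence log(0) warnings
        pmi = np.log(counts * D / (n_w * n_c))
    return pmi - np.log(kappa)
```

In practice the -inf entries are usually replaced, for instance by clipping at zero, before any factorization is attempted; that step is left out here.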
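With the unigram noise distribution p_n(r) = n(r) / D, the expectation expands as:

\begin{align}
\mathbb{E}_{r \sim p_n} \left[ \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \right]
&= \sum_{r \in V_c} \frac{n(r)}{D} \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \tag{9} \\
&= \frac{n(c)}{D} \log \sigma\left( -v_w^{\top} \tilde{v}_c \right) + \sum_{r \in V_c \setminus \{c\}} \frac{n(r)}{D} \log \sigma\left( -v_w^{\top} \tilde{v}_r \right) \tag{10}
\end{align}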
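For a particular pair (w, c), the terms of the objective that involve it are:

\[
\ell(w, c) = n(w, c) \log \sigma\left( v_w^{\top} \tilde{v}_c \right) + \kappa \, n(w) \frac{n(c)}{D} \log \sigma\left( -v_w^{\top} \tilde{v}_c \right) \tag{11}
\]

Let $x = v_w^{\top} \tilde{v}_c$ and set the derivative of ℓ(w, c) with respect to x to 0:

\begin{align}
\frac{\partial \ell(w, c)}{\partial x} &= n(w, c) \, \sigma(-x) - \kappa \, n(w) \frac{n(c)}{D} \, \sigma(x) \tag{12} \\
&= n(w, c) \left\{ 1 - \sigma(x) \right\} - \kappa \, n(w) \frac{n(c)}{D} \, \sigma(x) \tag{13} \\
&= 0 \tag{14}
\end{align}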
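Solving for σ(x) and using σ(x) = (1 + exp(-x))^{-1}:

\begin{align}
\left\{ 1 + \frac{\kappa \, n(w) \, n(c)}{D \, n(w, c)} \right\} \sigma(x) = 1
\quad \Longrightarrow \quad
\exp(-x) &= \frac{\kappa \, n(w) \, n(c)}{D \, n(w, c)} \tag{15} \\
x &= v_w^{\top} \tilde{v}_c \tag{16} \\
&= \log \frac{D \, n(w, c)}{\kappa \, n(w) \, n(c)} \tag{17} \\
&= \log \frac{D \, n(w, c)}{n(w) \, n(c)} - \log \kappa \tag{18} \\
&= \mathrm{PMI}(w, c) - \log \kappa \tag{19}
\end{align}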
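The closed form in Eq. (19) can be checked numerically by evaluating the per-pair objective on a grid of x values and comparing the maximiser with PMI(w, c) - log κ. This is only a sanity-check sketch: the function names and the counts used in the usage note are made up for illustration.

```python
import numpy as np


def log_sigmoid(z):
    """Numerically stable log(sigma(z)) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)


def local_objective(x, n_wc, n_w, n_c, D, kappa):
    """Per-pair objective l(w, c) as a function of x = v_w . v_c."""
    return n_wc * log_sigmoid(x) + kappa * n_w * n_c / D * log_sigmoid(-x)


def closed_form_optimum(n_wc, n_w, n_c, D, kappa):
    """x* = PMI(w, c) - log(kappa), as in Eq. (19)."""
    return np.log(D * n_wc / (n_w * n_c)) - np.log(kappa)
```

With made-up counts such as n(w, c) = 10, n(w) = 50, n(c) = 40, D = 1000 and κ = 5, evaluating `local_objective` over `np.linspace(-3, 3, 601)` and taking the argmax should land on the grid point nearest to `closed_form_optimum(10, 50, 40, 1000, 5)`.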
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2017). FastText.zip: Compressing text classification models. In Y. Bengio & Y. LeCun (Eds.), The proceedings of International Conference on Learning Representations (ICLR). Toulon, France.

(2017). wikipedia word2vec 80.

Levy, O., & Goldberg, Y. (2014a). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers) (pp. 302–308). Baltimore, Maryland, USA.

Levy, O., & Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27, pp. 2177–2185). Montréal, Canada: Curran Associates, Inc.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Weinberger (Eds.), Advances in neural information processing systems 26 (pp. 3111–3119). Curran Associates, Inc.

Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Atlanta, GA, USA.

Warrington, E. K. (1975). The selective impairment of semantic memory. Quarterly Journal of Experimental Psychology, 27, 635–657.

Warrington, E. K., & Shallice, T. (1979). Semantic access dyslexia. Brain, 102, 43–63.

Warrington, E. K., & Shallice, T. (1984). Category specific semantic impairment. Brain, 107, 829–854.