JPWO2020009045A1

JPWO2020009045A1 - Otsu-gram using a delayed memory matrix

Info

Publication number: JPWO2020009045A1
Application number: JP2020528844A
Authority: JP
Inventors: 弘崇新妻
Original assignee: 弘崇新妻
Priority date: 2018-07-03
Filing date: 2019-06-29
Publication date: 2021-10-14
Also published as: WO2020009045A2; JP2021120766A

Abstract

[Problem]
An n-gram is a widely used feature for time-series data, for example in natural language processing, but the value of n has to be tuned by hand. We propose otsu-gram, which removes this manual tuning and selects appropriate features automatically. Applying correspondence analysis to this feature yields distributed word representations that are substantially more accurate than those of the conventional methods word2vec and fastText. However, computing the feature and the word representations requires a very large amount of memory, which makes the computation difficult.

[Solution]
The need to tune n by hand is removed by choosing n with Otsu's binarization. The memory-intensive parts of the feature computation are evaluated lazily, so that the computation fits in a small amount of memory. We also propose a method that keeps part of the data on disk and still computes quickly with little memory. This memory-saving technique is broadly applicable and can be used for other sparse matrix operations.

[Selected drawing] Fig. 3

Description

For n-grams, which are widely used as features of time-series data of many kinds, including natural language processing, the invention extracts appropriate features automatically without hand-tuning n. The memory shortage that arises in this computation is solved by changing the type of the data and overloading operators for that type, with almost no rewriting of the program code.

The present invention extends the n-gram, a feature widely used for time-series data of many kinds, including natural language processing, and applies the extension. The invention is applicable to time-series data outside natural language processing as well, but for simplicity the description below focuses on natural language processing, and on English in particular. The application where the invention is most effective is the computation of distributed word representations in natural language processing. A distributed word representation expresses each word as a vector of numbers. Its quality is usually evaluated through the similarity between words: when the similarity between two words is taken to be the inner product of their vectors, the accuracy of the representation can be judged by whether the inner products take reasonable values. For example, one checks whether inequalities such as (similarity of "train" and "car") > (similarity of "plane" and "car") hold in the expected order across many word pairs, and reports the fraction of orderings that are correct. Data sets that list such similarity orderings and can serve as benchmarks include the ws353_similarity (Sim) data set proposed in Non-Patent Document 4, the ws353_relatedness (Rel) data set proposed in Non-Patent Document 5, the MEN data set proposed in Non-Patent Document 6, and the M.Turk data set proposed in Non-Patent Document 7. Table 1 uses these data sets to compare the accuracy of existing word-representation methods with that of the present invention. OG, PY, and TC are the methods newly proposed by the present invention, and their combinations are evaluated as well.

word2vec, fastText, and GloVe are the well-known methods for computing distributed word representations, and no earlier method was substantially more accurate than these. The present invention improves accuracy substantially. For example, on the Sim and Rel benchmarks in Table 1, where fastText and word2vec reach an accuracy of about 0.6, the present invention reaches 0.7 or higher. In this comparison the text8 corpus, an English data set commonly used in natural language processing research, was used as training data for the word representations.
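The ordering-based evaluation described above can be illustrated with a minimal sketch. The function names, the tuple layout of the judgements, and the toy 2-dimensional vectors are all assumptions made for illustration; they are not taken from the patent or from the benchmark data sets.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ordering_accuracy(vectors, judgements):
    """Fraction of judgements (a, b, c, d), each meaning
    sim(a, b) > sim(c, d), that the vectors reproduce."""
    correct = 0
    for a, b, c, d in judgements:
        if dot(vectors[a], vectors[b]) > dot(vectors[c], vectors[d]):
            correct += 1
    return correct / len(judgements)


# Toy vectors, made up for illustration only.
vectors = {"train": [1.0, 0.2], "car": [0.9, 0.3], "plane": [0.1, 1.0]}
judgements = [
    ("train", "car", "plane", "car"),    # reproduced by the toy vectors
    ("plane", "train", "train", "car"),  # not reproduced by the toy vectors
]
accuracy = ordering_accuracy(vectors, judgements)
print(accuracy)
```

With these toy vectors one of the two orderings is reproduced, so the reported accuracy is one half.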

n-grams are used as features of time-series data of many kinds, including natural language processing. The value of n is normally tuned to suit the application. For example, one analysis of English part-of-speech tagging finds that 2-grams and 3-grams are the important features and that accuracy drops when 4-grams or longer are used. Until now, n has been chosen by the developer's intuition. The present invention performs processing that amounts to choosing n automatically. Specifically, Otsu's binarization, which is used in image processing as an automatic parameter-tuning method, is applied to n-grams. In addition, n is tuned to a suitable value for each word. For example, when analyzing the English sentence "this is a pen", the processing automatically does the equivalent of using a 2-gram for the probability that "is" follows "this", a 3-gram for "a" following "this", and a 4-gram for "pen" following "this". However, applying this computation to English text such as English Wikipedia can require 100 gigabytes of memory or more. The English Wikipedia text contains at least 100,000 distinct words, and computing Otsu's binarization for every combination of these words needs nearly 100 matrices, each with 100,000 or more rows and columns. The computation is therefore hard to run on an ordinary personal computer, and this kind of analysis has not been carried out before.

Data that does not fit in memory is usually written to disk, for example by swapping. With the recent arrival of fast drives such as M.2 SSDs, reading and writing data on disk now rarely slows processing dramatically. One of the key points of the present invention is therefore to solve the memory shortage caused by the Otsu binarization described above by storing data on disk.

Besides swapping, a commonly used way of handling data that does not fit in memory is a technique called in-memory zip: applying zip, normally used for file compression, to data held in memory compresses the data and reduces memory use. However, decompressing the data expands the original, huge, uncompressed data in memory again, so the technique is only a stopgap and in the end does not reduce memory use. Solving this problem requires either a mechanism that uses the data in computations while it is still compressed, or a mechanism that evaluates compression and decompression lazily at the right moments.

The present invention also proposes a lossy alternative to lossless compression such as zip, in which unnecessary data is deleted.

To keep the explanation of the n-gram extension simple, the number of occurrences of a word sequence in which exactly k arbitrary words lie between word w₁ and word w₂ is written as

(Equation 1)
#( w₁ *^k w₂ )

For example, for the English text "this is this is this is this is this":

(Equation 2)

#(this is) = #(this *⁰ is) = 4

#(this * * is) = #(this *² is) = 2

#(this * is) = #(this *¹ is) = 0
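As a rough illustration of the notation #( w₁ *^k w₂ ), the sketch below counts gapped word pairs with a sliding window. The function name gap_counts and its parameters are hypothetical. The sketch counts every pair of positions, which is an assumption: the text does not spell out how overlapping matches are counted, so results for heavily repeated patterns may differ from the patent's own convention.

```python
from collections import Counter


def gap_counts(tokens, k_max):
    """Count #(w1 *^k w2) for every ordered word pair and every gap k < k_max."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for k in range(k_max):
            j = i + k + 1  # position of w2 when k words lie in between
            if j >= len(tokens):
                break
            counts[(w1, tokens[j], k)] += 1
    return counts


tokens = "this is a pen this was a pen".split()
counts = gap_counts(tokens, k_max=4)
print(counts[("this", "a", 0)], counts[("this", "a", 1)])
```

In this toy sentence "a" never directly follows "this" but always appears one word later, which mirrors the pattern the description discusses for Fig. 1.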

Fig. 1 plots how #( this *^k a ) varies with k, as a histogram, for the text8 corpus, an English data set commonly used in natural language processing research. The word "this" is usually followed directly by a verb, as in "this is a" or "this was a", and English text containing "this a" hardly exists. As a result #( this *⁰ a ) is small and #( this *¹ a ) is large. For k > 1 the occurrences of "a" appear to be random and related to words other than "this". The relationship between "this" and "a" therefore only needs to be considered for k < 2, that is, (strength of the relationship between "this" and "a") = #( this *⁰ a ) + #( this *¹ a ). To select the range k < 2 automatically, the histogram of #( this *^k a ) in Fig. 1 is split with Otsu's binarization.

Fig. 2 shows the procedure that uses Otsu's binarization to compute the strength of the relationship between any pair of words ( w₁ , w₂ ), not only "this" and "a", given the histogram of #( w₁ *^k w₂ ). This amounts to tuning the n of an n-gram automatically with Otsu's binarization. We call this algorithm otsu-gram and the matrix it outputs the otsu-gram matrix. With 100,000 distinct words, the otsu-gram matrix, result in Fig. 2, has 100,000 rows and 100,000 columns. For example, if the information for "this" is stored in row 3 and the information for "a" in column 5, result[3,5] is the degree to which "a" appears after "this" in a meaningful way, that is, the strength of their relationship. Setting k_max to 50 or more is known to give stable output. However, with 100,000 distinct words, #( w₁ *^k w₂ ) has to be stored for every pair of words, which requires a numerical array of size 100,000 * 100,000 * 50. This is why otsu-gram needs a large amount of memory.
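Fig. 2 itself is not reproduced in this text, so the following is only a sketch of how Otsu's method could turn one histogram of #( w₁ *^k w₂ ) into a relationship strength. It assumes the gap k plays the role of the gray level and the counts play the role of the histogram heights, and that the strength is the sum of the counts at or below the chosen threshold. The function name otsu_strength, the fallback for histograms that cannot be split, and the toy histogram are assumptions.

```python
def otsu_strength(hist):
    """Return (t, strength): the gap threshold chosen by Otsu's method and
    the summed counts for gaps k <= t. hist[k] is the count for gap k."""
    total = sum(hist)
    total_moment = sum(k * h for k, h in enumerate(hist))
    best_t, best_var = None, 0.0
    w0 = 0.0  # cumulative count for gaps k <= t
    m0 = 0.0  # cumulative first moment for gaps k <= t
    for t, h in enumerate(hist[:-1]):
        w0 += h
        m0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so there is nothing to separate
        # Proportional to the between-class variance (constant factor dropped).
        between = w0 * w1 * (m0 / w0 - (total_moment - m0) / w1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    if best_t is None:
        # Assumption: when no split exists, keep all counts.
        return len(hist) - 1, float(total)
    return best_t, float(sum(hist[: best_t + 1]))


# Toy histogram over k = 0..5, invented for illustration.
threshold, strength = otsu_strength([2, 8, 1, 1, 1, 1])
print(threshold, strength)
```

The threshold found for a real corpus depends on the shape of the whole histogram, so this toy input is not meant to reproduce the k < 2 range read off Fig. 1.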

Applying correspondence analysis to the otsu-gram matrix yields vectors that can be used as distributed word representations. As the otsu-gram (OG) entries in Table 1 show, their accuracy far exceeds that of conventional methods. The otsu-gram matrix is a sparse matrix in which most elements are zero. Even so, storing it in the Compressed Sparse Row (CSR) matrix format, which skips the zero elements, can still take several gigabytes of memory or more. Worse, if correspondence analysis is applied naively to this sparse matrix, the sparse matrix turns into a dense matrix partway through the computation. The CSR format kept memory use to a few gigabytes only because most elements are zero; once the matrix becomes dense, with no zero elements, 100 gigabytes of memory or more may be needed. This problem can be avoided with the lazy-evaluation method of Patent Document 1, which keeps the sparse matrix from being expanded into a dense one and so allows the computation to run with little memory.
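The memory figures quoted in the description can be checked with simple arithmetic. The sketch below assumes 8 bytes per stored value, which the text does not state; the vocabulary size and k_max are the values given in the description.

```python
vocab = 100_000        # number of distinct words quoted in the text
k_max = 50             # number of gap values kept per word pair
bytes_per_value = 8    # assumption: 64-bit values

# One dense vocab x vocab matrix, and a dense stack of k_max such matrices.
dense_matrix_bytes = vocab * vocab * bytes_per_value
histogram_stack_bytes = dense_matrix_bytes * k_max

print(f"one dense {vocab} x {vocab} matrix: {dense_matrix_bytes / 1e9:.0f} GB")
print(f"dense stack for k = 0..{k_max - 1}: {histogram_stack_bytes / 1e12:.0f} TB")
```

Under these assumptions a single dense matrix already takes tens of gigabytes, and a slightly larger vocabulary pushes it past the 100 gigabytes mentioned above.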

If the efficiency gain from lazy evaluation proposed in Patent Document 1 could be applied automatically, it would already be implemented as a compiler optimization. Non-Patent Document 1, however, shows that applying lazy evaluation automatically can cause the computation to stop with some kind of overflow error. To keep the computation from stopping with an error, a human has to specify when lazy evaluation takes place. Non-Patent Document 2, for example, proposes an algorithm that computes the LU decomposition of a sparse matrix efficiently with lazy evaluation, and it spells out in detail at which point evaluation is deferred and how the deferred computation is carried out. The present invention uses a correspondence analysis algorithm that performs lazy evaluation at the timing described in paragraph 0006 (Background Art) of Patent Document 1. It also proposes a way of computing otsu-gram in which the timing of lazy evaluation is self-evident.

Fig. 10 gives a concrete description of the algorithm that paragraph 0006 of Patent Document 1 explains only in prose. Fig. 4 shows class delayedspmatrix, a Python implementation of the delayed sparse matrix used in Fig. 10, and Fig. 5 shows an example of its use. Fig. 4 defines the product, the transpose, and element extraction for the delayed sparse matrix. Other operations such as sum and difference can be defined just as easily but are omitted from Fig. 4 for brevity. Fig. 11 shows the randomizedSVD function, which computes the Singular Value Decomposition (SVD) of the delayed sparse matrix used in Fig. 10.

In Fig. 10, even when A is a sparse matrix, the vectors r and c rarely contain zeros, so Ξ = A/n - rc^t/n² is a dense matrix. Because this dense matrix needs an enormous amount of memory, applying correspondence analysis to large data has been difficult. However, the product of Ξ with some operand x, Ξ * x = A/n * x - rc^t/n² * x, can be evaluated by first computing the product of A and x and then subtracting the product of rc^t/n² and x, with the latter itself evaluated as r * (c^t * x) so that rc^t is never formed. In that order no memory-hungry dense matrix appears along the way, so the computation needs little memory. This is the main point of Patent Document 1. Its second point is that randomizedSVD avoids computing all singular values. Its third point is that randomizedSVD applies no operation other than the product * to the matrix being decomposed, so defining the operator * for class delayedspmatrix is all that is needed.
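The reordering of Ξ * x can be illustrated with a small sketch. The class name DelayedCentered is hypothetical and is not the delayedspmatrix of Fig. 4. The sketch assumes the usual correspondence analysis convention that n is the sum of all entries of A and that r and c are its row and column sums. A small dense NumPy array stands in for the sparse matrix to keep the example short.

```python
import numpy as np


class DelayedCentered:
    """Represents Xi = A/n - r c^T / n^2 without ever forming the dense Xi."""

    def __init__(self, A):
        self.A = A
        self.n = A.sum()
        self.r = np.asarray(A.sum(axis=1)).ravel()  # row sums
        self.c = np.asarray(A.sum(axis=0)).ravel()  # column sums

    def __matmul__(self, x):
        # (A @ x)/n first, then subtract r (c^T x)/n^2; r c^T is never built.
        low_rank = np.multiply.outer(self.r, self.c @ x)
        return (self.A @ x) / self.n - low_rank / self.n**2


A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 4.0],
              [1.0, 0.0, 0.0]])
x = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

lazy = DelayedCentered(A) @ x

# Reference result that does build the dense matrix, for comparison only.
n = A.sum()
dense = (A / n - np.outer(A.sum(axis=1), A.sum(axis=0)) / n**2) @ x
print(np.allclose(lazy, dense))
```

The only intermediate results are vectors or thin matrices with as many columns as x, which is what keeps the memory footprint close to that of A itself.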

The delayed sparse matrix implementation of Fig. 4 can also be written in object-oriented languages other than Python, such as C++. Because a human marks the parts that lazy evaluation can speed up through the data type delayedspmatrix, the overflow errors that arise with automatic optimization are avoided. The advantage of the method of Fig. 4 is that it is semi-automatic: once a human specifies the change of data type, the other necessary operations switch automatically. The randomizedSVD of Fig. 11 is the same as the randomized SVD algorithm proposed in Non-Patent Document 3 and its implementation, the randomized_svd function of the Python library scikit-learn 0.17.1. In that function the product * is implemented by a function called safe_sparse_dot, and this function is extended so that lazy evaluation is carried out when it receives a matrix of type class delayedspmatrix. Fig. 12 shows an example implementation of safe_sparse_dot. For simplicity it omits the part of the original safe_sparse_dot that switches the computation for products of non-sparse operands such as dense matrices. With the method of Fig. 4 the original program stays almost unchanged: a human specifies only the change of data type, and the other operations become efficient automatically. The same method also makes it possible to describe a sparse LU decomposition like that of Non-Patent Document 2 concisely.
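The idea that a randomized SVD touches its input only through products, so that a dispatching product function is enough to support a delayed type, can be sketched as follows. Everything here is a simplified stand-in: DelayedLowRank, safe_dot, and this randomized_svd are hypothetical names written for illustration, and they are not the code of Fig. 4, Fig. 11, Fig. 12, or scikit-learn.

```python
import numpy as np


class DelayedLowRank:
    """Hypothetical delayed matrix U @ V that is never materialised."""

    def __init__(self, U, V):
        self.U, self.V = U, V
        self.shape = (U.shape[0], V.shape[1])

    @property
    def T(self):
        return DelayedLowRank(self.V.T, self.U.T)

    def dot(self, x):
        return self.U @ (self.V @ x)  # two thin products, no big matrix


def safe_dot(a, b):
    """Dispatch on the data type: delayed matrices use their own product."""
    if isinstance(a, DelayedLowRank):
        return a.dot(b)
    return a @ b


def randomized_svd(M, k, oversample=5, seed=0):
    """Basic randomized SVD; M is only ever used through safe_dot and .T."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((M.shape[1], k + oversample))
    Q, _ = np.linalg.qr(safe_dot(M, omega))  # orthonormal basis of the range
    B = safe_dot(M.T, Q).T                   # B = Q^T M, computed as (M^T Q)^T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


rng = np.random.default_rng(1)
U0, V0 = rng.standard_normal((30, 2)), rng.standard_normal((2, 20))
Uk, s, Vt = randomized_svd(DelayedLowRank(U0, V0), k=2)
reconstruction = (Uk * s) @ Vt
print(np.allclose(reconstruction, U0 @ V0))
```

Switching the input from an ordinary array to the delayed type requires no change inside randomized_svd, which is the semi-automatic behaviour the description attributes to the data-type approach.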

The matrix X or Y output by applying the correspondence analysis algorithm of Fig. 10 to otsu-gram is the distributed word representation. The term correspondence analysis is sometimes used for the matrices obtained by normalizing X and Y by the singular values. Depending on the nature of the application, it may be better to use the word representation normalized by the singular values.

Patent Document 1: Japanese Patent Application No. 2017-007741, "Delayed Sparse Matrix"

Non-Patent Document 1: Prabhat Totoo and Hans-Wolfgang Loidl, "Lazy Data-oriented Evaluation Strategies", Proceedings of the 3rd ACM SIGPLAN Workshop on Functional High-performance Computing (2014)

Non-Patent Document 2: Shen and Tao Yang, "Efficient Sparse LU Factorization with Lazy Space Allocation", http://www.cs.rochester.edu/~kshen/papers/siam99.pdf (1999)

Non-Patent Document 3: Nathan Halko, Per-Gunnar Martinsson and Joel A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions", https://arxiv.org/abs/0909.4061 (2010)

Non-Patent Document 4: T. Zesch, C. Müller, and I. Gurevych, "Using wiktionary for computing semantic relatedness", Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI'08 (2008)

Non-Patent Document 5: E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, "A study on similarity and relatedness using distributional and wordnet-based approaches", Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09 (2009)

Non-Patent Document 6: E. Bruni, G. Boleda, M. Baroni, and N. K. Tran, "Distributional semantics in technicolor", Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (2012)

Non-Patent Document 7: K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, "A word at a time: Computing word relatedness using temporal semantic analysis", Proceedings of the 20th International Conference on World Wide Web, WWW '11 (2011)

Non-Patent Document 8: O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings", TACL (2015)

A procedure for adapting, with minimal effort and almost no changes to the code, a program written for dense matrix operations so that it executes sparse matrix operations efficiently. The procedure is then used to carry out processing that amounts to automatic selection of the n of an n-gram efficiently and with little memory.

Matrix data that does not fit in memory is stored on disk. To keep disk reads from slowing the computation, reads are not broken into small pieces (for example, one read per number); instead a large chunk is read in a single operation. Even so, each chunk is kept small enough to fit in the memory of an ordinary personal computer. The disk read is set up as a lazy evaluation so that it can be triggered at any time. Other memory-saving steps besides disk reads are likewise executed as lazy evaluations at suitably scheduled times.
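Reading a matrix from disk in large blocks rather than element by element can be sketched with NumPy's memory-mapped loading. The function name blocked_matvec, the file name, and the block size are illustrative assumptions, and the matrix here is tiny so that the example runs quickly; this is not the implementation shown in the figures.

```python
import os
import tempfile

import numpy as np


def blocked_matvec(path, x, block_rows):
    """Compute M @ x for a matrix stored in a .npy file, one row block at a time."""
    M = np.load(path, mmap_mode="r")  # maps the file; data is read on demand
    out = np.empty(M.shape[0], dtype=np.result_type(M.dtype, x.dtype))
    for start in range(0, M.shape[0], block_rows):
        # Copy one contiguous block of rows into memory in a single bulk read.
        block = np.array(M[start:start + block_rows])
        out[start:start + block_rows] = block @ x
    return out


rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(1000, 200)).astype(np.float64)
x = rng.standard_normal(200)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.npy")
    np.save(path, M)
    y = blocked_matvec(path, x, block_rows=256)

print(np.allclose(y, M @ x))
```

The block size is the tuning knob: larger blocks mean fewer reads, as long as one block still fits comfortably in memory.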

The invention makes it possible to carry out processing that amounts to determining the n of an n-gram automatically, without relying on human intuition. In natural language processing in particular, a naive implementation needs a huge amount of memory; the invention avoids this problem and processes the data efficiently. As a distributed word representation, the accuracy of the similarities is substantially better than with conventional methods, and applications are expected in the many natural language processing tasks, such as machine translation, that have used conventional distributed word representations.

Fig. 1: Histogram of #( this *^k a ), representing the co-occurrence relationship between the words "this" and "a" in English text
Fig. 2: Algorithm for computing otsu-gram
Fig. 3: Algorithm for computing otsu-gram with little memory
Fig. 4: Python implementation of the delayed memory matrix
Fig. 5: Usage example of the Python implementation of the delayed memory matrix
Fig. 6: Python implementation of the delayed file matrix
Fig. 7: Usage example of the Python implementation of the delayed file matrix
Fig. 8: Algorithm for multiplying compressed matrix data without decompressing it
Fig. 9: Algorithm for multiplying Huffman-coded compressed matrix data without decompressing it
Fig. 10: Algorithm for computing correspondence analysis on sparse matrix data with little memory
Fig. 11: randomized SVD algorithm
Fig. 12: safe_sparse_dot function

Fig. 3 rewrites the loop of the algorithm of Fig. 2 as matrix computations. The matrix C^result output in Fig. 3 is the same otsu-gram matrix as result in Fig. 2. In Fig. 3 the values of #( w₁ *^k w₂ ) are stored in the matrices N^k. If N^k is implemented like class delayed_file_matrix in Fig. 6, only the data that is needed is read into memory and data that is no longer needed is removed from memory, so the processing becomes efficient.
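Fig. 3 is not reproduced in this text, so the following is only a guess at how a per-pair Otsu split can be expressed as elementwise matrix operations over the matrices N^k, holding one N^k in memory at a time. The function name otsu_gram, the callback load_nk, the two-pass structure, and the roles given to the cumulative matrices c, a, c_total, and a_total are assumptions; the text names C, A, C^total, and A^total but does not define them.

```python
import numpy as np


def otsu_gram(load_nk, k_max, shape):
    """load_nk(k) returns the count matrix N^k; only one N^k is live at a time."""
    c_total = np.zeros(shape)  # total count per word pair
    a_total = np.zeros(shape)  # total first moment (sum of k * count) per pair
    for k in range(k_max):     # pass 1: totals
        nk = load_nk(k)
        c_total += nk
        a_total += k * nk
    c, a = np.zeros(shape), np.zeros(shape)
    best, result = np.zeros(shape), np.zeros(shape)
    for k in range(k_max - 1):  # pass 2: try every threshold k
        nk = load_nk(k)
        c += nk
        a += k * nk
        w1 = c_total - c
        ok = (c > 0) & (w1 > 0)  # both classes non-empty
        mu0 = np.divide(a, c, out=np.zeros(shape), where=ok)
        mu1 = np.divide(a_total - a, w1, out=np.zeros(shape), where=ok)
        between = np.where(ok, c * w1 * (mu0 - mu1) ** 2, 0.0)
        better = between > best
        best = np.where(better, between, best)
        result = np.where(better, c, result)  # counts at or below the threshold
    # Assumption: pairs with no valid split keep their total count.
    return np.where(best > 0, result, c_total)


# Toy stack: stack[k] is N^k for a 2 x 2 vocabulary, invented for illustration.
stack = np.zeros((6, 2, 2))
stack[:, 0, 0] = [2, 8, 1, 1, 1, 1]
stack[:, 0, 1] = [1, 1, 1, 1, 8, 2]
stack[:, 1, 1] = [0, 7, 0, 0, 0, 0]
C_result = otsu_gram(lambda k: stack[k], k_max=6, shape=(2, 2))
print(C_result)
```

In a setting like the one described, load_nk would read N^k from disk on demand, so peak memory is a handful of vocabulary-sized matrices rather than k_max of them.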

Fig. 6 shows class delayed_file_matrix, a matrix data type extended so that reading matrix data from disk can be evaluated lazily. Fig. 7 shows an example run. Both are implemented in Python. A method that lets operations that place data in memory, such as reading from disk, be deferred and executed later when they are needed is called a delayed file matrix. The delayed file matrix is one application of the Delayed Sparse Matrix of Patent Document 1, and delayed_file_matrix is implemented as a class inheriting from delayedspmatrix, the implementation of Delayed Sparse Matrix. Fig. 6 implements the __mul__ function for the product; other matrix operations can be implemented in the same way. In Fig. 6 the deferred disk read is triggered when a product is executed. This is why Fig. 3 contains the product of matrix N^k with 1, which is meaningless as a numerical computation. In Fig. 3 disk reads occur only in products with N^k, which is why Fig. 6 implements the product. This method is faster than piecemeal processing that reads from disk for each individual matrix element. The += operation can be implemented in the same way as the product. With +=, the matrices C, A, C^total, and A^total of Fig. 3 can also be made delayed_file_matrix objects, which reduces memory use further.
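The behaviour described for the delayed file matrix, where a product triggers the disk read and multiplying by 1 simply forces the load, can be sketched as below. DelayedFileMatrix is a hypothetical, simplified stand-in: unlike the delayed_file_matrix of Fig. 6 it does not inherit from delayedspmatrix, it uses dense .npy files, and the loads counter exists only to make the lazy behaviour visible.

```python
import os
import tempfile

import numpy as np


class DelayedFileMatrix:
    """Remembers a file path and reads the matrix only when a product needs it."""

    def __init__(self, path):
        self.path = path
        self.loads = 0  # number of times the file has been read

    def _load(self):
        self.loads += 1
        return np.load(self.path)

    def __matmul__(self, other):
        # The loaded matrix is a temporary and can be freed after the product.
        return self._load() @ other

    def __mul__(self, scalar):
        # "N * 1" has no numerical effect; it just forces the deferred read.
        return self._load() * scalar


with tempfile.TemporaryDirectory() as tmp:
    matrices = []
    for k in range(3):
        path = os.path.join(tmp, f"N{k}.npy")
        np.save(path, np.full((2, 2), k + 1.0))
        matrices.append(DelayedFileMatrix(path))  # nothing is read yet
    total = sum(N * 1 for N in matrices)  # one file in memory at a time
    product = matrices[2] @ np.ones(2)

print(total, product, [m.loads for m in matrices])
```

Constructing the objects costs no memory beyond the path string, so a long list of N^k matrices can be declared up front and consumed one at a time.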

In Python and in other languages, classes that store data such as matrices generally provide member functions to save and load that data. However, the standard Python class for storing sparse matrices has no member functions for saving and loading data. Figure 6 therefore defines class mymatrix, which adds a member function that loads data. class delayed_file_matrix can be applied to any class that has a conventional member function for loading data. This implementation approach can be used not only in Python but also in other object-oriented languages such as C++. Figure 7 shows a usage example of delayed_file_matrix in which the file "tmp.npz" saved on disk is read by lazy evaluation at the time of a matrix product operation. The product operations that appear in Fig. 3 can be implemented in the same way as the execution example in Fig. 7.
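The lazy-read idea described above can be illustrated with a minimal sketch. It is not the code of Fig. 6 or Fig. 7: the class names ListMatrix and LazyFileMatrix, the JSON file format, and the load_count attribute are all assumptions made for illustration, and only the standard library is used so that the sketch stays self-contained.

```python
import json
import os
import tempfile


class ListMatrix:
    """Hypothetical stand-in for mymatrix: a dense matrix with a load function."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.rows, f)

    def __mul__(self, other):
        # Plain matrix product of two ListMatrix objects.
        cols = list(zip(*other.rows))
        return ListMatrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                           for row in self.rows])


class LazyFileMatrix:
    """Holds only a file path; the file is read when a product is requested."""

    def __init__(self, path, matrix_cls=ListMatrix):
        self.path = path
        self.matrix_cls = matrix_cls
        self.load_count = 0  # assumption: kept only to make the laziness visible

    def __mul__(self, other):
        self.load_count += 1
        loaded = self.matrix_cls.load(self.path)  # the disk read happens here
        return loaded * other  # 'loaded' can be freed once the product is done


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "tmp.json")
        ListMatrix([[1, 2], [3, 4]]).save(path)
        lazy = LazyFileMatrix(path)
        before = lazy.load_count          # nothing has been read yet
        product = lazy * ListMatrix([[1, 0], [0, 1]])
        return before, lazy.load_count, product.rows
```

Because LazyFileMatrix only relies on matrix_cls having a load function, any matrix class with such a function could be plugged in, which mirrors the applicability stated in the text.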

If the load function of mymatrix in Fig. 6 decompresses data held in an in-memory zip instead of reading data from a file, memory usage can be reduced through data compression rather than disk storage. The reduction in memory usage with this method is smaller than when disk is used, but the method is much faster.
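As a rough sketch of this variant, the load function below decompresses bytes that are kept in RAM. The class name InMemoryCompressedMatrix is hypothetical, and zlib with pickle is used here as an assumed stand-in for the in-memory zip of the text.

```python
import pickle
import zlib


class InMemoryCompressedMatrix:
    """Hypothetical mymatrix-like class whose load() decompresses bytes held in RAM."""

    def __init__(self, rows):
        raw = pickle.dumps(rows)
        self.raw_size = len(raw)          # size before compression, for comparison
        self.blob = zlib.compress(raw)    # only the compressed bytes are kept

    def load(self):
        # Decompression replaces the disk read of the file-based version.
        return pickle.loads(zlib.decompress(self.blob))


# A 50 x 50 identity matrix: mostly zeros, so it compresses well.
rows = [[1 if i == j else 0 for j in range(50)] for i in range(50)]
matrix = InMemoryCompressedMatrix(rows)
print(matrix.raw_size, len(matrix.blob))
```

Note that load() here still materializes the whole matrix at once, which is the limitation the row-by-row approach addresses.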

Reducing memory usage with in-memory zip had the problem that decompressing the compressed data requires a large amount of memory. The algorithms in Fig. 8 and Fig. 9 solve this problem. In Figs. 8 and 9, the data is compressed separately for each row of the matrix. When a matrix product is executed, the compressed data is decompressed row by row. As a result, every row except the one currently being multiplied stays compressed, and the operation can be carried out with little memory. CompressedDot in Fig. 8 and HuffConpresDot in Fig. 9 are the functions that execute the product operation. Fig. 9 differs from Fig. 8 in that it first computes a Huffman coding for the entire matrix and then uses that optimal coding for the row-by-row compression and decompression. In data represented as a sparse matrix, most elements are generally 0, the next most frequent element is 1, then 2, and so on: smaller values tend to appear more often. Computing the Huffman coding once for the entire matrix and then compressing therefore tends to make the operation more efficient.
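The row-by-row scheme can be sketched as follows. This is not the CompressedDot of Fig. 8: the function names compress_rows and compressed_dot are hypothetical, zlib is an assumed substitute for the compression actually used, and the product shown is a matrix-vector product for brevity.

```python
import zlib
from array import array


def compress_rows(matrix):
    """Compress each row separately so rows can be decompressed one at a time."""
    return [zlib.compress(array("d", row).tobytes()) for row in matrix]


def compressed_dot(compressed_rows, vector):
    """Matrix-vector product that holds only one decompressed row at a time."""
    result = []
    for blob in compressed_rows:
        row = array("d")
        row.frombytes(zlib.decompress(blob))  # only this row is expanded
        result.append(sum(a * b for a, b in zip(row, vector)))
    return result


matrix = [[0, 0, 1, 0], [2, 0, 0, 0], [0, 0, 0, 0]]
rows = compress_rows(matrix)
print(compressed_dot(rows, [1, 2, 3, 4]))
```

Peak extra memory is one decompressed row rather than the whole matrix. The Fig. 9 variant would replace the per-row zlib call with a shared Huffman code built once from the whole matrix, which is not shown here.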

Recall that every memory-efficient method described so far used data of class delayedspmatrix or of a type inheriting from it. These methods can be generalized and summarized as follows. A 2-gram expressed as a matrix is usually handled as a sparse matrix rather than a dense matrix because most of its elements are zero. The same is true of the otsu-gram matrix. In natural language processing and elsewhere, data that is normally handled as a sparse matrix often needs several tens of times more memory when it is expanded in memory as a dense matrix. Similarly, zip-compressed data often becomes several times larger when decompressed. Data written to a file may be larger than the available memory, and expanding such data in memory as a dense matrix causes a memory overflow. Sparse matrices, compressed data, file data, and other data that may cause a memory overflow when expanded in memory as a dense matrix are collectively called overmemory matrices. The singular value decomposition of an overmemory matrix can be computed with the randomizedSVD shown in Fig. 11. This program is the same as the original randomizedSVD; the only difference is that the operator * is also defined for overmemory matrices, and that definition is lazily evaluated. Thus, to apply matrix calculation programs written on the assumption of dense matrices, such as singular value decomposition, eigenvalue decomposition, and LU decomposition, to overmemory matrices such as sparse matrices, it is enough to overload the product and sum operators so that the overmemory matrix data type can also be operated on. However, simply changing the data type results in a memory overflow. A human therefore finds the places in the algorithm where memory overflow occurs and marks them in the program as class delayedspmatrix, or as data of a type inheriting from it, so that only those parts are lazily evaluated. The computation appropriate to the data type is then executed according to the marks, and no memory overflow occurs. The correspondence analysis in Fig. 10, which uses randomizedSVD, is one example. This framework, together with class delayedspmatrix, is called the delayed memory matrix. Non-Patent Document 1 states that it is difficult for a compiler to perform this optimization automatically, so a human must find the points to optimize.
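The following sketch illustrates the operator-overload idea on a small case. It assumes the overmemory matrix has the form S - r c^T (a sparse table minus a rank-one expected table, as arises in correspondence analysis); the class names DenseMatrix and LazyCenteredMatrix and the helper apply_twice are hypothetical and are not taken from Fig. 4, Fig. 10, or Fig. 11.

```python
class DenseMatrix:
    """Ordinary dense matrix stored as a list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def __mul__(self, x):
        return [sum(a * b for a, b in zip(row, x)) for row in self.rows]


class LazyCenteredMatrix:
    """Represents S - r c^T without ever building it densely.

    S is a sparse matrix stored as {(i, j): value}; r and c are vectors.
    """

    def __init__(self, sparse, r, c):
        self.sparse, self.r, self.c = sparse, r, c

    def __mul__(self, x):
        # (S - r c^T) x = S x - r (c . x): only the nonzeros of S are visited.
        y = [0.0] * len(self.r)
        for (i, j), v in self.sparse.items():
            y[i] += v * x[j]
        cx = sum(cj * xj for cj, xj in zip(self.c, x))
        return [yi - ri * cx for yi, ri in zip(y, self.r)]


def apply_twice(matrix, x):
    """Dense-style routine: it needs only the * operator, not a specific type."""
    return matrix * (matrix * x)


S = {(0, 0): 0.5, (1, 1): 0.5}
r = c = [0.5, 0.5]
dense = DenseMatrix([[0.25, -0.25], [-0.25, 0.25]])  # S - r c^T written out
lazy = LazyCenteredMatrix(S, r, c)
x = [1.0, 3.0]
print(apply_twice(dense, x), apply_twice(lazy, x))
```

The routine apply_twice is unchanged between the two calls; only the type of the argument marks where the dense expansion must be avoided.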

The otsu-gram applies Otsu's binarization to the histogram in Fig. 1. However, the rightmost region of Fig. 1 can be cut off by a method simpler than Otsu's binarization. Let L be the total number of words in the text, counting duplicates. Using the probability P("this") that "this" appears in the text and the probability P("a") that "a" appears in the text, the following approximation holds. Note that this approximation does not depend on k.

(Equation 3)

$$\#(\text{this} *^k \text{a}) \approx P(\text{"this"})\, P(\text{"a"})\, L$$

Occurrence counts lower than this approximate value can be regarded as noise. Using a delta function δ that treats the noise terms as zero (it takes the value 1 when its condition holds and 0 otherwise), we make the following approximation.

(Equation 4)

$$\#(w_1 *^k w_2) \approx \#^{\mathrm{cut}}(w_1 *^k w_2) = \#(w_1 *^k w_2)\, \delta\big(\#(w_1 *^k w_2) > P(w_1)\, P(w_2)\, L\big)$$

Here #^cut denotes counting co-occurrences while ignoring noise. With this approximation, the otsu-gram matrix can be approximated as follows.

(Equation 5)

$$C^{TC}[w_1, w_2] = \sum_{k < k_{\max}} \#^{\mathrm{cut}}(w_1 *^k w_2)$$

This approximation C^TC is called the tailcut-gram (TC). TC in Table 1 shows the accuracy when distributed word representations are computed by applying correspondence analysis to C^TC. The otsu-gram can also be computed on #^cut(w₁ *^k w₂) instead of #(w₁ *^k w₂). This result is shown as OG+TC in Table 1.
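A minimal sketch of the tailcut-gram count is given below. Several points are assumptions not fixed by the text: `w1 *^k w2` is read as "w2 follows w1 with exactly k words in between", k starts at 0, P(w) is estimated as the word frequency divided by L, and the function name tailcut_gram is hypothetical.

```python
from collections import Counter


def tailcut_gram(tokens, k_max):
    """Sum over k < k_max of the counts that exceed the noise level P(w1) P(w2) L."""
    L = len(tokens)
    freq = Counter(tokens)
    result = Counter()
    for k in range(k_max):
        # Pairs (w1, w2) with exactly k words between them.
        pair_counts = Counter(zip(tokens, tokens[k + 1:]))
        for (w1, w2), n in pair_counts.items():
            threshold = freq[w1] * freq[w2] / L  # equals P(w1) P(w2) L
            if n > threshold:                    # the delta-function condition
                result[(w1, w2)] += n
    return result


print(tailcut_gram("a a a a b".split(), k_max=2))
```

In this toy input the pair ("a", "a") occurs often but never more than chance predicts, so it is dropped, while ("a", "b") is kept.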

The Pitman-Yor Process is known as a method for summarizing n-gram information. When considering #(w₁ *^k w₂), applying the idea of the Pitman-Yor Process to the relationship among three kinds of words, namely w₁, w₂, and everything else (denoted *), suggests that it is reasonable to replace #(w₁ *^k w₂) with the following discounted value.

(Equation 6)

$$\#^{\mathrm{decay}}(w_1 *^k w_2) = \#(w_1 *^k w_2)\, \big(1 - P(w_1) - P(w_2)\big)^k$$

Here #^decay denotes counting in which more distant words are attenuated by the factor (1 - P(w₁) - P(w₂))^k. Using this, an approximation of the otsu-gram can be computed as follows.

(Equation 7)

$$C^{PY}[w_1, w_2] = \sum_{k < k_{\max}} \#^{\mathrm{decay}}(w_1 *^k w_2)$$

This approximation C^PY is called the self pitman-yor-gram (PY). PY in Table 1 shows the accuracy when distributed word representations are computed by applying correspondence analysis to C^PY. The otsu-gram can also be computed on #^decay(w₁ *^k w₂) instead of #(w₁ *^k w₂). This result is shown as OG+PY in Table 1.
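To give a feel for the size of the attenuation, here is a small worked example. The probabilities, the distance k, and the raw count are assumed values chosen only for illustration; they do not come from the data behind Table 1.

```latex
% Assumed values: P(w_1) = 0.05, P(w_2) = 0.02, k = 3, \#(w_1 *^3 w_2) = 10
\begin{align*}
\big(1 - P(w_1) - P(w_2)\big)^k &= (1 - 0.05 - 0.02)^3 = 0.93^3 \approx 0.804 \\
\#^{\mathrm{decay}}(w_1 *^3 w_2) &\approx 10 \times 0.804 = 8.04
\end{align*}
```

Under these assumed values a pair three words apart keeps about 80% of its raw count, and the discount grows geometrically with k.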

The ideas of the self pitman-yor-gram and the tailcut-gram can also be applied together, as in the following equation.

(Equation 8)

$$\#^{\mathrm{decay\text{-}cut}}(w_1 *^k w_2) = \#^{\mathrm{decay}}(w_1 *^k w_2)\, \delta\big(\#^{\mathrm{decay}}(w_1 *^k w_2) > P(w_1)\, P(w_2)\, L\big)$$

Approximating the otsu-gram by this method gives the following equation.

(Equation 9)

$$C^{PY+TC}[w_1, w_2] = \sum_{k < k_{\max}} \#^{\mathrm{decay\text{-}cut}}(w_1 *^k w_2)$$

PY+TC in Table 1 shows the accuracy when distributed word representations are computed by applying correspondence analysis to C^PY+TC. The otsu-gram can also be computed on #^decay-cut(w₁ *^k w₂) instead of #(w₁ *^k w₂). This result is shown as OG+PY+TC in Table 1.
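The decayed count and its combination with the tail cut can be sketched in one function. The same assumptions apply as for any small illustration of these formulas: `w1 *^k w2` is read as "w2 follows w1 with exactly k words in between", k starts at 0, P(w) is the word frequency divided by L, and the name decay_gram and its cut flag are hypothetical.

```python
from collections import Counter


def decay_gram(tokens, k_max, cut=False):
    """Sum over k < k_max of the decayed counts; with cut=True, drop noisy terms."""
    L = len(tokens)
    prob = {w: n / L for w, n in Counter(tokens).items()}
    result = {}
    for k in range(k_max):
        # Pairs (w1, w2) with exactly k words between them.
        pair_counts = Counter(zip(tokens, tokens[k + 1:]))
        for (w1, w2), n in pair_counts.items():
            decayed = n * (1.0 - prob[w1] - prob[w2]) ** k
            if cut and decayed <= prob[w1] * prob[w2] * L:
                continue  # below the noise level: treated as zero
            result[(w1, w2)] = result.get((w1, w2), 0.0) + decayed
    return result


tokens = "a b a b a c".split()
py = decay_gram(tokens, k_max=2)               # decay only
py_tc = decay_gram(tokens, k_max=2, cut=True)  # decay followed by the tail cut
print(py)
print(py_tc)
```

With cut=False the function follows the C^PY sum, and with cut=True it follows the C^PY+TC sum, where the cut is applied to each k term separately before summing.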

The self pitman-yor-gram and the tailcut-gram can be regarded as lossy compression that removes omissible elements from the otsu-gram. Table 1 shows that, for every combination of OG, PY, and TC, the computed distributed word representations are substantially more accurate than those of the existing methods word2vec, fastText, and GloVe. In the Background Art section the values in Table 1 were described as accuracy rates to keep the explanation simple, but strictly speaking they are rank correlation coefficients. For the exact calculation method behind Table 1, see Non-Patent Document 8, which performs the same kind of comparison as Table 1.

Table 1 shows an example in which distributed word representations are computed by applying the otsu-gram to the correspondence analysis algorithm in Fig. 10. The accuracy is substantially improved over existing methods.

The invention can be used as a feature for a variety of natural language processing tasks. It can also be used as a feature for time-series data outside natural language processing.

Claims

An algorithm in the following style, its implementation, and its calculation method: when an overmemory matrix as described in the embodiments for carrying out the invention, that is, a sparse matrix, compressed data, file data, or other data that may cause a memory overflow when expanded in memory as a dense matrix, is processed by a matrix calculation program that assumes dense matrices, a data type different from the standard matrix data is specified as a mark at each place where memory overflow occurs, and the operator overloads for that non-standard data type are defined as lazy evaluation so that the operations never expand the overmemory matrix in memory. Also, a data type for this lazy evaluation that, by describing the class initialization process of the data type marking the places where memory overflow occurs and the details of the processing made efficient by the operator overloads, enables memory-efficient processing with almost no changes to the original matrix calculation program that assumes dense matrices; the delayedspmatrix data type in Fig. 4; the delayed_file_matrix data type in Fig. 6; and the delayed memory matrix described in the embodiments for carrying out the invention.

An algorithm in a style that applies the method of claim 1 to reduce memory usage in the case where a matrix calculation program that assumes sparse matrices is combined with one that assumes dense matrices, so that a huge dense matrix in which the sparse matrix is expanded arises in the middle of the calculation at the point where the processing switches between the sparse-matrix and dense-matrix assumptions; its implementation; its calculation method; and a data type for that purpose.

When correspondence analysis and randomized SVD are combined to perform a calculation that takes sparse matrix data as the input contingency table, an algorithm that reduces memory usage by applying the methods of claims 2 and 1, in which a data type different from the standard matrix data is specified at the place where the difference between the sparse contingency table and the contingency table estimated from the average probabilities is calculated, which is the point where a huge dense matrix in which the sparse matrix is expanded arises in the middle of the calculation; its implementation; the algorithm of Fig. 10 and its implementation; a data type for that purpose; and an algorithm, and its implementation, that uses the delayedspmatrix data type shown in Fig. 4 by extending the safe_sparse_dot function used inside the randomized_svd function of the Python scikit-learn library.

In order to describe the co-occurrence relationships of words as a matrix in natural language processing, an algorithm that calculates a matrix describing the co-occurrence relationship of two different words by applying Otsu's binarization to histogram data, as shown in Fig. 1, of the probability that two kinds of words appear simultaneously at separated positions, with the separation distance on the x-axis, that is, the otsu-gram algorithm, and its implementation; the algorithm in Fig. 2 and its implementation; and an algorithm, and its implementation, that calculates distributed word representations by applying correspondence analysis to the word co-occurrence matrix calculated in this way.

In the otsu-gram algorithm of claim 4, an algorithm that determines in advance the set W of words to be considered and, using the matrices N_k that give the number of times any two words in the set appear k words apart, computes Otsu's binarization through matrix operations, namely the product, sum, Hadamard product, unit step function, and element-wise division, and outputs the same matrix as the otsu-gram algorithm of claim 4; its implementation; the algorithm in Fig. 3 and its implementation; an algorithm, and its implementation, that treats these matrices as overmemory matrices to reduce memory usage by the method of claim 2; an algorithm, and its implementation, that uses the delayed_file_matrix data type of Fig. 6 as the overmemory matrix; and an algorithm, and its implementation, that uses the delayed_file_matrix data type of Fig. 6 in the algorithm of Fig. 3.

An algorithm, and its implementation, that avoids the problem of needing a large amount of memory when the entire matrix data is compressed and decompressed at once, by holding the matrix data in memory compressed row by row and executing the matrix calculation row by row while decompressing one row at a time; the algorithm shown in Fig. 8 and its implementation; an algorithm, and its implementation, that further improves compression efficiency by computing, before the row-by-row compression, a Huffman coding that efficiently compresses the entire matrix and then compressing row by row; the algorithm shown in Fig. 9 and its implementation; and the algorithms of claims 1, 2, 3, and 5, and their implementations, optimized to reduce the memory usage of existing matrix calculations by performing these compression and decompression processes as lazy evaluation.

As in claims 4 and 5, in order to describe the co-occurrence relationships of words as a matrix, an algorithm, and its implementation, in which any probability that two kinds of words w₁ and w₂ appear simultaneously at separated positions that is smaller than the product P(w₁)P(w₂) of the probabilities that each word appears independently is regarded as zero, the corrected value is regarded as the probability that the two words w₁ and w₂ appear simultaneously at separated positions, and the co-occurrence relationships of words are described as a matrix by one of the following methods; and an algorithm, and its implementation, that calculates distributed word representations by applying correspondence analysis to the word co-occurrence matrix calculated in this way.

The first method of calculating the co-occurrence relationship of words is to regard simply the sum, over k up to a preset length k_max, of the probabilities that the two kinds of words w₁ and w₂ appear simultaneously k words apart as the co-occurrence relationship of the words. The second method is, as in claims 4 and 5, to regard as the co-occurrence relationship of the words the sum of all the probabilities that the two kinds of words w₁ and w₂ appear simultaneously k words apart on the shorter-distance side of the split made by Otsu's binarization.

As in claims 4 and 5, in order to describe the co-occurrence relationships of words as a matrix, an algorithm, and its implementation, in which the probability that two kinds of words w₁ and w₂ appear simultaneously k words apart is attenuated by multiplying it by (1 - P(w₁) - P(w₂))^k, the probability that neither word appears on its own in the intervening k words, the attenuated value is regarded as the probability that the two words w₁ and w₂ appear simultaneously k words apart, and the co-occurrence relationships of words are described as a matrix by one of the following methods; an algorithm, and its implementation, that calculates distributed word representations by applying correspondence analysis to the word co-occurrence matrix calculated in this way; an algorithm, and its implementation, that describes the co-occurrence relationships of words as a matrix after additionally applying, to the probability of claim 7 in which values smaller than the product P(w₁)P(w₂) of the probabilities that each word appears alone are regarded as zero, the processing that attenuates the relationship between words the farther apart they are; and an algorithm, and its implementation, that calculates distributed word representations by applying correspondence analysis to the word co-occurrence matrix calculated in this way.

The first method of calculating the co-occurrence relationship of words is to regard simply the sum, over k up to a preset length k_max, of the probabilities that the two kinds of words w₁ and w₂ appear simultaneously k words apart as the co-occurrence relationship of the words. The second method is, as in claims 4 and 5, to regard as the co-occurrence relationship of the words the sum of all the probabilities that the two kinds of words w₁ and w₂ appear simultaneously k words apart on the shorter-distance side of the split made by Otsu's binarization.

A feature extraction algorithm, and its implementation, that performs the same processing as in claims 4, 5, 7, and 8 on time-series data other than natural language.