JP2011033806A

JP2011033806A - Language model compression device, access device of language model, language model compression method, access method of language model, language model compression program, and access program of language model

Info

Publication number: JP2011033806A
Application number: JP2009179625A
Authority: JP
Inventors: Hajime Tsukada; 元塚田; Hideki Isozaki; 秀樹磯崎; Taro Watanabe; 太郎渡辺
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-07-31
Filing date: 2009-07-31
Publication date: 2011-02-17
Anticipated expiration: 2029-07-31
Also published as: JP5349193B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that reduces the data size of a trained n-gram language model while allowing the model to be accessed efficiently.

SOLUTION: A language model compression device 1 stores an n-gram language model in a language model storage section 5. A data structure conversion section 3 converts the pointers that indicate the first position of the (n+1)-grams in the data layout of the n-gram language model stored in the language model storage section 5 into a fixed-byte representation, and stores the result in a conversion data storage section 6. A pointer representation compression section 4 adds a virtual root node to the tree structure of the n-gram language model stored in the conversion data storage section 6 so that it can be treated as a trie, and compresses the pointers into a level-order unary degree sequence (LOUDS) representation. The compressed data is stored in a compression data storage section 7, for which the main memory (RAM) of a computer is mainly used.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for compressing an n-gram language model, which is obtained by dividing words into units of n characters and determining each resulting word string together with its frequency of occurrence (probability), and to a technique for accessing an n-gram language model efficiently.

As is well known, in fields such as speech recognition and statistical machine translation, the conditional probability (Equation 1) of a word string consisting of n words (called an n-gram) is widely used to model the likelihood of the output language (word string), as shown in Non-Patent Document 1.

The likelihood (Equation 3) of a word string (Equation 2) can be obtained approximately, as in Equation 4, using the conditional probability of Equation 1.

A model that gives the conditional probability of Equation 1 is called an n-gram language model. Typical values of n are 3 to 4 for speech recognition and 4 to 5 for statistical machine translation. The conditional probability of Equation 1 is computed recursively, as in Equation 5. Equation 2 is shorthand notation for the words w₁, w₂, ..., w_n.
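The equation images referred to as Equations 1 to 5 are not reproduced in this text. As a reading aid only, the block below sketches one conventional way to write these quantities, following the usual back-off formulation. The symbols α (smoothed probability) and β (back-off coefficient), the sequence length T, and the index conventions are assumptions and may differ from the patent's own notation.

```latex
% Shorthand for a word sequence (cf. Equation 2)
w_1^n = w_1, w_2, \ldots, w_n

% Likelihood of a sequence of T words, approximated with (n-1)-word histories
P(w_1^T) \approx \prod_{i=1}^{T} P\left(w_i \mid w_{i-n+1}^{i-1}\right)

% Recursive back-off: use the stored smoothed probability \alpha when the
% n-gram is in the model, otherwise back off to the shorter history
P\left(w_i \mid w_{i-n+1}^{i-1}\right) =
\begin{cases}
  \alpha\left(w_{i-n+1}^{i}\right) & \text{if } w_{i-n+1}^{i} \text{ is stored in the model} \\
  \beta\left(w_{i-n+1}^{i-1}\right)\, P\left(w_i \mid w_{i-n+2}^{i-1}\right) & \text{otherwise}
\end{cases}
```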

To use such a statistical model on a computer, it is first necessary to store information such as the smoothed probability of Equation 6 and the back-off coefficient of Equation 7 for each word string (Equation 2) that appears in the training data, and then to be able to access that information efficiently given a word string (Equation 2) as input.

To improve the performance of n-gram applications such as speech recognition and statistical machine translation, it is known to be effective, as shown in Non-Patent Document 2, not only to learn these parameters from a large amount of training data but also, by increasing the training data, to retain many variations of the correct word strings (Equation 2) that appear in it (n-gram variations).

It is also known that an n-gram language model can be represented by a tree structure (data structure), as shown in Non-Patent Document 3. Furthermore, LOUDS (Level-Order Unary Degree Sequence) is known as a method for compactly representing a trie, which is a kind of tree structure, as shown in Non-Patent Documents 4 and 5. Here, a trie is a kind of ordered tree structure with a single root node, and is also called a prefix tree.

Non-Patent Document 1: S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 3, March 1987, pp. 400-401.
Non-Patent Document 2: T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large Language Models in Machine Translation," Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858-867, 2007.
Non-Patent Document 3: B. Raj and E. W. D. Whittaker, "Lossless Compression of Language Model Structure and Word Identifiers," in Proceedings of ICASSP, vol. 1, pp. I-388 to I-391, April 2003.
Non-Patent Document 4: O'Neil Delpratt, Naila Rahman, and Rajeev Raman, "Engineering the LOUDS Succinct Tree Representation," in Proceedings of the 5th International Workshop on Experimental Algorithms, pp. 134-145, 2006.
Non-Patent Document 5: Guy Jacobson, "Space-efficient Static Trees and Graphs," in 30th Annual Symposium on Foundations of Computer Science, pp. 549-554, Nov. 1989.
Non-Patent Document 6: Dong Kyue Kim, Joong Chae Na, Ji Eun Kim, and Kunsoo Park, "Efficient Implementation of Rank and Select Functions for Succinct Representation," in Proceedings of the 5th International Workshop on Experimental Algorithms, pp. 315-327, 2005.

However, an n-gram language model trained on a large amount of data stores an extremely wide variety of n-grams, so the model representation becomes huge. In particular, because efficient access is required when using an n-gram language model, a data structure designed with access in mind tends to be large.

On the other hand, to achieve efficient access it is preferable to store the n-gram language model in main memory (RAM), but even on modern computers the capacity of main memory is limited. For example, representing a huge n-gram model trained on all Web data with the conventional method requires 40 GB or more, so it is difficult to hold in main memory except on a few very expensive computers.

Moreover, the tree-structure representation of an n-gram language model has no root node (or has more than one). Methods for compactly representing a trie are known, but because the data structure of an n-gram language model is not a trie, further ingenuity is needed to apply such methods to the representation of an n-gram language model.

The present invention has been made to solve the problems of the prior art described above, and its object is to provide a technique that reduces the data size of an n-gram language model while allowing efficient access.

Accordingly, the present invention converts the data structure of an n-gram language model used in a natural language processing system or the like into a trie (tree structure) and represents it with a shorter bit string by means of LOUDS, a method for compactly representing tries. The method is further improved by taking the characteristics of n-grams into account.

One aspect of the language model compression device according to the present invention comprises: data structure conversion means for converting the structure of an n-gram language model into a trie structure by providing a virtual root node; and compression means for compressing the trie structure converted by the data structure conversion means into a LOUDS (Level-Order Unary Degree Sequence) representation.

Another aspect of the language model compression device according to the present invention comprises data structure conversion means for storing the position (node ID) of the first node of the highest order of the n-gram language model in a storage device and converting the pointer representation that expresses the structure of the n-gram language model into a representation from which the highest-order n-grams have been removed. This aspect may further comprise compression means for storing the number N₁ of 1-grams in the storage device and compressing the structure of the n-gram language model converted by the data structure conversion means into an extended LOUDS representation that has no bits corresponding to the super node (a virtual root node that points to the root node of the trie structure) or to the highest-order n-grams.

One aspect of the language model access device according to the present invention comprises: first access means for calculating, by Equation 8, and outputting the position of the parent node, that is, the (n-1)-gram (parent), for the position x of the n-gram of an input word string; and second access means for calculating, by Equation 9, and outputting the position of the child node with the smallest word ID (first_child) among the child nodes, that is, the (n+1)-grams, for the position x of the n-gram of the input word string.

Another aspect of the language model access device according to the present invention comprises: first access means for calculating, by Equation 10, and outputting the position of the parent node, that is, the (n-1)-gram (parent), for the position x of the n-gram of an input word string; and second access means for calculating, by Equation 11, and outputting the position of the child node with the smallest word ID (first_child) among the child nodes, that is, the (n+1)-grams, for the position x of the n-gram of the input word string.

One aspect of the language model compression method according to the present invention comprises: a data structure conversion step in which data structure conversion means converts the structure of an n-gram language model into a trie structure by providing a virtual root node; and a compression step in which compression means compresses the trie structure converted in the data structure conversion step into a LOUDS (Level-Order Unary Degree Sequence) representation.

Another aspect of the language model compression method according to the present invention comprises a data structure conversion step in which data structure conversion means stores the position (node ID) of the first node of the highest order of the n-gram language model in a storage device and converts the pointer representation that expresses the structure of the n-gram language model into a representation from which the highest-order n-grams have been removed. This aspect may further comprise a compression step in which compression means stores the number N₁ of 1-grams in the storage device and compresses the structure of the n-gram language model converted by the data structure conversion means into an extended LOUDS representation that has no bits corresponding to the super node (a virtual root node that points to the root node of the trie structure) or to the highest-order n-grams.

One aspect of the language model access method according to the present invention comprises: a step in which first access means calculates, by Equation 8, and outputs the position of the parent node, that is, the (n-1)-gram (parent), for the position x of the n-gram of an input word string; and a step in which second access means calculates, by Equation 9, and outputs the position of the child node with the smallest word ID (first_child) among the child nodes, that is, the (n+1)-grams, for the position x of the n-gram of the input word string.

Another aspect of the language model access method according to the present invention comprises: a step in which first access means calculates, by Equation 10, and outputs the position of the parent node, that is, the (n-1)-gram (parent), for the position x of the n-gram of an input word string; and a step in which second access means calculates, by Equation 11, and outputs the position of the child node with the smallest word ID (first_child) among the child nodes, that is, the (n+1)-grams, for the position x of the n-gram of the input word string.

The present invention may also take the form of a program that causes a computer to function as each of the devices described above. The program can be distributed and provided recorded on a recording medium.

According to the present invention, the data size of the pointer representation that defines the structure of an n-gram language model is reduced. As a result, even for a huge n-gram language model, all or most of the pointer representation can be held in fast-access memory (RAM) instead of on a slow-access hard disk drive. The present invention thus makes it possible to access an n-gram language model efficiently in machine translation, speech recognition, and other tasks that use a huge n-gram language model, which contributes to faster processing.

FIG. 1 is a configuration diagram of a language model compression device and a language model access device according to an embodiment of the present invention.
FIG. 2 is a data structure diagram of the n-gram language model stored in the language model storage unit.
FIG. 3 is a data structure diagram of the n-gram language model after conversion by the data structure conversion unit.
FIG. 4 is a diagram showing a pointer representation based on the trie structure.
FIG. 5 is a diagram showing a pointer representation based on the n-gram structure.
FIG. 6 is a structural diagram of the trie structure.
FIG. 7 is a structural diagram of the n-gram structure.

A language model compression device and a language model access device according to an embodiment of the present invention are described below. The two devices are used in combination in a natural language processing system such as a speech recognition device or a machine translation system. The compression device reduces the data size of an n-gram language model generated by an n-gram training device, and the access device accesses the n-gram language model whose size has been reduced by the compression device during speech recognition or statistical machine translation.

Specifically, each device is implemented on a computer and has the hardware resources of an ordinary computer, such as a CPU, a main storage device such as memory (RAM), and a hard disk drive.

Through the cooperation of these hardware resources with software resources (OS, applications, and so on), as shown in FIG. 1, the compression device 1 implements a data structure conversion unit 3, a pointer representation compression unit 4, a language model storage unit 5, a converted data storage unit 6, and a compressed data storage unit 7, while the access device 2 implements an n-gram access unit (first_child) 8 and an n-gram access unit (parent) 9. The two devices 1 and 2 need not be built on separate computers and may be implemented on a single computer.

In outline, the storage units 5 to 7 are built on the hard disk drive or the main storage device. The language model storage unit 5 stores a language model generated as the result of training on a large amount of training data (for example, Web data). The language model stored here is assumed to be expressed in the ordinary "n-gram language model data structure" format.

As indicated by arrow A, the data structure conversion unit 3 takes the language model in the language model storage unit 5 as input and converts it into an n-gram language model data structure in which the pointers have a fixed-byte representation (data conversion step). The converted data is stored in the converted data storage unit 6, as indicated by arrow B.

As indicated by arrow C, the pointer representation compression unit 4 takes the data stored in the converted data storage unit 6 as input and reduces its size by applying processing that represents the pointer array compactly (pointer representation compression step). That is, it compresses the data into an n-gram language model data structure in which the pointers have a LOUDS representation and stores the result in the compressed data storage unit 7, as indicated by arrow D.

The access device 2 takes as input a word string (n-gram) supplied through input means (not shown) and, as indicated by arrows E and F, accesses the data stored in the compressed data storage unit 7, that is, the compressed pointer representation of the n-gram language model (access step). Specifically, the access unit 8 takes the position of an n-gram (word string) as input, as indicated by arrow G, refers to the compressed data storage unit 7, and outputs the position of the (n+1)-gram, as indicated by arrow H. The access unit 9 takes the position of an n-gram (word string) as input, as indicated by arrow I, refers to the compressed data storage unit 7, and outputs the position of the (n-1)-gram, as indicated by arrow J. The functional blocks 3 to 9 of the two devices 1 and 2 are described in detail below.

<<Language Model Compression Device 1>>
(1) Data structure conversion unit 3
First, the data stored in the language model storage unit 5, that is, the n-gram data structure, is described. The n-grams are represented by the data structure of FIG. 2 (see Non-Patent Document 2), and the 1-grams, 2-grams, 3-grams, and so on are each held in a separate table in the language model storage unit 5.

Each entry of these n-gram tables has the word ID (word id) of the word w_n, the smoothed probability value (probability) of Equation 6, the back-off coefficient (back-off) of Equation 7, and a pointer (pointer) indicating the first position of the (X+1)-grams (X is an integer with 1 ≤ X ≤ n). Each table is represented as an array of such 4-tuples. An index into the array can be used as the pointer; for example, a pointer can be realized as a 4-byte integer.

Here, the contiguous region pointed to by the pointer of an X-gram (X is an integer with 1 ≤ X ≤ n) stores the (X+1)-grams that share the history of Equation 12, sorted in order of the word ID of w_{x+1}.

The end boundary of this region is defined by the location pointed to by the pointer of the next X-gram entry. To look up a given X-gram, a binary search is performed for each of its constituent words, for a total of X binary searches. If the word ID is defined as the position of each row in the 1-gram table, the word ID column of the 1-grams can be omitted, as shown in FIG. 2.
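The lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the toy tables, the names `word_ids`, `pointers`, and `find_ngram`, and the sentinel pointer appended to each pointer list (so that the last entry's region also has an end boundary) are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical toy model: word_ids[k] holds the last word of each (k+1)-gram.
word_ids = [
    [0, 1, 2],        # 1-grams: the word ID equals the row position
    [1, 2, 0, 2],     # 2-grams, grouped by parent and sorted by word ID
    [0, 2, 1],        # 3-grams
]
# pointers[k][i] is the row of the first child of entry i in the next table.
# Entry i owns rows [pointers[k][i], pointers[k][i + 1]); the final value of
# each list is an assumed sentinel equal to the length of the next table.
pointers = [
    [0, 2, 3, 4],
    [0, 0, 2, 3, 3],
]

def find_ngram(words):
    """Return the row of the n-gram `words` in its table, or None if absent."""
    if not 0 <= words[0] < len(word_ids[0]):
        return None
    row = words[0]  # 1-gram row is the word ID itself, so no search is needed
    for order, word in enumerate(words[1:]):
        lo, hi = pointers[order][row], pointers[order][row + 1]
        table = word_ids[order + 1]
        # Binary search restricted to the region owned by the current entry.
        pos = bisect_left(table, word, lo, hi)
        if pos == hi or table[pos] != word:
            return None
        row = pos
    return row
```

In this sketch the first word is resolved by direct indexing because the 1-gram word ID is the row position; each further word costs one binary search over the parent's region.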

The data structure conversion unit 3 converts the data structure of FIG. 2 stored in the language model storage unit 5 so that, as shown in FIG. 3, the word IDs, smoothed probability values, back-off coefficients, and pointers are each held in a separate array, and stores the arrays in the converted data storage unit 6. The 2-gram information is appended after the 1-grams, the 3-gram information after the 2-grams, and so on in order. The order boundaries (the start positions of the 2-grams, 3-grams, ...) are stored separately in the main storage device so that the information for each order can be accessed.
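A minimal Python sketch of this conversion is shown below. The function name `to_parallel_arrays`, the input layout (one list of 4-tuples per order), and the decision to rebase each table-local pointer onto the concatenated array are assumptions made for illustration; the text does not specify how pointers are renumbered after concatenation.

```python
def to_parallel_arrays(tables):
    """Flatten per-order tables of (word_id, prob, backoff, pointer) rows.

    Returns four parallel arrays plus `boundaries`, the start offset of
    each order (1-grams, 2-grams, ...) in the concatenated arrays.
    """
    word_ids, probs, backoffs, pointers, boundaries = [], [], [], [], []
    offset = 0
    for table in tables:
        boundaries.append(offset)
        next_start = offset + len(table)  # where the next order begins
        for word_id, prob, backoff, pointer in table:
            word_ids.append(word_id)
            probs.append(prob)
            backoffs.append(backoff)
            # Assumption: rebase the table-local pointer so that it indexes
            # the concatenated array instead of the next table.
            pointers.append(next_start + pointer)
        offset = next_start
    return word_ids, probs, backoffs, pointers, boundaries
```

Keeping the four fields in separate arrays isolates the pointer array, which is the only part that the later compression step rewrites.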

For the pointer array, either the representation based on the trie structure in FIG. 4 (first form) or the representation based on the n-gram structure in FIG. 5 (second form) is used. FIGS. 4 and 5 show the pointer structures corresponding to the tree structures of FIGS. 6 and 7, respectively. The pointers that make up the array are assumed to be expressed as fixed-byte (for example, 4-byte) integers.

In the second form, the position of the first highest-order entry of the n-gram language model is stored, and the pointers of the highest-order n-grams, which the pointer array of the first form stores from that position onward, are not stored. However, only the pointer array omits the highest-order n-gram information; the word IDs (word id), the smoothed probability values of Equation 6 (probability), and the back-off coefficients of Equation 7 (back-off) are still stored for the highest-order n-grams.

(2) Pointer representation compression unit 4
The compression unit 4 refers to the converted data storage unit 6, compactly represents the array that holds the pointers among the arrays produced by the data structure conversion unit 3, and stores it in the converted data storage unit 6. The specific content of each form is described below.

<First form>
First, the first form of the compression unit 4 is described. A trie is a kind of tree structure with a single root node. The n-gram data structure shown in FIG. 2 is not a trie because it does not have a single root node, but here a virtual root node is provided so that the n-gram data structure can be treated as a trie, and LOUDS (see Non-Patent Documents 4 and 5), a compact trie representation, is applied to compress the pointer array.

In detail, in the LOUDS representation, node IDs are assigned to the nodes starting from the root node, level by level in the order 1-gram, 2-gram, ..., and from left to right within each level, that is, in breadth-first order. A node (parent node) with d children (child nodes), d ≥ 0, is represented by the bit string "1^d 0", where "1^d" denotes a run of d "1" bits. A super-root node is also provided as a further virtual node that points to the virtual root node. Because the root node is the only child of the super-root node, the super-root node is represented by "10".
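The encoding just described can be sketched in Python as follows. This is an illustrative sketch only: the function name `louds_bits`, the dictionary-based tree input, and the 7-node example tree are assumptions and are not the tree of FIG. 6.

```python
from collections import deque

def louds_bits(children, root=0):
    """Return the LOUDS bit string of a tree given as {node: [child, ...]}.

    Nodes are visited breadth-first; a node with d children emits
    '1' * d followed by '0'. The leading '10' is the super-root, whose
    only child is the root.
    """
    bits = ["10"]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)

# Hypothetical tree whose node IDs are already in breadth-first order:
# 0 -> (1, 2, 3), 1 -> (4, 5), 3 -> (6)
example = louds_bits({0: [1, 2, 3], 1: [4, 5], 3: [6]})
```

For this 7-node example the result is `101110110010000`, which has 7 ones and 8 zeros.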

FIG. 6 shows an example of a trie structure. Table 1 shows the pointer representation of this example trie as a LOUDS bit string.

In the example trie, the root node "0" has four children and is therefore represented by four "1"s followed by one "0" that marks the end. Since a trie with M nodes is represented by (M+1) "0"s and M "1"s, it is represented by 2M+1 bits in total. Denoting the number of X-grams by N_x and using Equation 13, Equation 14 holds for an n-gram language model. The pointers of an n-gram language model are therefore represented with the number of bits given by Equation 15. The pointers compressed in this way are stored in the compressed data storage unit 7.
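Equations 13 to 15 are not reproduced in this text. As a hedged reading aid, the bit count can be worked out as below, assuming that the M nodes of the trie are the virtual root node plus every n-gram of every order; the patent's own Equations 13 to 15 may be written differently.

```latex
% Assumed node count: one virtual root plus all X-grams, X = 1..n
M = 1 + \sum_{x=1}^{n} N_x

% LOUDS uses (M+1) zeros and M ones, i.e. 2M+1 bits
2M + 1 = 2\left(1 + \sum_{x=1}^{n} N_x\right) + 1 = 2\sum_{x=1}^{n} N_x + 3
```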

To access an n-gram language model compressed as described above, an operation select₁(i) is defined on the LOUDS bit string that returns the position of the i-th "1" in the bit string. Bit positions are assumed to start from 0, as shown in Table 1. Similarly, select₀(i) can be defined as the operation that returns the position of the i-th "0" in the bit string.

select_b(i) (b = 0 or 1) can be implemented efficiently using a method such as that of Non-Patent Document 6. Using this operation, the function parent(x), which returns the ID of the parent of node x (the node whose node ID is x), and the function first_child(x), which returns the ID of the first child of node x, are realized by Equations 8 and 9.

For example, the parent of node 9 is found to be node 2 from parent(9) = select₁(9+1) - 9 - 1 = 12 - 9 - 1 = 2. Likewise, the first child of node 9 is found to be node 14 from first_child(9) = select₀(9+1) - 9 = 23 - 9 = 14.
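The pattern of the worked example, parent(x) = select₁(x+1) - x - 1 and first_child(x) = select₀(x+1) - x, can be tried out with the Python sketch below. The bit string `BITS` encodes a hypothetical 7-node tree, not the tree of FIG. 6, and `select` is a naive linear scan used for clarity rather than the efficient method of Non-Patent Document 6.

```python
# Hypothetical tree: 0 -> (1, 2, 3), 1 -> (4, 5), 3 -> (6), with the
# super-root '10' in front. Bit positions start at 0.
BITS = "101110110010000"

def select(bits, b, i):
    """Position of the i-th (1-based) occurrence of bit b; linear scan."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("bit string has fewer than i occurrences of b")

def parent(bits, x):
    """Node ID of the parent of node x (-1 stands for the super-root)."""
    return select(bits, "1", x + 1) - x - 1

def first_child(bits, x):
    """Node ID at which the children of node x begin."""
    return select(bits, "0", x + 1) - x
```

With this bit string, `parent(BITS, 6)` evaluates to 3 and `first_child(BITS, 1)` evaluates to 4, matching the tree given in the comment.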

However, the fact that first_child(x) returns a node ID of 0 or greater does not guarantee that node x has a child node. The necessary and sufficient condition for node x to have a child node is first_child(x) ≠ first_child(x+1). The range of IDs of the children of node x is the half-open interval [first_child(x), first_child(x+1)). The n-gram access device 2 uses these functions to access the n-gram language model.
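The child-range rule can be sketched in Python as follows. The helper names `child_range` and `has_children` and the example bit string (a hypothetical 7-node tree with a leading super-root "10") are assumptions; the zero positions are collected by a simple scan rather than by an efficient select structure.

```python
def child_range(bits, x):
    """Half-open range of child node IDs of node x in a LOUDS bit string."""
    zeros = [pos for pos, bit in enumerate(bits) if bit == "0"]
    # zeros[x] is the position of the (x+1)-th '0', i.e. select_0(x + 1).
    first = zeros[x] - x            # first_child(x)
    end = zeros[x + 1] - (x + 1)    # first_child(x + 1)
    return range(first, end)

def has_children(bits, x):
    """True exactly when first_child(x) != first_child(x + 1)."""
    return len(child_range(bits, x)) > 0

# Hypothetical tree: 0 -> (1, 2, 3), 1 -> (4, 5), 3 -> (6)
bits = "101110110010000"
```

Because a LOUDS string for M nodes contains M+1 zeros, `zeros[x + 1]` exists even for the last node, so the range test needs no special case at the end of the string.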

<Second form>
Next, the second form of the compression unit 4 is described. Here the LOUDS representation used in the first form is extended so that it can be expressed even more compactly by exploiting the properties of n-grams.

That is, as shown in FIGS. 2 and 3, an n-gram model has no information to store in the root node. The root node is therefore deleted, and the virtual super root node points directly to the 1-gram nodes. Node IDs are numbered so that the first 1-gram node is 0. This removes the 2 bits of the old root node.

An n-gram model also has a structure with levels 1 through n, and the nodes of the highest order n, which form the lowest level, have no child nodes. Letting N_X denote the number of X-gram nodes, the number of nodes of the highest order is N_n, and N_n "0" bits are redundant. The node ID of the first node of the highest order is therefore stored in the main storage device, and when the n-gram language model is accessed, nodes from that node ID onward are judged to have no child nodes. This allows the N_n "0" bits, that is, N_n bits, to be eliminated.

Furthermore, the super root node is represented by Equation 16, but it can be removed by storing the number N₁ of 1-grams. The tree structure is thus generated, and bits are assigned, as though these nodes did not exist. This saves N₁ + 1 bits. As a result, the compressed pointer representation of the second form has no bits corresponding to the super node and no bits corresponding to the highest-order n-grams (an extended LOUDS representation). The compressed pointers are stored in the compressed data storage unit 7.

To access the n-gram language model compressed according to the second form, Equations 8 and 9 of the first form are replaced by Equations 10 and 11.
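Equations 10 and 11 are not reproduced in this text. The worked example for node 8 given with FIG. 7 and Table 2 is consistent with the following forms; they are a reconstruction from that example, under the assumption that i in select_b(i) counts from 1 and bit positions count from 0.

```latex
\begin{align}
\mathrm{parent}(x) &= \mathrm{select}_1(x + 1 - N_1) + N_1 - x \tag{10, reconstructed} \\
\mathrm{first\_child}(x) &= \mathrm{select}_0(x) + N_1 + 1 - x \tag{11, reconstructed}
\end{align}
```

Under this reading, parent(x) is meaningful only for x ≥ N₁, since the 1-gram nodes no longer have a parent in the bit string, and first_child(0) requires the convention select₀(0) = −1.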

Here, letting Y be the node ID of the first node of the highest order, when x ≥ Y, select₁(x) = select₁(Y − 1) and select₀(x) = select₀(Y − 1) + x − Y + 1.

FIG. 7 shows the structure obtained by pruning the trie structure and optimizing it for n-grams. Table 2 shows the LOUDS bit string optimized for this structure.

Node 8 in FIG. 7 and Table 2 corresponds to node 9 in FIG. 6. Since N₁ = 4 here, the parent of node 8 is found to be node 1 from parent(8) = select₁(8 + 1 − 4) + 4 − 8 = 5 + 4 − 8 = 1. Likewise, the first child of node 8 is found to be node 13 from first_child(8) = select₀(8) + 4 + 1 − 8 = 16 + 4 + 1 − 8 = 13.
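The second-form navigation can be sketched in the same way. The code assumes that Equations 10 and 11 have the forms implied by the worked example above, parent(x) = select₁(x + 1 − N₁) + N₁ − x and first_child(x) = select₀(x) + N₁ + 1 − x. The bit string `BITS2` and the value `Y` are hypothetical (Table 2 is not reproduced here), and the convention select₀(0) = −1 is an assumption needed for node 0. Rather than extending select for x ≥ Y, this sketch simply treats every node with ID ≥ Y as childless.

```python
# Hypothetical second-form bit string: no super root, no old root,
# no bits for the highest-order nodes (24 bits for 15 nodes).
BITS2 = ("1110" "110" "10" "10" "10" "0" "10" "0" "110" "0000")
N1 = 4   # number of 1-grams (stored separately)
Y = 13   # ID of the first highest-order node (stored separately)


def select(bits: str, b: str, i: int) -> int:
    """0-based position of the i-th (1-based) bit b; -1 when i == 0."""
    pos = -1
    for _ in range(i):
        pos = bits.index(b, pos + 1)
    return pos


def parent(x: int) -> int:
    """Assumed form of Equation 10; valid for x >= N1."""
    return select(BITS2, "1", x + 1 - N1) + N1 - x


def first_child(x: int) -> int:
    """Assumed form of Equation 11; valid for x <= Y."""
    return select(BITS2, "0", x) + N1 + 1 - x


def children(x: int) -> range:
    """Child IDs of node x; highest-order nodes have none."""
    if x >= Y:
        return range(0)
    return range(first_child(x), first_child(x + 1))
```

With these definitions, parent(8) and first_child(8) reproduce the values 1 and 13 from the worked example.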

When the n-gram language model is compressed according to the first form, the pointers are represented with the number of bits given by Equation 15. Since 2 + N_n + N₁ + 1 bits are removed from this, compression according to the second form represents the pointers with the number of bits given by Equation 17, further reducing the amount of data.
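Equations 15 and 17 are not reproduced in this text. As a rough check, assume that the first form is a plain LOUDS string with a super root, which spends two bits per node plus one: one "1" in the parent's block and one closing "0" for each node, plus the closing "0" of the super root. Under that assumption, with N_k the number of k-gram nodes and one root node, the bit counts would be:

```latex
\begin{align}
B_{\text{first}} &= 2\Bigl(1 + \sum_{k=1}^{n} N_k\Bigr) + 1 \\
B_{\text{second}} &= B_{\text{first}} - (2 + N_n + N_1 + 1)
                  = 2\sum_{k=1}^{n} N_k - N_1 - N_n
                  = N_1 + 2\sum_{k=2}^{n-1} N_k + N_n
\end{align}
```

In words, under this assumption the second form spends one bit on each 1-gram and each highest-order n-gram, and two bits on every node in between. The symbols B_first and B_second are introduced here for illustration only.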

<<n-gram access device 2>>
The access device 2 takes as input a word string (Equation 2) given through the input means and refers to the compressed data storage unit 7. By following the pointers of the n-gram model compressed by the compression device 1, it accesses values such as the smoothed probability (Equation 6) and the back-off coefficient (Equation 7) and outputs them.

Specifically, the access unit (first_child) 8 takes as input the position of the n-gram w₁, ..., w_n and returns to the input means the position of the (n+1)-gram w₁, ..., w_n, w_{n+1} whose word ID for w_{n+1} is smallest. If the data stored in the compressed data storage unit 7 is the pointer representation based on the trie structure (the compression result of the first form), the position is calculated by Equation 9; if it is the pointer representation based on the n-gram structure (the compression result of the second form), it is calculated by Equation 11.

The access unit (parent) 9 takes as input the position of the n-gram w₁, ..., w_{n−1}, w_n and returns to the input means the position of the (n−1)-gram w₁, ..., w_{n−1}. If the data stored in the compressed data storage unit 7 is the pointer representation based on the trie structure (the compression result of the first form), the position is calculated by Equation 8; if it is the pointer representation based on the n-gram structure (the compression result of the second form), it is calculated by Equation 10. Equations 8 to 11 are assumed to be defined in a program or the like.
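How the two access units might be combined to locate an n-gram can be sketched as follows. Everything here is an illustrative assumption rather than the described implementation: the bit string `BITS2`, the constants `N1` and `Y`, the array `WORD_IDS` (one word ID per node ID, sorted among siblings), the function `find_node`, and the forms of Equations 10 and 11 inferred from the worked example for node 8.

```python
from bisect import bisect_left

# Hypothetical second-form bit string and side information.
BITS2 = ("1110" "110" "10" "10" "10" "0" "10" "0" "110" "0000")
N1 = 4    # number of 1-grams
Y = 13    # ID of the first highest-order node
# Hypothetical word ID of each node; siblings are sorted by word ID.
WORD_IDS = [0, 1, 2, 3,  0, 1, 3,  1, 2,  0,  2,  1,  3,  0, 2]


def select(bits, b, i):
    """0-based position of the i-th (1-based) bit b; -1 when i == 0."""
    pos = -1
    for _ in range(i):
        pos = bits.index(b, pos + 1)
    return pos


def parent(x):
    """Assumed form of Equation 10 (x >= N1)."""
    return select(BITS2, "1", x + 1 - N1) + N1 - x


def first_child(x):
    """Assumed form of Equation 11 (x <= Y)."""
    return select(BITS2, "0", x) + N1 + 1 - x


def find_node(words):
    """Return the node ID of the n-gram `words`, or None if absent."""
    lo, hi = 0, N1          # the 1-grams occupy node IDs [0, N1)
    node = None
    for w in words:
        # Binary search for word ID w among the current siblings.
        pos = bisect_left(WORD_IDS, w, lo, hi)
        if pos == hi or WORD_IDS[pos] != w:
            return None
        node = pos
        if node >= Y:       # highest-order nodes have no children
            lo = hi = 0
        else:
            lo, hi = first_child(node), first_child(node + 1)
    return node
```

A node ID found this way would index the arrays holding the probability and back-off coefficient, and parent(node) gives the node of the (n−1)-gram prefix. For instance, in this hypothetical model find_node([1, 2, 2]) is node 14 and its parent is node 8, the node of [1, 2].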

Thus, with the devices 1 and 2, the data volume of the pointer array of the n-gram language model is reduced, so most of the pointer array can be held in the main storage device of a general-purpose computer without resorting to an expensive computer, and the compressed data storage unit 7 can be used as a so-called on-memory database.

Consequently, the access units 8 and 9 can access the pointers faster, and the n-gram model can be accessed efficiently in applications such as n-gram-based machine translation and speech recognition. In addition, caching frequently used values of the n-gram language model, such as probabilities (Equation 6) and back-off coefficients (Equation 7), in a storage buffer or the like speeds up processing further.
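The caching idea can be sketched with a standard memoization decorator. The table `MODEL`, its values, and the function names are hypothetical stand-ins for the compressed model and its access path; only `functools.lru_cache` is a real library facility.

```python
from functools import lru_cache

# Hypothetical stand-in for the compressed model:
# word-ID tuple -> (log probability, back-off coefficient).
MODEL = {(1, 2): (-1.5, -0.3), (1, 2, 2): (-0.7, 0.0)}


def lookup_uncached(words):
    """Placeholder for the pointer traversal over the LOUDS bit string."""
    return MODEL.get(words)


@lru_cache(maxsize=1024)
def lookup(words):
    """Cached lookup; `words` must be hashable, e.g. a tuple of word IDs."""
    return lookup_uncached(words)
```

Calling lookup((1, 2)) twice traverses the model only once; lookup.cache_info() reports the hit and miss counts, which can help in choosing the cache size.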

<<Experimental example>>
An n-gram compression experiment carried out by the inventors to confirm the effectiveness of the compression device 1 is described. As shown in Table 3, the experiment used "English Gigaword 5-gram", trained on "English Gigaword 3rd Edition"; "English Web 1T 5-gram", published by the LDC; and "Japanese Web 1T 7-gram", published by GSK.

"count size (gzip)" in Table 3 indicates the size of each n-gram data set used in the experiment, namely the size of the ASCII text file storing the n-gram frequencies after compression with gzip.

Table 4 shows the compression results for the pointer array in the experiment. "4-byte Pointer" in Table 4 denotes the pointer array obtained when the pointers of the "n-gram language model data structure (fixed-byte pointer representation)" are represented as 4-byte integers (it corresponds to the pointer array obtained by the data structure conversion unit 3).

"Proposed method" shows the result of representing this pointer array with the LOUDS bit string optimized for n-grams (the compression result of the second form). As each column of Table 4 shows, the proposed method compresses the pointer array to about 1/10 of the size of the common representation that stores each pointer as a 4-byte integer. This shows that the data volume of the pointer array is reduced effectively.

<<Programs and the like>>
The present invention can also be configured as a program that causes a computer to function as some or all of the units 3 to 9 constituting the devices 1 and 2. In this case, the computer is caused to execute all or some of the data conversion step, the pointer representation compression step, and the access step.

This program can be provided over a network, for example through a website or by e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, or Blu-ray Disc (registered trademark) for storage and distribution. The recording medium is read using a recording medium drive, and the program code itself realizes the processing of the embodiment described above, so the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 ... Language model compression device
2 ... Language model access device
3 ... Data structure conversion unit (data structure conversion means)
4 ... Pointer representation compression unit (pointer representation compression means)
5 ... Language model storage unit
6 ... Converted data storage unit
7 ... Compressed data storage unit
8 ... n-gram access unit "first_child" (second access means)
9 ... n-gram access unit "parent" (first access means)

Claims

An apparatus for compressing a model representation of an n-gram language model obtained from an appearance frequency of a word string composed of n consecutive words,
A data structure converting means for providing a virtual root node and converting the structure of the n-gram language model into a trie structure;
Compression means for compressing and converting the trie structure converted by the data structure conversion means into a LOUDS (Level-Order Unary Degree Sequence) representation;
A language model compression apparatus comprising:

An apparatus for compressing a model representation of an n-gram language model obtained from an appearance frequency of a word string composed of n consecutive words,
Data structure conversion means for storing the position (node ID) of the first node of the highest order of the n-gram language model in a storage device, and for converting the pointer representation representing the structure of the n-gram language model into a representation from which the highest-order n-grams are deleted;
A language model compression apparatus comprising:

The language model compression apparatus according to claim 2,
Compression means for storing the number N₁ of 1-grams in a storage device, and for compressing and converting the structure of the n-gram language model converted by the data structure conversion means into a LOUDS representation extended so as to have no bits corresponding to the super node (the virtual root node pointing to the root node of the trie structure) or to the n-grams of the highest order;
A language model compression apparatus, further comprising:

An apparatus for accessing an n-gram language model with reference to storage means for storing the pointer compressed and converted by the language model compression apparatus according to claim 1,
First access means for calculating by Equation 8, and outputting, the position of the parent node "(n−1)-gram" (parent) with respect to the position x of the n-gram of the input word string;

Second access means for calculating by Equation 9, and outputting, the position of the child node (first_child) having the smallest word ID among the child nodes "(n+1)-gram" with respect to the position x of the n-gram of the input word string;

An access device for a language model, comprising:

An apparatus for accessing an n-gram language model with reference to storage means for storing the pointer compressed and converted by the language model compression apparatus according to claim 3,
First access means for calculating by Equation 10, and outputting, the position of the parent node "(n−1)-gram" (parent) with respect to the position x of the n-gram of the input word string;

Second access means for calculating by Equation 11, and outputting, the position of the child node (first_child) having the smallest word ID among the child nodes "(n+1)-gram" with respect to the position x of the n-gram of the input word string;

An access device for a language model, comprising:

A method for compressing a model representation of an n-gram language model obtained from the frequency of occurrence of a word string consisting of n consecutive words,
A data structure conversion step in which data structure conversion means provides a virtual root node and converts the structure of the n-gram language model into a trie structure;
A compression step in which a compression unit compresses and converts the trie structure converted in the data structure conversion step into a LOUDS (Level-Order Unary Degree Sequence) expression;
A language model compression method comprising:

A method for compressing a model representation of an n-gram language model obtained from the frequency of occurrence of a word string consisting of n consecutive words,
A data structure conversion step in which data structure conversion means stores the position (node ID) of the first node of the highest order of the n-gram language model in a storage device and converts the pointer representation representing the structure of the n-gram language model into a representation from which the highest-order n-grams are deleted;
A language model compression method comprising:

The language model compression method according to claim 7, wherein
A compression step in which compression means stores the number N₁ of 1-grams in a storage device and compresses and converts the structure of the n-gram language model converted in the data structure conversion step into a LOUDS representation extended so as to have no bits corresponding to the super node (the virtual root node pointing to the root node of the trie structure) or to the n-grams of the highest order;
A language model compression method further comprising:

A method for accessing an n-gram language model by referring to storage means for storing the pointer compressed and converted by the language model compression method according to claim 6,
A step in which the first access means calculates by Equation 8, and outputs, the position of the parent node "(n−1)-gram" (parent) with respect to the position x of the n-gram of the input word string;

A step in which the second access means calculates by Equation 9, and outputs, the position of the child node (first_child) having the smallest word ID among the child nodes "(n+1)-gram" with respect to the position x of the n-gram of the input word string;

A method for accessing a language model, comprising:

A method for accessing an n-gram language model with reference to storage means for storing the pointer compressed and converted by the language model compression method according to claim 8,
A step in which the first access means calculates by Equation 10, and outputs, the position of the parent node "(n−1)-gram" (parent) with respect to the position x of the n-gram of the input word string;

A step in which the second access means calculates by Equation 11, and outputs, the position of the child node (first_child) having the smallest word ID among the child nodes "(n+1)-gram" with respect to the position x of the n-gram of the input word string;

A language model access method characterized by comprising:

A language model compression program for causing a computer to function as each means constituting the language model compression apparatus according to any one of claims 1 to 3.

A language model access program for causing a computer to function as each means constituting the language model access apparatus according to claim 4.