JP2016133956A

JP2016133956A - Morpheme analysis model generation device, morpheme analysis model generation method, and program

Info

Publication number: JP2016133956A
Application number: JP2015007608A
Authority: JP
Inventors: Kei Uchiumi (内海 慶); Yasushi Tsukahara (塚原 裕史)
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2015-01-19
Filing date: 2015-01-19
Publication date: 2016-07-25

Abstract

PROBLEM TO BE SOLVED: To provide a morpheme analysis model generation device that generates, by unsupervised learning, a model capable of estimating both word segmentation and parts of speech. SOLUTION: A morpheme analysis model generation device 1 includes a learning data storage part 18 that stores a plurality of sentences as learning data, and a learning part 11 that repeatedly updates the parameters of a morpheme analysis model while sampling a word segmentation of each sentence and an assignment of a part of speech to each word candidate constituting the segmentation. The morpheme analysis model is a non-parametric Bayesian model obtained from a hidden semi-Markov model, in which a character string is the observed value and the word boundaries and parts of speech in the character string are hidden classes, by applying a stochastic process to the prior distribution of the word n-gram probability of each part of speech (the emission probability) and to the prior distribution of the part-of-speech n-gram probability (the transition probability). SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for generating a morphological analysis model used in natural language processing, and in particular to a technique for generating, by unsupervised learning, a morphological analysis model that simultaneously performs word segmentation and part-of-speech estimation of a sentence to be analyzed.

Morphological analysis is a fundamental technology of computer-based natural language processing and is used as preprocessing for a variety of application tasks. Morphological analysis methods are either rule-based or based on probabilistic models. Probabilistic methods are further divided into supervised methods, which use an annotated corpus to learn the parameters, and unsupervised methods, which perform word segmentation from raw text alone.

Supervised methods include ChaSen, which performs morphological analysis using an HMM (Hidden Markov Model), and MeCab, which performs morphological analysis using CRFs (Conditional Random Fields). Proposed unsupervised methods include a word segmentation method for spoken Japanese based on the minimum description length principle (Matsubara et al.) and a method based on a nonparametric Bayesian model (Patent Document 1). A semi-supervised method that combines a nonparametric Bayesian model with CRF learning has also been proposed (Patent Document 2).

Patent Document 1: JP 2010-170252 A
Patent Document 2: JP 2012-146263 A

In web text, which has continued to grow in recent years, colloquial expressions appear frequently in addition to the written language that natural language processing has traditionally targeted. Supervised methods require an annotated corpus as training data for learning word segmentation. However, manual annotation is costly, and it is difficult to keep up with colloquial expressions, which are rapidly increasing and continually changing. In unsupervised learning, by contrast, segmentation becomes possible by updating the parameters so that some evaluation measure is optimized. However, conventional unsupervised morphological analysis has focused solely on how accurately segmentation can be performed, and no method that also covers part-of-speech estimation has been proposed. Consequently, to estimate parts of speech, it has been necessary first to segment the text with a conventional technique such as those above and then to estimate the parts of speech with a separate method.

For example, when the sentence 「この先生きのこれるか」 (roughly, "Can we survive from here on?") is segmented by a conventional technique, parts of speech are not considered, while words such as 「先生」 (teacher) and 「きのこ」 (mushroom) are recognized through learning, so that an analysis result such as 「この／先生／きのこ／れる／の／か」 may be obtained. However, if the connection relation between parts of speech, namely that the verbal suffix 「れる」 is unlikely to follow directly after the noun 「きのこ」, can also be acquired through learning, then the analysis 「この／先／生き／のこれる／の／か」, which is closer to human intuition, can be obtained. In other words, if word segmentation and part-of-speech estimation can be learned simultaneously, the segmentation accuracy is also expected to improve compared with learning them separately.

The present invention has been made in view of the above problem, and its object is to provide a morphological analysis model generation device and the like that generate, by unsupervised learning, a morphological analysis model for performing not only word segmentation but also part-of-speech estimation.

The morphological analysis model generation device of the present invention generates, by unsupervised learning, a morphological analysis model for performing word segmentation and part-of-speech estimation of a sentence to be analyzed. The morphological analysis model is a nonparametric Bayesian model obtained from a hidden semi-Markov model, in which a character string is the observed value and the word boundaries and parts of speech in the character string are hidden classes, by applying a stochastic process to the prior distribution of the per-part-of-speech word n-gram probability, which is the emission probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability. The device comprises a learning data storage unit that stores a plurality of sentences as learning data, and a learning unit that repeats, until a predetermined convergence condition is satisfied, a process of updating the parameters of the per-part-of-speech word n-gram probability and of the part-of-speech n-gram probability in the morphological analysis model while sampling a word segmentation of each sentence stored in the learning data storage unit and an assignment of a part of speech to each word candidate constituting that segmentation.

With this configuration, the parameters of the per-part-of-speech word n-gram probability and of the part-of-speech n-gram probability are learned by updating them while sampling a segmentation and an assignment of parts of speech to the word candidates constituting the segmentation. This makes it possible to generate a morphological analysis model that performs not only word segmentation but also part-of-speech estimation.

In the morphological analysis model generation device of the present invention, the learning unit may generate, for each sentence, a lattice that includes the part of speech as a parameter, and may perform the sampling while selecting the immediately preceding word candidate and the part of speech corresponding to that word candidate according to a forward probability expressed in terms of the following quantities:

t: end position of the current word candidate
k: length of the current word candidate
z: part of speech of the current word candidate
r: part of speech of the immediately preceding word candidate to be connected
j: length of the immediately preceding word candidate to be connected

A lattice is the set of paths that exhaustively connect adjacent word candidates from the beginning to the end of a sentence. The forward probability is the sum of the probabilities of all paths that reach a given word candidate. With the above configuration, the immediately preceding word candidate and its part of speech are selected according to a forward probability that also takes the part of speech into account. Repeating this selection yields a sampled morphological analysis of a sentence, and the parameters are updated based on that sample. Sampling sequentially according to forward probabilities in this way is a form of MCMC (Markov chain Monte Carlo) called blocked Gibbs sampling, and the Markov chain is known to converge to the target distribution. This configuration therefore makes it possible to generate an accurate morphological analysis model.

In the morphological analysis model generation device of the present invention, the per-part-of-speech word n-gram probability may be calculated based on the NPYLM (Nested Pitman-Yor Language Model), and the part-of-speech n-gram probability may be calculated based on the HPYLM (Hierarchical Pitman-Yor Language Model).

Because the per-part-of-speech word n-gram probability is calculated based on the NPYLM and the part-of-speech n-gram probability is calculated based on the HPYLM, appropriate estimates can be assigned even to word n-grams and part-of-speech n-grams that do not appear in the learning data, so that an accurate morphological analysis model can be generated.

The morphological analysis model generation method of the present invention generates, by unsupervised learning, a morphological analysis model for performing word segmentation and part-of-speech estimation of a sentence to be analyzed. The morphological analysis model is a nonparametric Bayesian model obtained from a hidden semi-Markov model, in which a character string is the observed value and the word boundaries and parts of speech in the character string are hidden classes, by applying a stochastic process to the prior distribution of the per-part-of-speech word n-gram probability, which is the emission probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability. The method comprises a step of receiving a plurality of sentences as learning data and storing them in a learning data storage unit, and a step of repeating, until a predetermined convergence condition is satisfied, a process of updating the parameters of the per-part-of-speech word n-gram probability and of the part-of-speech n-gram probability in the morphological analysis model while sampling a word segmentation of each sentence stored in the learning data storage unit and an assignment of a part of speech to each word candidate constituting that segmentation.

The program of the present invention is a program for generating, by unsupervised learning, a morphological analysis model for performing word segmentation and part-of-speech estimation of a sentence to be analyzed. The morphological analysis model is a nonparametric Bayesian model obtained from a hidden semi-Markov model, in which a character string is the observed value and the word boundaries and parts of speech in the character string are hidden classes, by applying a stochastic process to the prior distribution of the per-part-of-speech word n-gram probability, which is the emission probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability. The program causes a computer to execute a step of receiving a plurality of sentences as learning data and storing them in a learning data storage unit, and a step of repeating, until a predetermined convergence condition is satisfied, a process of updating the parameters of the per-part-of-speech word n-gram probability and of the part-of-speech n-gram probability in the morphological analysis model while sampling a word segmentation of each sentence stored in the learning data storage unit and an assignment of a part of speech to each word candidate constituting that segmentation.

According to the present invention, a morphological analysis model that performs not only word segmentation but also part-of-speech estimation can be generated by unsupervised learning.

FIG. 1 is a block diagram showing the configuration of the morphological analysis model generation device according to the embodiment of the present invention.
FIG. 2 is a diagram showing the graphical model of the morphological analysis model according to the embodiment of the present invention.
FIG. 3 is a diagram showing the graphical model of general part-of-speech estimation based on a hidden Markov model.
FIG. 4 is an operation flow diagram of the morphological analysis model generation device according to the embodiment of the present invention.
FIG. 5 is a flow diagram of the sampling process in the present embodiment.
FIG. 6 is a diagram showing an example of a lattice in the present embodiment.

Hereinafter, a morphological analysis model generation device according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a morphological analysis model generation device 1 according to the embodiment of the present invention. As shown in FIG. 1, the morphological analysis model generation device 1 includes an input unit 10, a learning unit 11, an output unit 16, and a storage unit 17. The learning unit 11 includes a sampling unit 12, a parameter updating unit 13, a sorting unit 14, and a hyperparameter determination unit 15. The storage unit 17 includes a learning data storage unit 18 and a model storage unit 19.

The input unit 10 receives input of a plurality of sentences that serve as learning data. The learning unit 11 repeats, until a predetermined convergence condition is satisfied, a process in which the sampling unit 12 samples a word segmentation of each sentence in the learning data and an assignment of a part of speech to each word candidate constituting that segmentation, while the parameter updating unit 13 updates the parameters of the morphological analysis model. The functions of the sorting unit 14 and the hyperparameter determination unit 15, and the detailed functions of the sampling unit 12 and the parameter updating unit 13, are described below with reference to the operation flow diagram of the morphological analysis model generation device 1.

The output unit 16 acquires the model updated by the learning unit 11 and outputs it to the model storage unit 19. The learning data storage unit 18 stores the learning data input through the input unit 10. The model storage unit 19 stores the morphological analysis model generated and updated by the learning unit 11.

The morphological analysis model generation device 1 shown in FIG. 1 is realized by a computer including a CPU, RAM, ROM, an HDD, and the like. The functions of the learning unit 11 are realized by the CPU reading and executing a program stored in the ROM. A program for realizing such a morphological analysis model generation device 1 is also included in the scope of the present invention.

If the morphological analysis result for a sentence $s$ is denoted $w$, the problem of performing morphological analysis accurately can be formulated as the problem of estimating the $w$ that maximizes the probability $p(w \mid s)$. In the embodiment of the present invention, morphological analysis means performing word segmentation and part-of-speech estimation together, so $w = \{w_1, w_2, \ldots, w_M, z_1, z_2, \ldots, z_M\}$, where $w_n$ is a word and $z_n$ is a part of speech. Assuming that the probability of a word and the probability of a part of speech depend on the context (that is, on the sequence of words and the sequence of parts of speech up to the immediately preceding word), and writing $h_i = w_1, \ldots, w_i, z_1, \ldots, z_i$, the probability $p(w \mid s)$ can be rewritten as

$$p(w \mid s) = \prod_{i=1}^{M} p(w_i, z_i \mid h_{i-1}).$$

By Bayes' rule, the factor $p(w_i, z_i \mid h_{i-1})$ on the right-hand side can be further transformed into

$$p(w_i, z_i \mid h_{i-1}) = p(w_i \mid z_i, h_{i-1})\, p(z_i \mid h_{i-1}). \qquad (1)$$

Equation (1) is the morphological analysis model that is generated and updated by learning in the embodiment of the present invention.

In equation (1), assume that the $i$-th word depends only on the preceding $N-1$ words and on the $i$-th part of speech, and that the $i$-th part of speech depends only on the preceding $N-1$ parts of speech. The two factors of equation (1) can then be rewritten as

$$p(w_i \mid z_i, h_{i-1}) = p(w_i \mid w_{i-N+1}^{i-1}, z_i) \qquad (2)$$

$$p(z_i \mid h_{i-1}) = p(z_i \mid z_{i-N+1}^{i-1}) \qquad (3)$$

The right-hand side of equation (2) is the per-part-of-speech word n-gram probability, and the right-hand side of equation (3) is the part-of-speech n-gram probability. In the embodiment of the present invention, therefore, learning the morphological analysis model of equation (1) means learning the parameters of the per-part-of-speech word n-gram probability and of the part-of-speech n-gram probability.
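As a minimal sketch of how equations (1) to (3) combine for N = 2, the following Python function scores one candidate analysis as a sum of log word-bigram and log part-of-speech-bigram probabilities. The function name `joint_log_prob`, the `<s>` start marker, and the constant toy probabilities are illustrative assumptions and do not come from the disclosure.

```python
import math

def joint_log_prob(words, tags, word_prob, tag_prob, bos="<s>"):
    """log p(w, z) = sum_i [log p(w_i | w_{i-1}, z_i) + log p(z_i | z_{i-1})] for N = 2."""
    logp = 0.0
    prev_w, prev_z = bos, bos
    for w, z in zip(words, tags):
        logp += math.log(word_prob(w, prev_w, z))  # emission term, equation (2)
        logp += math.log(tag_prob(z, prev_z))      # transition term, equation (3)
        prev_w, prev_z = w, z
    return logp

# Toy probabilities (assumed constants, for illustration only).
toy_word_prob = lambda w, prev_w, z: 0.5
toy_tag_prob = lambda z, prev_z: 0.25

score = joint_log_prob(["この", "先", "生き"], [0, 1, 2], toy_word_prob, toy_tag_prob)
```

In a real model the two callables would be backed by the learned n-gram tables; working in log space avoids underflow for long sentences.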

FIG. 2 shows the graphical model of the morphological analysis model in the embodiment of the present invention, and FIG. 3 shows the graphical model of general part-of-speech estimation based on a hidden Markov model. As FIG. 3 indicates, a conventional unsupervised morphological analysis method must first perform word segmentation, so that words can be treated as observed values, and only then estimate the parts of speech. In the embodiment of the present invention, by contrast, morphological analysis means performing word segmentation and part-of-speech estimation together on a sentence $s$ given as a character string. The graphical model of FIG. 2 is therefore a hidden semi-Markov model in which the character string is the observed value and the word segmentation and parts of speech are hidden classes. In FIG. 3, the model repeats state transitions in which the current part of speech is determined probabilistically from the immediately preceding part of speech alone, and each word is determined probabilistically from the current part of speech. In the graphical model of FIG. 2, the current word follows an n-gram model (a 2-gram in FIG. 2) in which it is determined probabilistically from the current part of speech and the words up to n (two in FIG. 2) positions back, and the part of speech likewise follows an n-gram model in which it is determined probabilistically from the parts of speech up to n (two in FIG. 2) positions back. That is, the morphological analysis model generated and updated in the embodiment of the present invention is a hidden semi-Markov model in which a character string is the observed value and the word boundaries and parts of speech in the character string are hidden classes, whose emission probability is the per-part-of-speech word n-gram probability and whose transition probability is the part-of-speech n-gram probability.

Furthermore, in the embodiment of the present invention, a hierarchical PYP (Pitman-Yor process) is applied as the prior distribution of the per-part-of-speech word n-gram probability and as the prior distribution of the part-of-speech n-gram probability. The application of the hierarchical PYP to language models is described in Yee Whye Teh, "A Bayesian Interpretation of Interpolated Kneser-Ney", Technical Report TRA2/06, School of Computing, NUS, 2006, so only what is needed for the present invention is outlined here. For the emission probability, the embodiment assumes that a word n-gram is generated for each part of speech $z$, denotes the word n-gram distribution by $E^{N}_{z_i}$, and applies the hierarchical PYP to it. To smooth the model so that appropriate estimates are assigned to word n-grams that do not appear in the learning data, a word n-gram backs off to the next lower-order n-gram, and once the backoff reaches the word unigram (1-gram), it backs off to a character n-gram. Accordingly, the word n-gram distribution $E^{N}_{z_i}$ is drawn from a PYP whose base distribution is the word (N-1)-gram distribution $E^{N-1}_{z_i}$, and so on down to the word unigram distribution, whose base distribution is the character n-gram model.

Based on the above analysis, the embodiment of the present invention applies the NPYLM (Nested Pitman-Yor Language Model), a language model in which a word n-gram model and a character n-gram model are nested, to the per-part-of-speech word n-gram probability. Equation (2) can then be rewritten as

$$p(w_i \mid w_{i-N+1}^{i-1}, z_i) = p(w_i \mid h) = \frac{c(w_i \mid h) - d_{|h|}\, t_{h w_i}}{\theta_{|h|} + c(h)} + \frac{\theta_{|h|} + d_{|h|}\, t_{h\cdot}}{\theta_{|h|} + c(h)}\, p(w_i \mid h'). \qquad (4)$$

Here $c$ is a frequency and $h$ is the context, namely the word sequence $w_{i-N+1}^{i-1}$ under the part of speech $z_i$; $h'$ is the context one word shorter than $h$, $c(h) = \sum_{w} c(w \mid h)$, and $t_{h\cdot} = \sum_{w} t_{hw}$. The quantity $t_{h w_i}$ is the number of times the word $w_i$ is regarded as having been generated from the parent context in context $h$, and it is optimized from the data by the CRP (Chinese restaurant process) of the context. The quantities $\theta_{|h|}$ and $d_{|h|}$ in equation (4) are the hyperparameters of the PYP and control the strength of the smoothing and the discount, respectively.
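The backoff formula of equation (4) has the standard Pitman-Yor predictive form, which can be sketched as a single Python function. The argument names, the two-word toy vocabulary, and all numeric values below are assumptions chosen for illustration; in a full NPYLM the `parent_prob` argument would itself come from a recursive call on the shorter context, ending in the character n-gram model.

```python
def pitman_yor_prob(c_wh, c_h, t_wh, t_h, d, theta, parent_prob):
    """Predictive probability of a word in context h under a Pitman-Yor prior.

    c_wh: count of the word in context h     c_h: total count in context h
    t_wh: tables serving the word in h       t_h: total tables in context h
    d: discount, theta: strength             parent_prob: p(word | shorter context)
    Requires theta + c_h > 0.
    """
    denom = theta + c_h
    return (c_wh - d * t_wh) / denom + (theta + d * t_h) / denom * parent_prob

# Toy context with two observed words: (count, tables) per word, assumed values.
stats = {"香り": (3, 1), "度": (7, 3)}
parent = {"香り": 0.4, "度": 0.6}          # assumed backoff probabilities
c_h = sum(c for c, _ in stats.values())    # 10
t_h = sum(t for _, t in stats.values())    # 4
probs = {w: pitman_yor_prob(c, c_h, t, t_h, d=0.5, theta=1.0, parent_prob=parent[w])
         for w, (c, t) in stats.items()}
```

Because the discounted mass removed from the observed counts is exactly what is handed to the parent distribution, the probabilities sum to one whenever the parent probabilities do.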

In the embodiment of the present invention, the PYP is applied to the part-of-speech n-gram probability in the same way, but for the part-of-speech n-gram probability there is no need to consider the nested structure of word n-grams and character n-grams. The part-of-speech n-gram distribution is therefore simply drawn from a PYP whose base distribution is the part-of-speech (N-1)-gram distribution, which makes the part-of-speech n-gram model equivalent to the HPYLM (Hierarchical Pitman-Yor Language Model). Equation (3) can then be rewritten as

$$p(z_i \mid z_{i-N+1}^{i-1}) = p(z_i \mid h) = \frac{c(z_i \mid h) - e_{|h|}\, t_{h z_i}}{\eta_{|h|} + c(h)} + \frac{\eta_{|h|} + e_{|h|}\, t_{h\cdot}}{\eta_{|h|} + c(h)}\, p(z_i \mid h'), \qquad (5)$$

where $h$ here denotes the part-of-speech context $z_{i-N+1}^{i-1}$, $h'$ is the context one part of speech shorter, and $e_{|h|}$ and $\eta_{|h|}$ are hyperparameters.

The learning unit 11 updates, through learning, the parameters appearing in equation (4) and the parameters appearing in equation (5), and thereby updates the models of equations (4) and (5). In the following, the parameters of the word n-gram for each part of speech are denoted $\Theta_z$ ($z = 1, 2, \ldots, Z$), and the model of equation (4) is denoted NPYLM$_z$. The parameters of the part-of-speech n-gram are denoted $\Phi$, and the model of equation (5) is denoted HPYLM. The learning unit 11 also updates the hyperparameters.

FIG. 4 is a flowchart outlining the procedure for generating and updating the morphological analysis model in the morphological analysis model generation device 1 according to the embodiment of the present invention. The morphological analysis model generation device 1 first receives input of the learning data used for generating and updating the model, and stores the input learning data $S$ in the learning data storage unit 18 (step S10). The learning data $S$ consists of $M$ sentences $S_i$, that is, $S = \{S_1, S_2, \ldots, S_M\}$.

Next, in the morphological analysis model generation device 1, the sampling unit 12 performs an initial sampling of the word segmentation and part-of-speech assignment for each sentence $S_i$ in the learning data (step S11). Using the initial sampling result obtained in step S11, the parameter updating unit 13 sets the initial values of the parameters $\Theta_z$ ($z = 1, 2, \ldots, Z$) of NPYLM$_z$ and of the parameters $\Phi$ of HPYLM, and thereby generates the initial morphological analysis model (step S12).

Next, the sorting unit 14 samples one element $\sigma$ from the symmetric group $\Sigma_M$, that is, a random permutation of the $M$ sentences, and reorders the elements of the learning data $S$ accordingly (step S13). For the reordered sentences $S_\sigma = \{S_{\sigma(1)}, S_{\sigma(2)}, \ldots, S_{\sigma(M)}\}$, the parameter updating unit 13 takes one sentence at a time, removes the data of its segmentation sample $w(S_{\sigma(i)})$ and its part-of-speech sample $z(S_{\sigma(i)})$ from the current NPYLM$_z$ and HPYLM, and updates the parameters $\Theta_z$ ($z = 1, 2, \ldots, Z$) and $\Phi$ (step S14). The sampling unit 12 then samples a new word segmentation and part-of-speech assignment for the sentence $S_{\sigma(i)}$ using the models NPYLM$_z$ and HPYLM with the parameters updated in step S14 (step S15). The specific processing of step S15 is described later. Based on the sampling result obtained in step S15, the parameter updating unit 13 updates the parameters once more (step S16).

The parameter updating unit 13 then determines whether any sentence $S_{\sigma(i)}$ has not yet been used to update the parameters of NPYLM$_z$ and HPYLM (step S17). If such a sentence remains (Yes in step S17), the process returns to step S14.

When the parameters have been updated based on all the sentences $S_{\sigma(i)}$ (No in step S17), the hyperparameter determination unit 15 determines the hyperparameters $d$ and $\theta$ of the models NPYLM$_z$ and HPYLM based on a beta distribution and a gamma distribution, respectively (step S18). The estimation of the hyperparameters is described in Yee Whye Teh, "A Bayesian Interpretation of Interpolated Kneser-Ney", Technical Report TRA2/06, School of Computing, NUS, 2006. The above processing by the sorting unit 14, the sampling unit 12, and the parameter updating unit 13 is repeated until a predetermined convergence condition is satisfied (Yes in step S19). Various conditions may be set as the predetermined convergence condition, for example that the processing from step S13 to step S17 has been performed at least a predetermined number of times, or that the perplexity has reached a predetermined value or a value within a predetermined range. Through the above processing, the morphological analysis model generation device 1 of the present embodiment generates the models NPYLM$_z$ and HPYLM while updating their parameters by unsupervised learning.
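The remove, resample, and add-back cycle of steps S13 to S17 can be sketched as follows. This is a heavily simplified illustration under stated assumptions: the "model" is a plain `Counter` of (word, tag) pairs rather than the seating arrangements of NPYLM and HPYLM, the sampler `split_chars` is a hypothetical stand-in for step S15, hyperparameter resampling (step S18) is omitted, and a fixed number of sweeps stands in for the convergence condition of step S19.

```python
import random
from collections import Counter

def gibbs_epoch(sentences, analyses, counts, resample, rng):
    """One sweep over the data (steps S13 to S17), with plain counts as the model."""
    order = list(range(len(sentences)))
    rng.shuffle(order)                                  # S13: random permutation
    for i in order:
        counts.subtract(analyses[i])                    # S14: remove the old analysis
        analyses[i] = resample(sentences[i], counts)    # S15: draw a new analysis
        counts.update(analyses[i])                      # S16: add the new analysis
    return analyses

def split_chars(sentence, counts):
    """Hypothetical stand-in sampler: every character becomes a word with tag 0."""
    return [(ch, 0) for ch in sentence]

sentences = ["この先", "香り"]
analyses = [[(s, 0)] for s in sentences]                 # S11: crude initial analysis
counts = Counter(p for a in analyses for p in a)         # S12: initial model
rng = random.Random(0)
for _ in range(3):                                       # S19: fixed number of sweeps
    gibbs_epoch(sentences, analyses, counts, split_chars, rng)
```

The point of the sketch is the invariant: after every sentence update the counts describe exactly the current analyses of all sentences, which is what makes each resampling step a valid Gibbs update.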

FIG. 5 is an operation flowchart showing the flow of processing in step S15 of FIG. 4. The sampling unit 12 first generates a word lattice for the sentence S (step S20). A word lattice is the set of paths that exhaustively connect adjacent word candidates from the beginning of the sentence to its end; choosing one path yields one segmentation. In the present embodiment, the segmentation and the part-of-speech assignment are sampled simultaneously, so the lattice generated in step S20 also takes parts of speech into account. For example, if the sentence S is 「１杯で３度の香り」 (roughly, "three times the aroma in one cup") and the maximum word-candidate length is two characters, the lattice generated in step S20 associates every part of speech (part-of-speech id) with each word candidate, as shown in FIG. 6.
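The lattice described above can be sketched as a mapping from each node to its possible predecessors. This is an illustrative sketch only: the function names `build_lattice` and `word`, the node encoding (end position t, length k, part-of-speech id z), and the choice of three parts of speech in the usage note are assumptions made for the example.

```python
from itertools import product


def build_lattice(sentence, max_len, num_tags):
    """Map each node (t, k, z) to the list of nodes (t - k, j, r) that can precede it.

    t: end position of the word candidate (1-based), k: its length,
    z: its part-of-speech id. Nodes that start the sentence have no predecessors.
    """
    lattice = {}
    for t in range(1, len(sentence) + 1):
        for k in range(1, min(max_len, t) + 1):
            prev_end = t - k
            prev_lengths = range(1, min(max_len, prev_end) + 1)
            for z in range(num_tags):
                lattice[(t, k, z)] = [
                    (prev_end, j, r)
                    for j, r in product(prev_lengths, range(num_tags))
                ]
    return lattice


def word(sentence, t, k):
    """Surface string of the word candidate that ends at t and has length k."""
    return sentence[t - k:t]
```

With the example sentence 「１杯で３度の香り」, a maximum length of 2, and three parts of speech, the predecessors of the node for 「香り」 (t = 8, k = 2) are the nodes ending at position 6, whose surface strings are 「の」 and 「度の」.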

Next, the sampling unit 12 calculates the forward probability of each word candidate in the lattice generated in step S20, proceeding sequentially from the beginning of the sentence (step S21). The forward probability of a word candidate is the sum of the probabilities of all paths that reach that word. In the present embodiment, word segmentation and part-of-speech assignment are sampled together, so the joint probability of a word and its part of speech is required. The forward probability calculated in step S21 therefore also takes the part of speech into account and is calculated according to the following equation.

Here, z is the current part of speech, r is the previous part of speech, t is the end position of the current word candidate, k is the length of the current word candidate, and j is the length of the previous word candidate to be connected. In FIG. 6, taking the word 「香り」 as an example, its length (number of characters) is 2, so the nodes 「度の」 and 「の」, which end at position 6 (the end position 8 of the word minus its length 2), can be connected as the preceding word candidates. The forward probability of the current word 「香り」 is therefore calculated from the forward probabilities of the nodes, one per part of speech, whose word is 「度の」 or 「の」, together with the word n-gram probability and the part-of-speech n-gram probability from each such node to the current word.
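The equation for the forward probability does not appear in this text. One form consistent with the symbol definitions above, given here as an assumption for the bigram case rather than as the patent's own formula, is shown below. The notation α[t][k][z] for the forward probability, c_a^b for the characters from position a to position b, L for the maximum word length, and K for the number of parts of speech is introduced only for this sketch.

```latex
\alpha[t][k][z] \;=\; \sum_{j=1}^{L} \sum_{r=1}^{K}
  p\!\left(c_{t-k+1}^{\,t} \,\middle|\, c_{t-k-j+1}^{\,t-k},\, z\right)\,
  p(z \mid r)\,
  \alpha[t-k][j][r]
```

The first factor plays the role of the word n-gram probability for part of speech z, the second the part-of-speech n-gram probability, and the third the forward probability of the preceding node, matching the three quantities named in the 「香り」 example.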

Then, based on the forward probability of the current word candidate, the position of the previous word boundary and its part of speech are selected at random according to the following probability (step S22).

Step S22 is repeated in this way until the beginning of the sentence is reached (Yes in step S23), successively selecting the end position and part of speech of each word candidate. When the beginning of the sentence is reached, the word-boundary positions and parts of speech selected so far are combined and output to the parameter updating unit 13 as the sampling result for the segmentation of the sentence S and its part-of-speech assignment (step S24).
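Steps S21 to S24 can be sketched as forward filtering followed by backward sampling. The sketch below is illustrative and rests on several assumptions: a bigram model, probability callables `p_word(word, prev_word, z)` and `p_tag(z, r)` supplied by the caller (hypothetical stand-ins for NPYLMz and HPYLM, with `None` marking the sentence start), a selection weight proportional to `p_word * p_tag * alpha` of the preceding node, and an end-of-sentence transition taken as 1.

```python
import random


def forward(sentence, max_len, num_tags, p_word, p_tag):
    """Forward probabilities alpha[(t, k, z)], computed from the sentence start."""
    alpha = {}
    for t in range(1, len(sentence) + 1):
        for k in range(1, min(max_len, t) + 1):
            w = sentence[t - k:t]
            for z in range(num_tags):
                if t == k:  # the word starts the sentence
                    alpha[t, k, z] = p_word(w, None, z) * p_tag(z, None)
                else:  # sum over preceding word lengths j and tags r
                    alpha[t, k, z] = sum(
                        p_word(w, sentence[t - k - j:t - k], z)
                        * p_tag(z, r) * alpha[t - k, j, r]
                        for j in range(1, min(max_len, t - k) + 1)
                        for r in range(num_tags)
                    )
    return alpha


def backward_sample(sentence, max_len, num_tags, p_word, p_tag, alpha, rng):
    """Draw one segmentation and tag sequence, walking back from the sentence end."""
    n = len(sentence)
    # Final word: weight by alpha alone (end-of-sentence transition assumed to be 1).
    cands = [(k, z) for k in range(1, min(max_len, n) + 1) for z in range(num_tags)]
    k, z = rng.choices(cands, [alpha[n, k, z] for k, z in cands])[0]
    t = n
    words, tags = [sentence[t - k:t]], [z]
    while t - k > 0:  # stop once the current word starts the sentence
        w, prev_end = sentence[t - k:t], t - k
        cands = [(j, r) for j in range(1, min(max_len, prev_end) + 1)
                 for r in range(num_tags)]
        weights = [p_word(w, sentence[prev_end - j:prev_end], z) * p_tag(z, r)
                   * alpha[prev_end, j, r] for j, r in cands]
        j, r = rng.choices(cands, weights)[0]  # step S22: random choice of boundary and tag
        t, k, z = prev_end, j, r
        words.append(sentence[t - k:t])
        tags.append(z)
    return words[::-1], tags[::-1]
```

In use, `alpha = forward(s, 2, K, p_word, p_tag)` is computed once per sentence and `backward_sample(s, 2, K, p_word, p_tag, alpha, random.Random(0))` then yields one (words, tags) pair to pass on to the parameter update.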

As described above, the morphological analysis model generation device 1 of the present embodiment learns a morphological analysis model that is based on a hidden semi-Markov model and applies NPYLM to the prior distribution of the word n-gram probability for each part of speech and HPYLM to the prior distribution of the part-of-speech n-gram probability. Specifically, this learning is realized by repeating, until a predetermined convergence condition is satisfied, a process that updates the parameters of the word n-gram probability for each part of speech and of the part-of-speech n-gram probability while sampling the segmentation of each sentence and the part-of-speech assignment for each word candidate constituting that segmentation. The morphological analysis model generation device 1 of the present invention can therefore generate, by unsupervised learning, a morphological analysis model for performing segmentation and part-of-speech estimation simultaneously and accurately.

The model generated by the morphological analysis model generation device 1 of the embodiment of the present invention was evaluated on a Japanese corpus containing 10,000 training sentences and 1,000 test sentences. Compared with the conventional approach of segmenting with NPYLM and then estimating parts of speech with a BHMM (Bayesian HMM), the model was superior in segmentation precision, segmentation recall, and part-of-speech estimation accuracy, as shown in the table below; the improvement was particularly marked in segmentation precision and part-of-speech estimation accuracy.
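The text does not state how segmentation precision and recall were computed. A common convention, assumed here purely for illustration, is to count a predicted word as correct when its character span exactly matches a word in the reference segmentation. The function names `spans` and `segmentation_precision_recall` and the example segmentations are hypothetical.

```python
def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def segmentation_precision_recall(predicted, gold):
    """Word-level precision and recall based on exact span matches."""
    pred, ref = spans(predicted), spans(gold)
    correct = len(pred & ref)
    return correct / len(pred), correct / len(ref)
```

For instance, if the predicted segmentation is ["１杯", "で", "３度", "の", "香り"] and the reference is ["１", "杯", "で", "３", "度", "の", "香り"], three spans match, giving a precision of 3/5 and a recall of 3/7.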

In the above embodiment, NPYLM and HPYLM are applied to the prior distributions of the word n-gram probability for each part of speech and the part-of-speech n-gram probability, respectively; however, HPYLM may instead be applied as the prior distribution of the word n-gram. Moreover, such a hierarchical PYP need not be applied to these prior distributions; another nonparametric Bayesian model, such as a Bernoulli model or a Dirichlet process, may be applied instead.

In the above embodiment, steps S15 and S16 of FIG. 4 remove the data of the segmentation sampling result w(Sσ(i)) and the part-of-speech sampling result z(Sσ(i)) of Sσ(i) from the current NPYLMz and HPYLM and add the new sampling result one sentence at a time; however, the removal of data and the addition of sampling results may be performed for a plurality of sentences at once. This is because the Gibbs sampling method used in the above embodiment is known to be parallelizable: when the parameters Θz and Φ are fixed, the morphological analysis results w(i) for the sentences s(i) can be regarded as mutually independent. Parallelizing the data removal and the addition of sampling results in this way allows the model to be trained faster.

The present invention can also be applied to semi-supervised learning. Specifically, sampling is performed using learning data that partly consists of labeled data, and the labeled data are fixed so that the correct analysis is always sampled for them.

The present invention has the effect of being able to generate, by unsupervised learning, a morphological analysis model that performs not only segmentation but also part-of-speech estimation, and is useful as a morphological analysis model generation device and the like.

1 Morphological analysis model generation device
10 Input unit
11 Learning unit
12 Sampling unit
13 Parameter updating unit
14 Sorting unit
15 Hyperparameter determination unit
16 Output unit
17 Storage unit
18 Learning data storage unit
19 Model storage unit

Claims

A morphological analysis model generation device that generates, by unsupervised learning, a morphological analysis model for segmenting a sentence to be analyzed and estimating parts of speech,
wherein the morphological analysis model is a nonparametric Bayesian model that, in a hidden semi-Markov model in which a character string is the observed value and the word boundaries and parts of speech in the character string are the hidden classes, applies a stochastic process to the prior distribution of the word n-gram probability for each part of speech, which is the occurrence probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability,
the device comprising:
a learning data storage unit that stores a plurality of sentences as learning data; and
a learning unit that repeatedly performs, until a predetermined convergence condition is satisfied, a process of updating the parameters of the word n-gram probability for each part of speech and of the part-of-speech n-gram probability in the morphological analysis model while sampling the segmentation of each sentence stored in the learning data storage unit and the assignment of a part of speech to each word candidate constituting the segmentation.

The morphological analysis model generation device, wherein the learning unit generates, for each sentence, a lattice that includes the part of speech as a parameter, and performs the sampling while selecting the previous word candidate and the part of speech corresponding to that word candidate according to a forward probability expressed in terms of the following quantities:

t: end position of the current word candidate
k: length of the current word candidate
z: part of speech of the current word candidate
r: part of speech of the previous word candidate to be connected
j: length of the previous word candidate to be connected

The morphological analysis model generation device according to claim 1, wherein the word n-gram probability for each part of speech is calculated based on NPYLM (Nested Pitman-Yor Language Model) and the part-of-speech n-gram probability is calculated based on HPYLM (Hierarchical Pitman-Yor Language Model).

A morphological analysis model generation method for generating, by unsupervised learning, a morphological analysis model for segmenting a sentence to be analyzed and estimating parts of speech,
wherein the morphological analysis model is a nonparametric Bayesian model that, in a hidden semi-Markov model in which a character string is the observed value and the word boundaries and parts of speech in the character string are the hidden classes, applies a stochastic process to the prior distribution of the word n-gram probability for each part of speech, which is the occurrence probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability,
the method comprising:
a step of receiving a plurality of sentences as learning data and storing the plurality of sentences in a learning data storage unit; and
a step of repeatedly performing, until a predetermined convergence condition is satisfied, a process of updating the parameters of the word n-gram probability for each part of speech and of the part-of-speech n-gram probability in the morphological analysis model while sampling the segmentation of each sentence stored in the learning data storage unit and the assignment of a part of speech to each word candidate constituting the segmentation.

A program for generating, by unsupervised learning, a morphological analysis model for segmenting a sentence to be analyzed and estimating parts of speech,
wherein the morphological analysis model is a nonparametric Bayesian model that, in a hidden semi-Markov model in which a character string is the observed value and the word boundaries and parts of speech in the character string are the hidden classes, applies a stochastic process to the prior distribution of the word n-gram probability for each part of speech, which is the occurrence probability, and to the prior distribution of the part-of-speech n-gram probability, which is the transition probability,
the program causing a computer to execute:
a step of receiving a plurality of sentences as learning data and storing the plurality of sentences in a learning data storage unit; and
a step of repeatedly performing, until a predetermined convergence condition is satisfied, a process of updating the parameters of the word n-gram probability for each part of speech and of the part-of-speech n-gram probability in the morphological analysis model while sampling the segmentation of each sentence stored in the learning data storage unit and the assignment of a part of speech to each word candidate constituting the segmentation.