JP5704732B2

JP5704732B2 - Particle omission complementation program, apparatus, server, and method for target text

Info

Publication number: JP5704732B2
Application number: JP2014010827A
Authority: JP
Inventors: 池田　和史; 服部　元; 松本　一則; 小野　智弘
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-01-23
Filing date: 2014-01-23
Publication date: 2015-04-22
Anticipated expiration: 2030-08-26
Also published as: JP2014067458A

Description

The present invention relates to a technique for analyzing the content of text information posted on websites published on the Internet.

With the spread of the Internet, a wide variety of text is posted on websites that publish blogs, bulletin boards, or word-of-mouth comments. A "blog" (weblog) is a site, generally run by an individual, that can be updated like a diary so that the author can express opinions on current news and specialized topics. A "bulletin board" is a site on which users exchange discussion with others, post by post, on various themes. A "word-of-mouth comment" site is one on which users can write comments about the reputation of things, much like rumors passed between people. The spread of these sites has made it possible for ordinary users to publish information freely on the Internet.

Analyzing the content of this text information makes it possible to collect the opinions of a large, unspecified population of general users. The analysis can be applied, for example, to reputation analysis of products and services and to information retrieval for filtering illegal or harmful information. Analyzing such text requires techniques that obtain the relationships between morphemes by dependency parsing or case analysis. However, dependency parsing and case analysis may fail on text that contains particle omission or inversion.

As prior art, there is a technique that statistically analyzes tendencies peculiar to written language, such as particle omission and inversion, and thereby improves the dependency parsing accuracy of text containing particle omission or inversion (see, for example, Non-Patent Document 1). This technique makes its determination using heuristic rules such as "a noun with an omitted particle is highly likely to depend on the predicate immediately following it".

There is also a technique that improves dependency parsing accuracy using a model based on the maximum entropy (ME) method (see, for example, Non-Patent Document 2). In addition, there is a technique that improves dependency parsing accuracy by splitting a colloquial document into sentences and further classifying them into finer units called "clauses" (see, for example, Non-Patent Document 3).

Non-Patent Document 1: 山本幹雄, 小林聡, 中川聖一 (Mikio Yamamoto, Satoshi Kobayashi, Seiichi Nakagawa), "Analysis of particle omission and inversion in spoken dialogue sentences and a parsing method", IPSJ Journal, vol. 33, no. 11, pp. 1322-1330, 1992, [online], [retrieved August 24, 2010], Internet <URL: http://ci.nii.ac.jp/naid/110002723414>
Non-Patent Document 2: 内元清貴, 関根聡, 井佐原均 (Kiyotaka Uchimoto, Satoshi Sekine, Hitoshi Isahara), "Japanese dependency analysis using a model based on the maximum entropy method", IPSJ Journal, 2001, [online], [retrieved August 24, 2010], Internet <URL: http://ci.nii.ac.jp/naid/110002725061>
Non-Patent Document 3: 大野誠寛, 松原茂樹, 柏岡秀紀, 稲垣康善 (Tomohiro Ohno, Shigeki Matsubara, Hideki Kashioka, Yasuyoshi Inagaki), "Dependency analysis of monologue sentences based on detection of clause starting points", IPSJ Journal, 2009, [online], [retrieved August 24, 2010], Internet <URL: http://ci.nii.ac.jp/naid/110006549568>

However, text information written by a large, unspecified population of general users, such as on electronic bulletin boards and blogs on the Internet, contains many colloquial expressions. As a result, the accuracy of dependency parsing and case analysis decreases. Much of this loss of accuracy is caused by particle-omitted expressions such as 「ラーメン食べた」 (for 「ラーメンを食べた」, "I ate ramen") and 「足速いね」 (for 「足が速いね」, "You run fast, don't you").

According to the technique described in Non-Patent Document 1, a noun with an omitted particle is judged, with high probability, to depend on the verb immediately following it. In colloquial text, however, many sentences of the form "noun + verb" involve no particle omission at all. Consequently, for text information containing colloquial expressions written by unspecified general users, the analysis accuracy still decreases.

The technique described in Non-Patent Document 2 requires training documents annotated with dependency parsing results, and therefore requires manual labeling. Furthermore, the technique described in Non-Patent Document 3 is considered to have little effect on improving the analysis accuracy of documents that contain particle omission.

Accordingly, an object of the present invention is to provide a particle omission complementation program, apparatus, server, and method that can improve the analysis accuracy of target text information by detecting whether a particle has been omitted from the target text information and complementing the omitted particle.

According to the present invention, there is provided a particle complementation program that causes a computer to function so as to complement a particle in target text information containing a particle-omitted expression, the program causing the computer to function as:
reference text storage means that stores reference text information containing no particle-omitted expressions;
particle-omitted text generation means that generates particle-omitted text information by deleting particles from the reference text information;
identification engine means that constructs a two-class pattern classifier using the particle-omitted text information as positive example data and the reference text information as negative example data, uses the pattern classifier to identify whether input target text information is a particle-omitted expression, and extracts a plurality of candidate particle-containing expressions corresponding to the particle-omitted expression;
appearance frequency counting means that counts, for each candidate particle-containing expression, its appearance frequency in the reference text information stored in the reference text storage means; and
particle omission complementing means that complements the target text information with the particle of the particle-containing expression having the highest appearance frequency.

According to another embodiment of the particle complementation program of the present invention, it is also preferable that the computer is further caused to function such that the identification engine means is based on a support vector machine or on a rule base.

According to another embodiment of the particle complementation program of the present invention, it is also preferable that the computer is further caused to function such that:
the reference text information is text information that is published for official use and written by specific, trusted users, and
the target text information is text information that is published for private use and written by a large, unspecified population of users.

According to the present invention, there is provided a text analysis apparatus that complements a particle in target text information containing a particle-omitted expression, the apparatus comprising:
reference text storage means that stores reference text information containing no particle-omitted expressions;
particle-omitted text generation means that generates particle-omitted text information by deleting particles from the reference text information;
identification engine means that constructs a two-class pattern classifier using the particle-omitted text information as positive example data and the reference text information as negative example data, uses the pattern classifier to identify whether input target text information is a particle-omitted expression, and extracts a plurality of candidate particle-containing expressions corresponding to the particle-omitted expression;
appearance frequency counting means that counts, for each candidate particle-containing expression, its appearance frequency in the reference text information stored in the reference text storage means; and
particle omission complementing means that complements the target text information with the particle of the particle-containing expression having the highest appearance frequency.

According to the present invention, there is provided a text analysis server that acquires target text information containing a particle-omitted expression from another public server via a network and complements the particle, the server comprising:
reference text storage means that stores reference text information containing no particle-omitted expressions;
particle-omitted text generation means that generates particle-omitted text information by deleting particles from the reference text information;
identification engine means that constructs a two-class pattern classifier using the particle-omitted text information as positive example data and the reference text information as negative example data, uses the pattern classifier to identify whether input target text information is a particle-omitted expression, and extracts a plurality of candidate particle-containing expressions corresponding to the particle-omitted expression;
appearance frequency counting means that counts, for each candidate particle-containing expression, its appearance frequency in the reference text information stored in the reference text storage means; and
particle omission complementing means that complements the target text information with the particle of the particle-containing expression having the highest appearance frequency.

According to the present invention, there is provided a particle omission complementation method that uses an apparatus equipped with a computer to complement a particle in target text information containing a particle-omitted expression, the apparatus having a reference text storage unit that stores reference text information containing no particle-omitted expressions, the method comprising:
a first step of generating particle-omitted text information by deleting particles from the reference text information;
a second step of constructing a two-class pattern classifier using the particle-omitted text information as positive example data and the reference text information as negative example data;
a third step of using the pattern classifier to identify whether input target text information is a particle-omitted expression and extracting a plurality of candidate particle-containing expressions corresponding to the particle-omitted expression;
a fourth step of counting, for each candidate particle-containing expression, its appearance frequency in the reference text information stored in the reference text storage unit; and
a fifth step of complementing the target text information with the particle of the particle-containing expression having the highest appearance frequency.

According to the particle omission complementation program, apparatus, server, and method of the present invention, the analysis accuracy of target text information can be improved by detecting whether a particle has been omitted from the target text information and complementing the omitted particle. In particular, because the present invention can use existing newspaper documents alone as the target text information, it improves analysis accuracy while remaining highly versatile.

FIG. 1 is a functional configuration diagram of the particle omission complementation program according to the present invention.
FIG. 2 is a part-of-speech system diagram showing parts of speech and their subcategories.
FIG. 3 is an explanatory diagram of detecting and complementing particle omission from text information.
FIG. 4 is an explanatory diagram of detecting and complementing particle omission from a part-of-speech sequence.
FIG. 5 is a system configuration diagram including the text analysis server according to the present invention.
FIG. 6 is a sequence diagram according to the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a functional configuration diagram of the particle omission complementation program according to the present invention.

The particle complementation program of the present invention causes a computer to function so as to complement a particle in target text information containing a particle-omitted expression. The program is divided into a "particle omission detection function" and a "particle complementation function".

As shown in FIG. 1, the particle omission complementation program 1 has a reference text storage unit 10 and, as the "particle omission detection function", a particle-omitted text generation unit 11 and an identification engine unit 12.

The reference text storage unit 10 stores reference text information containing no particle-omitted expressions. "Reference text information" is text information that is published for official use and written by specific, trusted users. It is preferably a collection of text, such as newspaper articles, that contains few particle-omitted expressions and can be morphologically analyzed with high accuracy.

The particle-omitted text generation unit 11 generates particle-omitted text information by deleting particles from the reference text information. It splits the reference text information into morphemes and deletes every morpheme whose part of speech is "particle". A "morpheme" is the smallest meaningful unit among the constituents of a sentence. This step requires a dictionary in which a part of speech is registered for each word.
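The deletion step can be sketched as follows. This is a minimal illustration that assumes the text has already been split into (surface, part-of-speech) pairs by some morphological analyzer; the tag names, the `Token` alias, and the functions `drop_particles` and `surface_text` are hypothetical, and the verb phrase is kept as a single token for simplicity.

```python
# Hypothetical pre-analyzed input: each morpheme is a (surface, part_of_speech) pair.
# In practice these pairs would come from a morphological analyzer and its dictionary.
Token = tuple[str, str]


def drop_particles(tokens: list[Token]) -> list[Token]:
    """Return a copy of the token list without morphemes tagged as particles."""
    return [(surface, pos) for surface, pos in tokens if not pos.startswith("particle")]


def surface_text(tokens: list[Token]) -> str:
    """Join the surface forms back into a sentence (Japanese uses no spaces)."""
    return "".join(surface for surface, _ in tokens)


# Reference sentence from the description, with an assumed, simplified tokenization.
reference = [
    ("ラーメン", "noun/general"),
    ("を", "particle/case"),
    ("食べていた", "verb/independent"),
]

positive_example = drop_particles(reference)
print(surface_text(reference))         # original reference sentence
print(surface_text(positive_example))  # same sentence with the particle removed
```

Filtering on the tag rather than on a fixed list of surface strings keeps the sketch aligned with the description, which deletes morphemes by part of speech.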

The identification engine unit 12 generates a learning database using the particle-omitted text information as positive example data and the reference text information as negative example data. The identification engine unit 12 may be based on a support vector machine or on a rule base (for example, C4.5).

A support vector machine identification engine does not generate explicit rules and appears from the outside as a black box. The positive example data and negative example data are represented as support vectors. A "support vector machine" is a classification algorithm that uses supervised learning and is applied to pattern recognition. It constructs a two-class (positive/negative) pattern classifier from linear input elements and learns the parameters of those elements.
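To make the idea of a two-class linear classifier concrete, the sketch below trains a linear model on part-of-speech bigram features with hinge-loss updates. It is not a full support vector machine: regularization and kernels are left out, the feature design is an assumption, and all names (`bigram_features`, `score`, `train_linear`) are hypothetical. A real system would more likely rely on an established SVM library.

```python
from collections import defaultdict


def bigram_features(pos_sequence):
    """Represent a part-of-speech sequence as the set of its adjacent tag pairs."""
    return {(a, b) for a, b in zip(pos_sequence, pos_sequence[1:])}


def score(weights, features):
    """Linear decision value: positive means 'particle omitted'."""
    return sum(weights[f] for f in features)


def train_linear(examples, epochs=10, step=0.5):
    """Hinge-loss updates: adjust weights whenever the margin y * score is below 1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for pos_sequence, label in examples:  # label is +1 or -1
            features = bigram_features(pos_sequence)
            if label * score(weights, features) < 1:
                for f in features:
                    weights[f] += step * label
    return weights


# Assumed toy training data: +1 = particle omitted, -1 = reference text with particles.
training = [
    (("noun", "verb"), +1),
    (("noun", "adjective", "verb"), +1),
    (("noun", "particle", "verb"), -1),
    (("noun", "particle", "adjective", "verb"), -1),
]

weights = train_linear(training)
omitted = score(weights, bigram_features(("noun", "verb"))) > 0
complete = score(weights, bigram_features(("noun", "particle", "verb"))) > 0
print(omitted, complete)
```

The learned weights are the "parameters of the linear input elements" in this simplified setting; the decision is the sign of their weighted sum.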

A rule-based identification engine, by contrast, generates explicit rules from the positive and negative example data. "C4.5" is an algorithm that generates a decision tree for classification and is a statistical classifier. It uses the concept of information entropy to build a decision tree from a set of positive and negative example data.
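The role of information entropy in such decision-tree learning can be illustrated with the information-gain calculation below. It shows only the attribute-scoring idea; C4.5 itself additionally normalizes the gain (gain ratio) and prunes the tree, which this sketch omits. The feature names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by splitting (features, label) examples on one feature."""
    labels = [label for _, label in examples]
    groups = {}
    for features, label in examples:
        groups.setdefault(features[feature], []).append(label)
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Assumed toy data: "omitted" = positive example, "complete" = negative example.
examples = [
    ({"noun_directly_before_predicate": True, "ends_with_verb": True}, "omitted"),
    ({"noun_directly_before_predicate": True, "ends_with_verb": False}, "omitted"),
    ({"noun_directly_before_predicate": False, "ends_with_verb": True}, "complete"),
    ({"noun_directly_before_predicate": False, "ends_with_verb": False}, "complete"),
]

for name in ("noun_directly_before_predicate", "ends_with_verb"):
    print(name, information_gain(examples, name))
```

A tree learner would split first on the feature with the larger gain, which is how an explicit, human-readable rule emerges from the positive and negative examples.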

After generating the learning database, the identification engine unit 12 receives target text information containing particle-omitted expressions as input. "Target text information" is text information that is published for private use and written by a large, unspecified population of users. Examples include text posted on electronic bulletin boards and blogs on the Internet, which contains many colloquial expressions.

Using the learning database, the identification engine unit 12 identifies the location of the particle omission in the input target text information. At this point, one or more candidate particle-containing expressions corresponding to the particle-omitted expression are extracted. The extraction is not limited to a single candidate; a plurality of candidates may be extracted.

The "particle omission detection function" of the particle omission complementation program in FIG. 1 may further include a part-of-speech extraction unit 15. Instead of learning and classifying the text information (word sequence) itself, the identification engine unit 12 can learn and classify only its part-of-speech sequence, which improves learning effectiveness. Specifically, both the required storage capacity and the amount of computation can be expected to decrease.

FIG. 2 is an explanatory diagram showing the part-of-speech system used by the part-of-speech extraction unit.

The part-of-speech extraction unit 15 splits text information into morphemes and associates each morpheme with an entry in the part-of-speech system. That is, it converts the morpheme sequence of the text information into a part-of-speech sequence. A "part-of-speech sequence" is a sequence of multiple parts of speech. As shown in FIG. 2, each "part of speech" is represented by the part of speech itself and one or more part-of-speech subcategories.

As shown in FIG. 1, the particle-omitted text information output from the particle-omitted text generation unit 11 as positive example data is converted into a particle-omitted part-of-speech sequence by the part-of-speech extraction unit 15. This particle-omitted part-of-speech sequence is then input to the identification engine unit 12 as positive example data. Likewise, the reference text information output from the reference text storage unit 10 as negative example data is converted into a particle-containing part-of-speech sequence by the part-of-speech extraction unit 15, and this sequence is input to the identification engine unit 12 as negative example data. The identification engine unit 12 can thereby generate a learning database based on part-of-speech sequences.
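The flow of positive and negative part-of-speech sequences into the learning data might be sketched like this. The token format, the English tag names, and the helper names `pos_sequence` and `build_training_examples` are assumptions for illustration; the description does not prescribe a data layout.

```python
PARTICLE = "particle"  # assumed top-level tag for particles


def pos_sequence(tokens):
    """Map (surface, part_of_speech) pairs to the tuple of their part-of-speech tags."""
    return tuple(pos for _, pos in tokens)


def build_training_examples(reference_sentences):
    """Return (part-of-speech sequence, label) pairs: 1 = particle omitted, 0 = reference."""
    examples = []
    for tokens in reference_sentences:
        without_particles = [t for t in tokens if not t[1].startswith(PARTICLE)]
        examples.append((pos_sequence(without_particles), 1))  # positive example
        examples.append((pos_sequence(tokens), 0))              # negative example
    return examples


# The two reference sentences from the description, with an assumed tokenization.
reference_sentences = [
    [("ラーメン", "noun/general"), ("を", "particle/case"), ("食べていた", "verb/independent")],
    [("足", "noun/general"), ("が", "particle/case"),
     ("速くて", "adjective/independent"), ("追いつかない", "verb/independent")],
]

examples = build_training_examples(reference_sentences)
for sequence, label in examples:
    print(label, sequence)
```

Because every reference sentence yields one positive and one negative example automatically, no manual labeling is needed to assemble the training data.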

Similarly, the part-of-speech extraction unit 15 also converts the target text information to be input to the identification engine unit 12 into a part-of-speech sequence, and the target text information in part-of-speech form is input to the identification engine unit 12. The identification engine unit 12 can thereby use the learning database based on part-of-speech sequences to extract candidate particle-containing part-of-speech sequences.

As shown in FIG. 1, the program further has an appearance frequency counting unit 13 and a particle omission complementing unit 14 as the "particle complementation function".

The appearance frequency counting unit 13 counts, for each candidate particle-containing expression, its appearance frequency in the reference text information stored in the reference text storage unit 10. When the identification engine is based on part-of-speech sequences, it counts, for each candidate particle-containing part-of-speech sequence, its appearance frequency in the reference text information stored in the reference text storage unit 10.

The particle omission complementing unit 14 complements the target text information with the particle of the particle-containing expression having the highest appearance frequency. When the identification engine is based on part-of-speech sequences, it complements the target text information with the particle of the particle-containing part-of-speech sequence having the highest appearance frequency.
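The counting performed for each candidate could look roughly like the following. The miniature corpus, its tokenization with verbs in dictionary form, and the function name `count_occurrences` are assumptions; the description does not specify how occurrences are matched.

```python
def count_occurrences(corpus, expression):
    """Count how often `expression` occurs as a contiguous run of tokens in the corpus."""
    n = len(expression)
    total = 0
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            if tuple(sentence[i:i + n]) == tuple(expression):
                total += 1
    return total


# Assumed miniature reference corpus, already tokenized.
corpus = [
    ["ラーメン", "を", "食べる"],
    ["寿司", "を", "食べる"],
    ["ラーメン", "が", "好き"],
]

candidates = [
    ("ラーメン", "を", "食べる"),
    ("ラーメン", "を"),
    ("を", "食べる"),
    ("ラーメン", "が", "食べる"),
    ("ラーメン", "が"),
    ("が", "食べる"),
]

frequencies = {c: count_occurrences(corpus, c) for c in candidates}
for expression, frequency in frequencies.items():
    print("＋".join(expression), frequency)
```

The same function works unchanged on part-of-speech sequences if each sentence is given as a list of tags instead of surface forms.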

FIG. 3 is an explanatory diagram showing the complementation of particle-omitted text based on text information.

(S31) Assume that the following two pieces of reference text information are stored in the reference text storage unit 10.
「ラーメンを食べていた」 ("I was eating ramen")
「足が速くて追いつかない」 ("They run so fast that I cannot catch up")
This reference text information is output to the particle-omitted text generation unit 11 as the source of the positive example data, and to the identification engine unit 12 as negative example data.

(S32) The particle-omitted text generation unit 11 generates particle-omitted text information by deleting the particles from the reference text information.
「ラーメン（を）食べていた」 -> 「ラーメン食べていた」
「足（が）速くて追いつかない」 -> 「足速くて追いつかない」
The generated particle-omitted text information is output to the identification engine unit 12 as positive example data.

(S33) The identification engine unit 12 generates a learning database from the particle-omitted text information serving as positive example data and the reference text information serving as negative example data.

(S34) Assume that the following target text information is input to the identification engine unit 12.
「ラーメン食べちゃった」 (colloquial "I ate ramen", with the particle を omitted)

(S35) The identification engine unit 12 outputs candidate particle-containing expressions. Each particle-containing expression preferably matches one of the following patterns, for example.
(Pattern 1) "noun" + "particle to be complemented" + "verb"
(Pattern 2) "noun" + "particle to be complemented"
(Pattern 3) "particle to be complemented" + "verb"
For example, for 「ラーメン食べちゃった」, the following particle-containing expressions are output to the appearance frequency counting unit 13.
「ラーメン＋を＋食べる」
「ラーメン＋を」
「を＋食べる」
「ラーメン＋が＋食べる」
「ラーメン＋が」
「が＋食べる」
...
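The three patterns above can be enumerated mechanically once a noun, a verb, and a set of candidate particles are known. In the sketch below, the particle list `CANDIDATE_PARTICLES` and the function name `candidate_expressions` are hypothetical; the description shows を and が only as examples and does not fix the set of particles tried.

```python
# Assumed candidate particles; the example in the text continues beyond these two.
CANDIDATE_PARTICLES = ["を", "が"]


def candidate_expressions(noun, verb, particles=CANDIDATE_PARTICLES):
    """Build (pattern number, expression) candidates for a noun directly followed by a verb."""
    candidates = []
    for particle in particles:
        candidates.append((1, (noun, particle, verb)))  # Pattern 1: noun + particle + verb
        candidates.append((2, (noun, particle)))        # Pattern 2: noun + particle
        candidates.append((3, (particle, verb)))        # Pattern 3: particle + verb
    return candidates


# The verb is given in dictionary form, as in the example expressions.
for pattern, expression in candidate_expressions("ラーメン", "食べる"):
    print(pattern, "＋".join(expression))
```

Keeping the pattern number alongside each expression makes it straightforward to apply pattern-specific handling later, such as different weights.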

(S36) The appearance frequency counting unit 13 counts the appearance frequency of each candidate particle-containing expression using the reference text storage unit 10.
「ラーメン＋を＋食べる」: 10 times
「ラーメン＋を」: 50 times
「を＋食べる」: 100 times
「ラーメン＋が＋食べる」: 0 times
「ラーメン＋が」: 3 times
「が＋食べる」: 5 times
...
It is also preferable to weight each appearance frequency according to the patterns described above. For example, Pattern 1 is weighted more heavily than Patterns 2 and 3.
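A hedged sketch of the weighting in S36, followed by the selection and insertion that the particle omission complementing unit 14 then performs, is given below. The frequencies are the ones listed in S36; the concrete weight values and the names `PATTERN_WEIGHTS`, `particle_of`, `best_particle`, and `complement` are assumptions, since the text only states that Pattern 1 should be weighted more heavily.

```python
# Frequencies from the S36 example, keyed by (pattern number, expression).
counts = {
    (1, ("ラーメン", "を", "食べる")): 10,
    (2, ("ラーメン", "を")): 50,
    (3, ("を", "食べる")): 100,
    (1, ("ラーメン", "が", "食べる")): 0,
    (2, ("ラーメン", "が")): 3,
    (3, ("が", "食べる")): 5,
}

# Assumed weights: the text only says Pattern 1 should weigh more than Patterns 2 and 3.
PATTERN_WEIGHTS = {1: 3.0, 2: 1.0, 3: 1.0}


def particle_of(pattern, expression):
    """Return the particle: first element in Pattern 3, second element otherwise."""
    return expression[0] if pattern == 3 else expression[1]


def best_particle(counts, weights):
    """Pick the particle of the candidate with the highest weighted frequency."""
    pattern, expression = max(counts, key=lambda key: weights[key[0]] * counts[key])
    return particle_of(pattern, expression)


def complement(noun, predicate, particle):
    """Insert the chosen particle between the noun and the predicate."""
    return noun + particle + predicate


particle = best_particle(counts, PATTERN_WEIGHTS)
print(complement("ラーメン", "食べちゃった", particle))
```

With these assumed weights the weighted scores are 30, 50, 100, 0, 3, and 5, so the candidate 「を＋食べる」 wins and を is inserted.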

(S37) The particle omission complementing unit 14 complements the target text information with the particle of the particle-containing expression having the highest appearance frequency. For example, the text is complemented as follows.
「ラーメン食べちゃった」 -> 「ラーメンを食べちゃった」

FIG. 4 is an explanatory diagram showing the complementation of particle-omitted text based on part-of-speech sequences.

(S41) The same reference text information as in S31 of FIG. 3 is output to the particle-omitted text generation unit 11 as the source of the positive example data, and to the part-of-speech extraction unit 15 as negative example data.

(S42) As in S32 of FIG. 3, the particle-omitted text generation unit 11 generates particle-omitted text information by deleting the particles from the reference text information. The generated particle-omitted text information is output to the part-of-speech extraction unit 15 as positive example data.

(S43) The part-of-speech extraction unit 15 converts the particle-omitted text information input from the particle-omitted text generation unit 11 into a particle-omitted part-of-speech sequence. For example, the conversion is as follows.
「ラーメン食べていた」
-> (noun/general)(verb/independent)
「足速くて追いつかない」
-> (noun/general)(adjective/independent)(verb/independent)
The particle-omitted part-of-speech sequence is then input to the identification engine unit 12 as positive example data.

(S44) The part-of-speech extraction unit 15 converts the reference text information input from the reference text storage unit 10 into a particle-containing part-of-speech sequence. For example, the conversion is as follows.
「ラーメンを食べていた」
-> (noun/general) + を + (verb/independent)
「足が速くて追いつかない」
-> (noun/general) + が + (adjective/independent)(verb/independent)
The particle-containing part-of-speech sequence is input to the identification engine unit 12 as negative example data.

(S45) The identification engine unit 12 generates a learning database from the particle-omitted part-of-speech sequences serving as positive example data and the particle-containing part-of-speech sequences serving as negative example data.

(S46) Assume that the following target text information is input to the identification engine unit 12.
「ラーメン食べちゃった」 (colloquial "I ate ramen", with the particle を omitted)

（Ｓ４７）識別エンジン部１２は、候補となる助詞有り品詞列を出力する。
例えば、「ラーメン食べちゃった」については、以下のような助詞有り品詞列を、出現頻度計数部１３へ出力する。
「（名詞・一般）＋を＋（動詞・自立）」
「（名詞・一般）＋を」
「を＋（動詞・自立）」
「（名詞・一般）＋が＋（動詞・自立）」
「（名詞・一般）＋が」
「が＋（動詞・自立）」
・・・・・ (S47) The identification engine unit 12 outputs candidate part-of-speech strings with particles.
For example, for "ラーメン食べちゃった", the following part-of-speech strings with particles are output to the appearance frequency counting unit 13.
"(noun / general) + wo + (verb / independent)"
"(noun / general) + wo"
"wo + (verb / independent)"
"(noun / general) + ga + (verb / independent)"
"(noun / general) + ga"
"ga + (verb / independent)"
...
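The candidate list of S47 follows a regular pattern: for each candidate particle, one string spanning both neighbours of the gap, one with only the left neighbour, and one with only the right neighbour. A minimal sketch, assuming the gap position is already known and using the hypothetical helper `candidate_strings`:

```python
from typing import List

def tag(pos: str) -> str:
    """Render a part-of-speech tag in full-width parentheses."""
    return f"（{pos}）"

def candidate_strings(pos_seq: List[str], gap: int, particles: List[str]) -> List[str]:
    """Build candidate strings for a particle missing before pos_seq[gap].

    `gap` must satisfy 1 <= gap < len(pos_seq), so both neighbours exist.
    """
    left, right = pos_seq[gap - 1], pos_seq[gap]
    out = []
    for p in particles:
        out.append(f"{tag(left)}＋{p}＋{tag(right)}")  # both neighbours
        out.append(f"{tag(left)}＋{p}")                # left neighbour only
        out.append(f"{p}＋{tag(right)}")               # right neighbour only
    return out

# "ラーメン食べちゃった" as a part-of-speech sequence, with the gap before the verb.
for s in candidate_strings(["名詞・一般", "動詞・自立"], 1, ["を", "が"]):
    print(s)
```

The particle inventory `["を", "が"]` is limited to the two particles shown in the example; the ellipsis in the list above indicates that further candidates exist.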

（Ｓ４８）出現頻度計数部１３は、候補となる助詞有り品詞列について、基準文章蓄積部１０における出現頻度を計数する。
「（名詞・一般）（助詞・格助詞一般）（動詞・自立）」：１５００回
「（名詞・一般）（助詞・格助詞一般）」：９００回
「（助詞・格助詞一般）（動詞・自立）」：４５００回
「（名詞・一般）（助詞・係助詞）（動詞・自立）」：１０回
「（名詞・一般）（助詞・係助詞）」：２００回
「（助詞・係助詞）（動詞・自立）」：３５０回
・・・・・
尚、基準文章蓄積部１０が、基準文章情報に基づく品詞列を予め蓄積しているものであってもよいし、品詞抽出部１５が負例データとして出力した助詞有り品詞列を予め蓄積しているものであってもよい。 (S48) The appearance frequency counting unit 13 counts, for each candidate part-of-speech string with a particle, the appearance frequency in the reference sentence storage unit 10.
"(noun / general) (particle / case particle, general) (verb / independent)": 1500 times
"(noun / general) (particle / case particle, general)": 900 times
"(particle / case particle, general) (verb / independent)": 4500 times
"(noun / general) (particle / binding particle) (verb / independent)": 10 times
"(noun / general) (particle / binding particle)": 200 times
"(particle / binding particle) (verb / independent)": 350 times
...
Note that the reference sentence storage unit 10 may store in advance the part-of-speech strings based on the reference sentence information, or may store in advance the part-of-speech strings with particles that the part of speech extraction unit 15 output as negative example data.

（Ｓ４９）助詞落ち補完部１４は、出現頻度が最も高い助詞有り品詞列における助詞を、対象文章情報に対して補完する。例えば以下のように補完される。
「ラーメン食べちゃった」->「ラーメンを食べちゃった」 (S49) The particle dropping complementing unit 14 complements the target sentence information with the particle in the part-of-speech string with a particle that has the highest appearance frequency. For example, it is complemented as follows.
"ラーメン食べちゃった" -> "ラーメンを食べちゃった" (the particle を is inserted after ラーメン)
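The selection in S48 and S49 can be sketched with the counts from the example. The mapping from each particle to a part-of-speech class follows the order of the two lists in S47 and S48 and is an assumption of this sketch, as are the names `best_particle` and `complement`.

```python
from typing import Dict, List, Tuple

# Appearance counts copied from the S48 example.
FREQ: Dict[Tuple[str, ...], int] = {
    ("名詞・一般", "助詞・格助詞一般", "動詞・自立"): 1500,
    ("名詞・一般", "助詞・格助詞一般"): 900,
    ("助詞・格助詞一般", "動詞・自立"): 4500,
    ("名詞・一般", "助詞・係助詞", "動詞・自立"): 10,
    ("名詞・一般", "助詞・係助詞"): 200,
    ("助詞・係助詞", "動詞・自立"): 350,
}

# Particle -> part-of-speech class, paired as in the S47/S48 lists (assumption).
PARTICLE_CLASS = {"を": "助詞・格助詞一般", "が": "助詞・係助詞"}

def best_particle(left: str, right: str,
                  freq: Dict[Tuple[str, ...], int],
                  classes: Dict[str, str]) -> str:
    """Return the particle whose most frequent candidate string ranks highest."""
    def top_count(cls: str) -> int:
        keys = [(left, cls, right), (left, cls), (cls, right)]
        return max(freq.get(k, 0) for k in keys)
    return max(classes, key=lambda p: top_count(classes[p]))

def complement(words: List[str], gap: int, particle: str) -> str:
    """S49: insert the chosen particle before words[gap] and join the sentence."""
    return "".join(words[:gap] + [particle] + words[gap:])

particle = best_particle("名詞・一般", "動詞・自立", FREQ, PARTICLE_CLASS)
print(complement(["ラーメン", "食べちゃった"], 1, particle))
```

Here the string with the highest count (4500 times) belongs to the class paired with を, so を is inserted. Taking the single highest count follows the wording of S49; combining the three counts per particle would be a different design.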

図５は、本発明における文章解析サーバのシステム構成図である。 FIG. 5 is a system configuration diagram of the sentence analysis server according to the present invention.

図５によれば、文章解析サーバ２は、通信インタフェース部と、助詞落ち補完機能部と、文章内容解析部とを有する。文章解析サーバ２は、通信インタフェース部を介してインターネットに接続する。 According to FIG. 5, the sentence analysis server 2 includes a communication interface unit, a particle dropping complement function unit, and a sentence content analysis unit. The sentence analysis server 2 is connected to the Internet via a communication interface unit.

また、図５によれば、文章解析サーバ２は、インターネットを介して、Ｗｅｂサーバ３と通信することができる。また、Ｗｅｂサーバ３は、投稿者用端末４から接続される。 Moreover, according to FIG. 5, the sentence analysis server 2 can communicate with the Web server 3 via the Internet. The poster terminal 4 connects to the Web server 3.

Ｗｅｂサーバ３は、投稿者用端末４から受信した、解析対象文章であるブログテキスト及びクチコミコメントのようなＷｅｂ文書を公開する。文章解析サーバ２は、インターネットを介して、Ｗｅｂサーバ３から、そのＷｅｂ文書を解析対象文章として取得する。 The Web server 3 publishes Web documents such as blog text and word-of-mouth comments, which are analysis target sentences, received from the poster terminal 4. The sentence analysis server 2 acquires the Web document as an analysis target sentence from the Web server 3 via the Internet.

助詞落ち補完機能部は、通信インタフェース部を介して、対象文章情報を受信する。その対象文章情報に対して助詞落ちを補完する。助詞落ちが補完された対象文章情報は、文章内容解析部へ出力される。文章内容解析部は、様々な観点から文章内容を解析し、対象文章情報を特定カテゴリに分類することもできる。 The particle dropping complement function unit receives the target sentence information via the communication interface unit and complements the dropped particles in that target sentence information. The target sentence information with the dropped particles complemented is output to the sentence content analysis unit. The sentence content analysis unit analyzes the sentence content from various viewpoints and can also classify the target sentence information into a specific category.

図６は、本発明におけるシステムのシーケンス図である。 FIG. 6 is a sequence diagram of the system according to the present invention.

（Ｓ６０１）基準文章情報から、助詞を削除することによって助詞落ち文章情報を生成する。
（Ｓ６０２）識別エンジンが、品詞列の学習データベースを生成する場合、正例データの助詞落ち文章情報を助詞落ち品詞列に変換し、負例データの基準文章情報を助詞有り品詞列に変換する。
（Ｓ６０３）識別エンジンは、助詞落ち文章情報を正例データとし、基準文章情報を負例データとして、学習データベースを生成する。 (S601) Particle missing sentence information is generated by deleting particles from the reference sentence information.
(S602) When the identification engine generates a learning database of part-of-speech strings, the particle missing sentence information of the positive example data is converted into particle missing part-of-speech strings, and the reference sentence information of the negative example data is converted into part-of-speech strings with particles.
(S603) The identification engine generates the learning database using the particle missing sentence information as positive example data and the reference sentence information as negative example data.

（Ｓ６１１）投稿者用端末４は、対象文章情報であるブログテキストをＷｅｂサーバ３へ投稿する。対象文章情報は、助詞落ち表現を含むとする。
（Ｓ６１２）文章解析サーバ２は、Ｗｅｂサーバ３から対象文章情報（「ラーメン食べた」）を受信する。 (S611) The poster terminal 4 posts blog text, which is the target sentence information, to the Web server 3. It is assumed that the target sentence information includes a particle dropping expression.
(S612) The sentence analysis server 2 receives the target sentence information ("ラーメン食べた", "ate ramen" with the particle dropped) from the Web server 3.

（Ｓ６１３）識別エンジンが、品詞列の学習データベースを生成している場合、対象文章情報を品詞列に変換する。
（Ｓ６１４）識別エンジンが、対象文章情報の助詞落ち表現を特定し、その候補となる助詞有り表現を出力する。
（Ｓ６１５）候補となる助詞有り表現について、基準文章蓄積部を用いて出現頻度を計数する。
（Ｓ６１６）出現頻度が最も高い助詞有り表現における助詞を、対象文章情報に対して補完する。
（Ｓ６１７）助詞落ち表現が補完された対象文章情報に基づいて、文章内容の解析処理が実行される。 (S613) When the identification engine has generated a learning database of part-of-speech strings, the target sentence information is converted into a part-of-speech string.
(S614) The identification engine identifies the particle dropping expression in the target sentence information and outputs the candidate expressions with particles.
(S615) For each candidate expression with a particle, the appearance frequency is counted using the reference sentence storage unit.
(S616) The target sentence information is complemented with the particle in the expression with a particle that has the highest appearance frequency.
(S617) The sentence content analysis process is executed on the target sentence information in which the particle dropping expression has been complemented.
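The server-side flow of S614 to S617 can be pictured as a pipeline whose stages are supplied separately, mirroring the separate units in FIG. 5. This is a sketch under stated assumptions: the class `ParticlePipeline` and its fields are hypothetical, the part-of-speech conversion of S613 is omitted, and the stages in the usage example are stubs rather than a trained identification engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ParticlePipeline:
    # S614: index before which a particle is missing, or None if nothing is missing.
    find_gap: Callable[[List[str]], Optional[int]]
    # S615: appearance frequency of a candidate particle at the given gap.
    count: Callable[[List[str], int, str], int]
    # Candidate particles to try at the gap.
    particles: List[str]
    # S617: downstream sentence content analysis.
    analyze: Callable[[str], str]

    def run(self, words: List[str]) -> str:
        gap = self.find_gap(words)
        if gap is not None:
            # S616: insert the candidate with the highest appearance frequency.
            best = max(self.particles, key=lambda p: self.count(words, gap, p))
            words = words[:gap] + [best] + words[gap:]
        return self.analyze("".join(words))

# Stub stages for illustration only; the counts reuse figures from S48.
pipeline = ParticlePipeline(
    find_gap=lambda w: 1 if len(w) == 2 else None,
    count=lambda w, gap, p: {"を": 4500, "が": 350}[p],
    particles=["を", "が"],
    analyze=lambda sentence: sentence,  # placeholder: returns the sentence unchanged
)
print(pipeline.run(["ラーメン", "食べた"]))
```

Keeping the stages as injected callables lets the identification engine (for example, rule-based or SVM-based) be swapped without touching the completion logic.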

以上、詳細に説明したように、本発明の助詞落ち補完プログラム、装置、サーバ及び方法によれば、対象文章情報について助詞落ちの有無を検出する共に、落ちた助詞を補完することによって、対象文章情報の解析精度を向上させることができる。特に、本発明によれば、既存の新聞文書のみを対象文章情報として用いることできるので、解析精度が向上し、且つ、汎用性が高いという効果を有する。 As described above in detail, according to the particle dropping complement program, apparatus, server, and method of the present invention, the analysis accuracy of target sentence information can be improved by detecting whether a particle has been dropped from the target sentence information and complementing the dropped particle. In particular, according to the present invention, since only existing newspaper documents can be used as target sentence information, the analysis accuracy is improved and the versatility is high.

本発明では、一般ユーザによって記述された文章に頻繁に見られる口語的な表現（助詞落ち表現）に対して、係り受け解析精度を低下させる要因である助詞落ちを発見し且つ補完することができる。これによって、口語的な文章を、自然で読みやすい文章に訂正する。 In the present invention, particle dropping, which is a factor that lowers dependency analysis accuracy, can be found and complemented in the colloquial expressions (particle dropping expressions) frequently seen in sentences written by general users. This corrects colloquial sentences into natural and easy-to-read sentences.

前述した本発明の種々の実施形態について、本発明の技術思想及び見地の範囲の種々の変更、修正及び省略は、当業者によれば容易に行うことができる。前述の説明はあくまで例であって、何ら制約しようとするものではない。本発明は、特許請求の範囲及びその均等物として限定するものにのみ制約される。 Various changes, modifications, and omissions of the above-described embodiments, within the scope of the technical idea and viewpoint of the present invention, can be easily made by those skilled in the art. The above description is merely an example and is not intended to be restrictive. The invention is limited only as defined in the claims and their equivalents.

１助詞落ち補完プログラム
１０基準文章蓄積部
１１助詞落ち文章生成部
１２識別エンジン部
１３出現頻度計数部
１４助詞落ち補完部
１５品詞抽出部
２文章解析サーバ
３Ｗｅｂサーバ
４投稿用端末
DESCRIPTION OF SYMBOLS
1 Particle dropping complement program
10 Reference sentence storage unit
11 Particle missing sentence generation unit
12 Identification engine unit
13 Appearance frequency counting unit
14 Particle dropping complementing unit
15 Part of speech extraction unit
2 Sentence analysis server
3 Web server
4 Posting terminal

Claims

A particle complement program that causes a computer to function so as to complement a particle in target sentence information including a particle dropping expression, the program causing the computer to function as:
reference sentence storage means for storing reference sentence information that does not include a particle dropping expression;
particle missing sentence generating means for generating particle missing sentence information by deleting a particle from the reference sentence information;
identification engine means for configuring a two-class pattern classifier with the particle missing sentence information as positive example data and the reference sentence information as negative example data, identifying whether or not input target sentence information is a particle dropping expression, and extracting a plurality of candidate expressions with particles corresponding to the particle dropping expression;
appearance frequency counting means for counting, for each candidate expression with a particle, the appearance frequency in the reference sentence information stored in the reference sentence storage means; and
particle dropping complementing means for complementing the target sentence information with the particle in the expression with a particle that has the highest appearance frequency.

The particle complement program according to claim 1, further causing the computer to function such that the identification engine means is based on a support vector machine (Support Vector Machine) or is rule-based.

The particle complement program according to claim 1, further causing the computer to function such that the reference sentence information is publicly available sentence information written by a trusted specific user, and the target sentence information is publicly available sentence information written by an unspecified number of users.

A sentence analysis device that complements a particle in target sentence information including a particle dropping expression, comprising:
reference sentence storage means for storing reference sentence information that does not include a particle dropping expression;
particle missing sentence generating means for generating particle missing sentence information by deleting a particle from the reference sentence information;
identification engine means for configuring a two-class pattern classifier with the particle missing sentence information as positive example data and the reference sentence information as negative example data, identifying whether or not input target sentence information is a particle dropping expression, and extracting a plurality of candidate expressions with particles corresponding to the particle dropping expression;
appearance frequency counting means for counting, for each candidate expression with a particle, the appearance frequency in the reference sentence information stored in the reference sentence storage means; and
particle dropping complementing means for complementing the target sentence information with the particle in the expression with a particle that has the highest appearance frequency.

A sentence analysis server that obtains target sentence information including a particle dropping expression from another public server via a network and complements the particle, comprising:
reference sentence storage means for storing reference sentence information that does not include a particle dropping expression;
particle missing sentence generating means for generating particle missing sentence information by deleting a particle from the reference sentence information;
identification engine means for configuring a two-class pattern classifier with the particle missing sentence information as positive example data and the reference sentence information as negative example data, identifying whether or not input target sentence information is a particle dropping expression, and extracting a plurality of candidate expressions with particles corresponding to the particle dropping expression;
appearance frequency counting means for counting, for each candidate expression with a particle, the appearance frequency in the reference sentence information stored in the reference sentence storage means; and
particle dropping complementing means for complementing the target sentence information with the particle in the expression with a particle that has the highest appearance frequency.

A particle dropping complementing method for complementing a particle in target sentence information including a particle dropping expression, using a device equipped with a computer, wherein the device has a reference sentence storage unit that stores reference sentence information that does not include a particle dropping expression, the method comprising:
a first step of generating particle missing sentence information by deleting a particle from the reference sentence information;
a second step of configuring a two-class pattern classifier with the particle missing sentence information as positive example data and the reference sentence information as negative example data;
a third step of extracting, when input target sentence information is identified as a particle dropping expression using the pattern classifier, a plurality of candidate expressions with particles corresponding to the particle dropping expression;
a fourth step of counting, for each candidate expression with a particle, the appearance frequency in the reference sentence information stored in the reference sentence storage unit; and
a fifth step of complementing the target sentence information with the particle in the expression with a particle that has the highest appearance frequency.