JP2003202893A

JP2003202893A - Corpus processor for generating statistical language model, and method and program thereof

Info

Publication number: JP2003202893A
Application number: JP2001401616A
Authority: JP
Inventors: Hisayoshi Nagae; 尚義永江
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2003-07-18
Anticipated expiration: 2021-12-28
Also published as: JP3725470B2

Abstract

PROBLEM TO BE SOLVED: To prevent confidential information contained in a database or an existing corpus from leaking.

SOLUTION: A text analysis unit 102 outputs the morphological analysis result of a text. A storage unit 104 stores mask rules describing information for masking, for example, personal names. A secret information mask unit 103 masks the morphological analysis result according to the mask rules, so the morphological analysis result output from the secret information mask unit 103 contains no confidential information. A corpus accumulation unit 105 accumulates the morphological analysis results to produce a corpus from which the confidential information has been excluded. A statistical language model is generated from this corpus. Accordingly, no confidential information leaks during speech recognition processing that uses the statistical language model.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a corpus processing device, method, and program for creating a statistical language model, designed to prevent personal information and other confidential information from leaking during speech recognition processing.

[0002]

[Description of the Related Art] In recent years, Japanese dictation systems that allow Japanese text to be entered by voice have come into practical use. Such systems achieve high recognition performance by using not only acoustic correctness but also information on the occurrence probabilities of word combinations, known as a statistical language model. The occurrence probabilities of the word combinations that make up the statistical language model are obtained by aggregating statistics, using statistical methods, over a large body of text called a corpus.

[0003] Consequently, the occurrence probabilities of word combinations in a statistical language model depend strongly on the content of the sentences in the corpus (text) used to create the model. That is, a dictation system readily recognizes expressions that appear in the text used to create the statistical language model and, conversely, has difficulty recognizing expressions that do not. As a result, when a user speaks a sentence that does not appear in that text, the system frequently misrecognizes it as a character string that is acoustically close to a word combination that does appear in the text.
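
As a rough illustration of why recognition depends so strongly on the corpus, the following minimal sketch estimates word-pair (bigram) probabilities by relative frequency. The toy corpus, the English tokens, and the function name `bigram_probabilities` are assumptions made for illustration; the source does not specify an estimation formula.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(next | previous) by relative frequency over tokenized sentences."""
    pair_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}

# Toy corpus (hypothetical, already split into words).
corpus = [
    ["patient", "has", "gastric", "cancer"],
    ["patient", "has", "gastric", "ulcer"],
    ["patient", "has", "lung", "cancer"],
]
probs = bigram_probabilities(corpus)
seen = probs[("gastric", "cancer")]            # pair occurs in the corpus
unseen = probs.get(("gastric", "fever"), 0.0)  # pair never occurs
```

Under this simple estimate, a pair that never occurs in the corpus receives probability 0, which mirrors the observation that expressions absent from the corpus are hard to recognize.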

[0004] These techniques are described in detail in the following references.

[0005] Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識), The Institute of Electronics, Information and Communication Engineers, ISBN 4-88552-072-X.
Kenji Kita, "Spoken Language Processing" (音声言語処理), Morikita Publishing, ISBN 4-627-82380-0.
Kenji Kita, "Probabilistic Language Models" (確率的言語モデル), University of Tokyo Press, ISBN 4-13-065404-7.

[0006]

[Problems to Be Solved by the Invention] As described above, every expression that appears in the corpus (text) is directly reflected in the occurrence probabilities of the word combinations in the statistical language model. Consequently, in a dictation system, expressions that appeared in the corpus (text) tend to appear in recognition results.

[0007] Therefore, if the corpus (text) used to create the statistical language model contains private personal information or the like, that personal information may appear in an erroneous recognition result and may thereby be exposed to other people.

[0008] For example, when performing speech recognition in the medical field, recognition accuracy can be improved by using a medical corpus. However, medical corpora, such as those built from medical records, often contain personal information, and there has been the problem that personal information may be exposed through the results of speech recognition processing.

[0009] An object of the present invention is to provide a corpus processing device, method, and program for creating a statistical language model that can prevent private personal information from being erroneously output in recognition results. This is achieved by removing the notation (written-form) information relating to confidential information from the text used for aggregating statistics, either in the process of collecting text for the statistical language model or in the process of creating the statistical language model from the collected text.

[0010]

[Means for Solving the Problems] A corpus processing device for creating a statistical language model according to claim 1 of the present invention comprises: a text analysis unit that performs morphological analysis on text data and outputs a morphological analysis result; a secret information mask unit that masks the morphological analysis result according to predetermined mask rules for masking words contained in confidential information; a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result masked by the secret information mask unit; and a corpus statistics aggregation unit that collects statistical information from the corpus accumulated by the corpus accumulation unit. A corpus processing device for creating a statistical language model according to claim 2 of the present invention comprises: a text analysis unit that performs morphological analysis on text data and outputs a morphological analysis result; a secret information replacement unit that replaces words in the morphological analysis result according to predetermined replacement rules for replacing words contained in confidential information with other words; a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result replaced by the secret information replacement unit; and a corpus statistics aggregation unit that collects statistical information from the corpus accumulated by the corpus accumulation unit.

[0011] In claim 1 of the present invention, the morphological analysis result from the text analysis unit is supplied to the secret information mask unit. The secret information mask unit masks the morphological analysis result according to predetermined mask rules for masking words contained in confidential information. The corpus accumulation unit accumulates, as a corpus, the morphological analysis result in which the words contained in confidential information have been masked. The corpus statistics aggregation unit collects statistical information based on the accumulated corpus. Because the confidential information is masked in the statistical information, it is prevented from leaking during speech recognition processing that uses the statistical information.
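
The flow of claim 1 (analysis, masking, accumulation, statistics) can be sketched as a small pipeline. This is only a sketch under stated assumptions: the function `process_corpus`, the whitespace tokenizer `toy_analyze`, the part-of-speech labels, and the mask symbol are all hypothetical stand-ins, not the units defined in the source.

```python
from collections import Counter

def process_corpus(texts, analyze, mask, n=2):
    """Run analysis -> masking -> accumulation -> statistics over raw texts."""
    corpus = []                              # role of the corpus accumulation unit
    for text in texts:
        morphemes = analyze(text)            # role of the text analysis unit
        corpus.append(mask(morphemes))       # role of the secret information mask unit
    counts = Counter()                       # role of the corpus statistics aggregation unit
    for sentence in corpus:
        words = [surface for surface, _pos in sentence]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return corpus, counts

# Stand-in stages (hypothetical): whitespace tokenizer with a toy POS table.
POS = {"Nihon": "surname", "Ichiro": "given-name", ":": "symbol"}

def toy_analyze(text):
    return [(w, POS.get(w, "noun")) for w in text.split()]

def toy_mask(morphemes):
    return [("○", None) if pos in ("surname", "given-name") else (w, pos)
            for w, pos in morphemes]

corpus, counts = process_corpus(
    ["Nihon Ichiro : gastric-cancer"], toy_analyze, toy_mask
)
```

Because masking happens before accumulation, neither the stored corpus nor the word-pair counts derived from it contain the original names.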

[0012] In claim 2 of the present invention, the morphological analysis result from the text analysis unit is supplied to the secret information replacement unit. The secret information replacement unit replaces words in the morphological analysis result according to predetermined replacement rules for replacing words contained in confidential information with other words. The corpus accumulation unit accumulates, as a corpus, the morphological analysis result in which the words contained in confidential information have been replaced. The corpus statistics aggregation unit collects statistical information based on the accumulated corpus. Because the confidential information has been replaced with other words in the statistical information, it is prevented from leaking during speech recognition processing that uses the statistical information.

[0013] The present invention, described as a device, also holds as an invention relating to a method.

[0014] The present invention, described as a device, also holds as a program for causing a computer to execute processing corresponding to the invention.

[0015]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing a corpus processing device for creating a statistical language model according to a first embodiment of the present invention.

[0016] In this embodiment, in the process of collecting text for a statistical language model, the notation information relating to information that should be kept confidential, such as personal information (hereinafter, confidential information), is removed in advance from the text used for aggregating statistics. This prevents private personal information and the like from being erroneously output in the recognition results of a dictation system or similar system.

[0017] Various kinds of data are stored in the database 100. In a medical setting such as a hospital, for example, the information stored in the database 100 may include medical records and X-ray imaging results. The information in the database 100 also includes document data containing the text needed to create a statistical language model. This document data may contain information that should be kept confidential, for example notation information relating to personal information.

[0018] The text extraction unit 101 extracts text information usable as a corpus from the database 100 in which the data is stored, and supplies it to the text analysis unit 102.

[0019] The text analysis unit 102 outputs the result of morphological analysis of the text extracted by the text extraction unit 101. For example, the text analysis unit 102 divides the sentences in the text into words and obtains a morphological analysis result consisting of pairs of each word (headword) and its part of speech. Any known morphological analysis technique that divides text into morphemes, each consisting of a headword and a part of speech, may be used here.

[0020] The analysis result of the text analysis unit 102 is supplied to the secret information mask unit 103. The secret information mask unit 103 reads the mask rules stored in the storage unit 104 and masks the morphological analysis result from the text analysis unit 102 according to those rules. The mask rules describe the information needed to delete confidential information, such as personal information, when the corpus is created.

[0021] FIG. 2 is an explanatory diagram illustrating the mask rules stored in the storage unit 104.

[0022] In the example of FIG. 2, the mask rules are stored as pairs of a part of speech and a headword. FIG. 2 shows an example in which two records are stored. The * (asterisk) in FIG. 2 indicates that no constraint is placed on the headword, that is, that all words belonging to that part of speech are masked. The example of FIG. 2 masks all headwords (words) whose part of speech is "person name: surname" and all headwords (words) whose part of speech is "person name: given name".

[0023] For example, by writing "田中" (Tanaka) in the headword column instead of the asterisk, only the word "田中" among the words whose part of speech is "person name: surname" is masked.
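
A mask rule of this kind, a part of speech paired with either a specific headword or the wildcard *, might be represented as follows. This is a minimal sketch: the `MaskRule` type, the `is_masked` helper, and the part-of-speech label strings are assumptions for illustration, since the source only describes the rule table of FIG. 2 in prose.

```python
from typing import NamedTuple

class MaskRule(NamedTuple):
    pos: str       # part of speech the rule applies to
    headword: str  # "*" means any headword with that part of speech

def is_masked(surface, pos, rules):
    """Return True if some rule matches both the part of speech and the headword."""
    return any(r.pos == pos and r.headword in ("*", surface) for r in rules)

# Two records, as described for FIG. 2 (label strings are hypothetical).
fig2_rules = [
    MaskRule("person-name:surname", "*"),
    MaskRule("person-name:given", "*"),
]
# Variant that masks only the surname 田中 (Tanaka).
tanaka_only = [MaskRule("person-name:surname", "田中")]
```

With `fig2_rules`, every surname and given name matches; with `tanaka_only`, a surname such as 日本 is left untouched while 田中 is masked.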

[0024] Although the example of FIG. 2 shows the storage unit 104 storing mask rules written in terms of the part-of-speech information of morphemes, mask rules can also be expressed by a headword alone or by a combination of a headword and a part of speech.

[0025] The secret information mask unit 103 outputs the morphological analysis result, masked according to the mask rules, to the corpus accumulation unit 105. The corpus accumulation unit 105 passes the masked morphological analysis result to the storage unit 106, where it is stored as a corpus, free of confidential information, for creating the statistical language model. In this way, the storage unit 106 accumulates a corpus from which confidential information has been removed according to the mask rules. The corpus may be stored in the storage unit 106 either as morphological analysis results or as plain text.

[0026] The corpus statistics aggregation unit 107 reads the corpus stored in the storage unit 106, creates a statistical language model, and outputs it to the storage unit 108. Any known technique may be used by the corpus statistics aggregation unit 107 to create the statistical language model (N-gram) from the corpus. The storage unit 108 stores the statistical language model.
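
The source leaves the N-gram construction to known techniques. As one possible illustration, the sketch below counts N-grams over an already masked corpus, padding each entry with boundary markers. The function name `ngram_counts`, the `<s>` and `</s>` markers, and the sample corpus are assumptions, not details taken from the source.

```python
from collections import Counter

def ngram_counts(sentences, n=2):
    """Count N-grams over tokenized sentences, padded with boundary markers."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Masked corpus (hypothetical): names already replaced by the mask symbol.
masked_corpus = [
    ["○", "○", "：", "胃癌"],
    ["○", "○", "：", "胃潰瘍"],
    ["○", "○", "：", "肺癌"],
]
counts = ngram_counts(masked_corpus, n=2)
```

Since the counts are taken from the masked corpus, no N-gram in the resulting statistics contains an original name.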

[0027] Speech recognition processing is performed using the statistical language model stored in the storage unit 108. Because the corpus from which the statistical language model was created contains no confidential information, the confidential information stored in the database 100 is unlikely to appear in speech recognition results.

[0028] Next, the operation of the embodiment configured in this way is described with reference to the explanatory diagrams of FIGS. 3 to 5.

[0029] The following description assumes that medical information is stored in the database 100. The text extraction unit 101 extracts text from the various information stored in the database 100 and outputs it to the text analysis unit 102. Through morphological analysis, the text analysis unit 102 obtains from the text a morphological analysis result consisting of pairs of a headword and its part of speech.

[0030] Suppose that part of the text extracted by the text extraction unit 101 is as shown in FIG. 3. FIG. 3 shows that the extracted text contained entries indicating that Ichiro Nihon (日本一郎) has gastric cancer, that Jiro Nihon (日本次郎) has a gastric ulcer, and that Saburo Nihon (日本三郎) has lung cancer.

[0031] In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 4. That is, as shown in FIG. 4, each word string extracted in FIG. 3 is divided into words, and each word is assigned its part of speech. For example, "日本一郎：胃癌" (Nihon Ichiro: gastric cancer) in FIG. 3 consists of the four words 日本 / 一郎 / ： / 胃癌, and the analysis yields that "日本" (Nihon) has the part of speech "person name: surname", "一郎" (Ichiro) has the part of speech "person name: given name", "：" is a symbol, and "胃癌" (gastric cancer) is a noun.
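
To make the shape of such an analysis result concrete, here is a toy segmenter that reproduces the four-word split described for "日本一郎：胃癌". It is a sketch only: the source allows any known morphological analysis technique, and the greedy longest-match strategy, the tiny `LEXICON`, the `Morpheme` type, and the part-of-speech label strings are all assumptions chosen for illustration.

```python
from typing import NamedTuple

class Morpheme(NamedTuple):
    surface: str  # the headword
    pos: str      # part of speech

# Tiny hypothetical lexicon; a real analyzer would use a full dictionary.
LEXICON = {
    "日本": "person-name:surname",
    "一郎": "person-name:given",
    "：": "symbol",
    "胃癌": "noun",
}

def analyze(text, lexicon=LEXICON):
    """Greedy longest-match segmentation into (surface, pos) morphemes."""
    result = []
    i = 0
    max_len = max(len(word) for word in lexicon)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in lexicon:
                result.append(Morpheme(word, lexicon[word]))
                i += length
                break
        else:
            # Unknown character: emit it as a single-character morpheme.
            result.append(Morpheme(text[i], "unknown"))
            i += 1
    return result

morphemes = analyze("日本一郎：胃癌")
```

Each element pairs a headword with its part of speech, which is the form that the later masking or replacement step consumes.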

[0032] The analysis result of the text analysis unit 102 is supplied to the secret information mask unit 103. The secret information mask unit 103 reads the mask rules from the storage unit 104. Suppose, for example, that the rules shown in FIG. 2 are written as the mask rules. In this case, the secret information mask unit 103 masks, in the analysis result of the text analysis unit 102, all words whose part of speech is "person name: surname" and all words whose part of speech is "person name: given name", and outputs the masked morphological analysis result to the corpus accumulation unit 105.

[0033] For example, if the secret information mask unit 103 represents words masked according to the mask rules by ○ or ● (black circle) or the like, it outputs the morphological analysis result shown in FIG. 5.

[0034] As is clear from a comparison of FIG. 5 with FIGS. 3 and 4, the words in the morphological analysis result whose part of speech is "person name: surname" or "person name: given name" are masked by ○ or ● (black circle), respectively, before being output from the secret information mask unit 103. That is, in the example of FIG. 3, personal information that should be kept confidential, such as personal names, within the medical information contained in hospital medical records and the like, is prevented from being output as part of the morphological analysis result.
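
The masking step applied to the analysis of "日本一郎：胃癌" can be sketched as follows. The function `apply_mask`, the tuple layout, and the label strings are hypothetical; a single mask symbol ○ is used for simplicity, and dropping the part of speech along with the word follows the later remark that the part-of-speech information of a masked name is also deleted.

```python
MASK_SYMBOL = "○"

def apply_mask(morphemes, rules):
    """Replace every morpheme matched by a (pos, headword) rule with a mask symbol."""
    masked = []
    for surface, pos in morphemes:
        hit = any(rule_pos == pos and rule_head in ("*", surface)
                  for rule_pos, rule_head in rules)
        # A masked entry keeps neither its surface form nor its part of speech.
        masked.append((MASK_SYMBOL, None) if hit else (surface, pos))
    return masked

rules = [("person-name:surname", "*"), ("person-name:given", "*")]
analysis = [
    ("日本", "person-name:surname"),
    ("一郎", "person-name:given"),
    ("：", "symbol"),
    ("胃癌", "noun"),
]
masked = apply_mask(analysis, rules)
```

The diagnosis and the separator survive unchanged, while both name words are reduced to the mask symbol.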

[0035] The corpus accumulation unit 105 stores the morphological analysis result masked by the secret information mask unit 103 in the storage unit 106. In this way, the storage unit 106 accumulates a corpus from which confidential information has been excluded. The corpus statistics aggregation unit 107 creates a statistical language model based on the accumulated corpus and stores it in the storage unit 108.

[0036] Speech recognition processing is performed using the statistical language model stored in the storage unit 108. Because this statistical language model was created from a corpus from which confidential information has been excluded, confidential information is prevented from appearing in speech recognition results.

[0037] As described above, in this embodiment confidential information is excluded when the corpus is created from the text extracted from the database, so confidential information can be prevented from appearing in speech recognition results when the resulting statistical language model is used. That is, even when using a database of data such as personal information that must be managed under strict security for contractual or legal reasons, leakage of confidential information can be completely prevented.

[0038] FIG. 6 is a block diagram showing a second embodiment of the present invention. In FIG. 6, components identical to those in FIG. 1 are given the same reference numerals, and their description is omitted.

[0039] In the first embodiment, leakage of confidential information was prevented by masking it. As shown in FIG. 5, the portion that was originally a name is masked, and its part-of-speech information is deleted as well. That is, none of the information about confidential items such as names, including the part-of-speech information, is used in the morphological analysis result, so it may become difficult to convert names and the like accurately during speech recognition processing. In this embodiment, therefore, confidential information is replaced with other, typical words so that it can still be used in the morphological analysis result.

[0040] This embodiment differs from the first embodiment in that a secret information replacement unit 110 and a storage unit 111 are used in place of the secret information mask unit 103 and the storage unit 104, respectively.

[0041] The storage unit 111 stores replacement rules. According to the replacement rules in the storage unit 111, the secret information replacement unit 110 replaces the words of confidential information in the morphological analysis result from the text analysis unit 102 with other, typical words and outputs the result to the corpus accumulation unit 105.

[0042] Next, the operation of the embodiment configured in this way is described with reference to the explanatory diagrams of FIGS. 3, 7, and 8. FIGS. 7 and 8 correspond to FIGS. 2 and 4, respectively.

[0043] FIG. 7 shows an example of the replacement rules stored in the storage unit 111. In the example of FIG. 7, the replacement rules are stored as sets of a part of speech, a headword, and a typical word. FIG. 7 shows an example in which two records are stored. In FIG. 7 as well, the * (asterisk) indicates that no constraint is placed on the headword, that is, that all words belonging to that part of speech are masked. The typical word in FIG. 7 is the word substituted when a word in the morphological analysis result matches the word specified by the part of speech and headword.

[0044] In the example of FIG. 7, for instance, all headwords (words) whose part of speech is "person name: surname" are replaced with the word "鈴木" (Suzuki), and the part of speech of the replacement is set to "person name: surname". Although FIG. 7 shows replacement rules written in terms of the part-of-speech information of morphemes, the rules can also be expressed by a headword alone or by a combination of a headword and a part of speech.

[0045] The following description assumes that part of the text extracted by the text extraction unit 101 is as shown in FIG. 3.

[0046] The morphological analysis result in this case is, for example, as shown in FIG. 4, and this result is supplied to the secret information replacement unit 110. The secret information replacement unit 110 reads the replacement rules from the storage unit 111. If the replacement rules are those shown in FIG. 7, the secret information replacement unit 110 replaces, in the analysis result of the text analysis unit 102, all words whose part of speech is "person name: surname" with "鈴木" (Suzuki), keeping the part of speech "person name: surname", and all words whose part of speech is "person name: given name" with "太郎" (Taro), keeping the part of speech "person name: given name". It then outputs the resulting morphological analysis result to the corpus accumulation unit 105.

[0047] In this way, the secret information replacement unit 110 outputs the morphological analysis result shown in FIG. 8. As is clear from a comparison of FIG. 4 with FIG. 8, the words in the morphological analysis result whose part of speech is "person name: surname" or "person name: given name" are replaced with the typical words "鈴木" (Suzuki) and "太郎" (Taro) before being output from the secret information replacement unit 110. That is, personal information that should be kept confidential, such as personal names, within the medical information contained in, for example, hospital medical records does not appear under the original name but is replaced with a typical name such as "鈴木太郎" (Taro Suzuki), so personal information is prevented from being output as part of the morphological analysis result. In addition, because the replacement word carries the same part-of-speech information as the word it replaces, the part-of-speech information can be used during speech recognition processing.
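
The replacement step of the second embodiment can be sketched in the same style. The function `apply_replacement`, the three-field rule tuples, and the label strings are hypothetical; the typical words 鈴木 and 太郎 are the ones named in the description of FIG. 7.

```python
def apply_replacement(morphemes, rules):
    """Replace matched words with a typical word while keeping the part of speech.

    rules: list of (pos, headword_or_star, typical_word) tuples.
    """
    out = []
    for surface, pos in morphemes:
        for rule_pos, headword, typical in rules:
            if rule_pos == pos and headword in ("*", surface):
                out.append((typical, pos))  # part of speech is preserved
                break
        else:
            out.append((surface, pos))      # no rule matched: keep as-is
    return out

fig7_rules = [
    ("person-name:surname", "*", "鈴木"),
    ("person-name:given", "*", "太郎"),
]
analysis = [
    ("日本", "person-name:surname"),
    ("一郎", "person-name:given"),
    ("：", "symbol"),
    ("胃癌", "noun"),
]
replaced = apply_replacement(analysis, fig7_rules)
```

Unlike masking, every output entry still carries a part of speech, so downstream statistics can continue to treat the substituted words as a surname and a given name.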

[0048] The other operations are the same as in the first embodiment.

[0049] As described above, in this embodiment confidential information is replaced with typical words before being output in the morphological analysis result, so confidential information can be prevented from appearing in speech recognition results even when the resulting statistical language model is used. Moreover, because part-of-speech information is not lost when a word is replaced, speech recognition processing using the statistical language model can make use of part-of-speech information.

[0050] FIG. 9 is a block diagram showing a third embodiment of the present invention. In FIG. 9, components identical to those in FIG. 6 are given the same reference numerals, and their description is omitted.

[0051] In the second embodiment, leakage of confidential information was prevented by replacing it with typical words. Similarly, this embodiment prevents leakage of confidential information by replacing the words of confidential information with suitable placeholder characters. On replacement, the placeholder is given the same part-of-speech information as the word it replaces.

[0052] This embodiment differs from the second embodiment in that a storage unit 121 is used in place of the storage unit 111. The storage unit 121 stores replacement rules for replacing words with placeholder characters that carry part-of-speech information.

[0053] Next, the operation of the embodiment configured in this way is described with reference to the explanatory diagrams of FIGS. 10 to 13. FIGS. 10 to 13 correspond to FIGS. 2 to 5, respectively.

[0054] FIG. 10 shows an example of the replacement rules stored in the storage unit 121. In the example of FIG. 10, the replacement rules are stored as sets of a part of speech, a headword, and a mask word. FIG. 10 shows an example in which 11 records are stored. In FIG. 10, the * (asterisk) stands for any word; for example, "*県" denotes all headwords whose last character is "県" (prefecture). The mask word in FIG. 10 is the placeholder word substituted when a word in the morphological analysis result matches the word specified by the part of speech and headword.

【００５５】図１０の例では、例えば、品詞が地名のう
ち「県」を最後の文字とする全ての見出し（単語）につ
いては、「□県」という単語に置き換えその品詞を地名
にすることを示している。なお、図１０の置換ルールは
例えばレコードの上側のものから順に実行されるものと
し、例えば、品詞が地名の＊で表される見出しよりも、
＊都，＊道，…等の上側のレコードに記述されたルール
が先に実行される。In the example of FIG. 10, for instance, every headword whose part of speech is place name and whose last character is 「県」 is replaced with the word 「□県」, and its part of speech is set to place name. The replacement rules of FIG. 10 are executed, for example, in order from the topmost record; thus the rules in the upper records, such as ＊都, ＊道, and so on, are executed before the rule whose part of speech is place name and whose headword is ＊.
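The top-to-bottom, first-match behaviour described above can be sketched as follows. This is only an illustrative sketch: the rule table is a hypothetical subset and not the actual contents of FIG. 10, and the names `Rule`, `headword_matches`, and `replace_word` are assumptions introduced for the example.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    pos: str       # part of speech the record applies to
    headword: str  # "*" = any word, "*X" = any word whose last characters are X
    mask: str      # placeholder word that replaces the matched word


# Hypothetical subset of a FIG. 10 style table; specific records come first.
RULES: List[Rule] = [
    Rule("地名", "*都", "□都"),
    Rule("地名", "*県", "□県"),
    Rule("地名", "*", "□"),
    Rule("数字", "*", "◇"),
]


def headword_matches(pattern: str, word: str) -> bool:
    """Match a headword pattern in which a leading '*' stands for any prefix."""
    if pattern == "*":
        return True
    if pattern.startswith("*"):
        return word.endswith(pattern[1:])
    return word == pattern


def replace_word(word: str, pos: str, rules: List[Rule] = RULES) -> Tuple[str, str]:
    """Apply the first matching record, scanning from the top of the table."""
    for rule in rules:
        if rule.pos == pos and headword_matches(rule.headword, word):
            return rule.mask, pos  # the part of speech is carried over unchanged
    return word, pos  # no record matched: the word is not secret information
```

Because the scan stops at the first match, placing the generic `*` record above `*県` would turn every place name into 「□」, which is why record order matters.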

【００５６】また、図１０においても、形態素の品詞情
報に基づいた置換ルールが記述されているが、見出しあ
るいは見出しと品詞の組み合わせによって表現すること
も可能である。In FIG. 10 as well, the replacement rules are written in terms of the part-of-speech information of morphemes, but they can also be expressed by the headword alone or by a combination of headword and part of speech.

【００５７】いま、テキスト抽出部１０１によって抽出
されたテキストの一部が図１１に示すものであるものと
して説明する。図１１は抽出されたテキスト中に、日本
一郎さんの住所が「東京都港区芝浦１−２−３４５」で
あること、日本次郎さんの住所が「神奈川県川崎市幸区
小向６−７−８」であること、日本三郎さんの住所が
「神奈川県川崎市幸区幸町９−１」であることを示す表
記が存在したことを示している。Assume now that part of the text extracted by the text extraction unit 101 is as shown in FIG. 11. FIG. 11 shows that the extracted text contained entries indicating that the address of Ichiro Nihon is 「東京都港区芝浦１−２−３４５」 (1-2-345 Shibaura, Minato-ku, Tokyo), that the address of Jiro Nihon is 「神奈川県川崎市幸区小向６−７−８」 (6-7-8 Komukai, Saiwai-ku, Kawasaki-shi, Kanagawa Prefecture), and that the address of Saburo Nihon is 「神奈川県川崎市幸区幸町９−１」 (9-1 Saiwaicho, Saiwai-ku, Kawasaki-shi, Kanagawa Prefecture).

【００５８】この場合には、テキスト解析部１０２は、
例えば図１２に示す解析結果を得る。即ち、図１２に示
すように、抽出された図１１の各単語列は、夫々単語に
分割されて、その品詞が付される。例えば、図１１の
「日本一郎：東京都港区芝浦１−２−３４５」は、日本
／一郎／：／東京都／港区／１／−／２／−／３／４／
５の１２個の単語からなり、「日本」は品詞が人名：姓
氏で、「一郎」は品詞が人名：名前で、「：」は品詞が
記号で、「東京都」は品詞が地名で、「港区」は品詞が
地名で、「１」は品詞が数字で、「−」は品詞が記号
で、「２」は品詞が数字で、「−」は品詞が記号で、
「３」は品詞が数字で、「４」は品詞が数字で、「５」
は品詞が数字であることが解析結果によって得られる。In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 12. That is, as shown in FIG. 12, each extracted word string of FIG. 11 is divided into words, and a part of speech is attached to each word. For example, 「日本一郎：東京都港区芝浦１−２−３４５」 in FIG. 11 consists of the 12 words 日本／一郎／：／東京都／港区／１／−／２／−／３／４／５, and the analysis result gives the following parts of speech: 「日本」 is person name: surname; 「一郎」 is person name: given name; 「：」 is symbol; 「東京都」 and 「港区」 are place name; 「１」, 「２」, 「３」, 「４」, and 「５」 are number; and each 「−」 is symbol.

【００５９】これらの形態素解析結果は秘匿情報置換部
１１０に与えられる。秘匿情報置換部１１０は記憶部１
２１から置換ルールを読出し、置換ルールに従って秘匿
情報の置換を行う。置換ルールが図１０に示すものであ
る場合には、秘匿情報置換部１１０は、テキスト解析部
１０２の解析結果のうち、品詞が人名：姓氏の全ての単
語を「○」に置き換えその品詞を人名：姓氏とし、品詞
が人名：名前の全ての単語を「（黒丸）」に置き換えそ
の品詞を人名：名前とし、品詞が数字の全ての単語を
「◇」に置き換えその品詞を数字とし、同様に、地名に
ついても図１０の置換ルールに従って置換を行う。こう
して、秘匿情報置換部１１０からは図１３に示す形態素
解析結果が出力される。These morphological analysis results are given to the secret information replacing unit 110. The secret information replacing unit 110 reads the replacement rules from the storage unit 121 and replaces the secret information according to them. When the replacement rules are those shown in FIG. 10, the secret information replacing unit 110 replaces, in the analysis result of the text analysis unit 102, every word whose part of speech is person name: surname with 「○」 and sets its part of speech to person name: surname; replaces every word whose part of speech is person name: given name with a black circle and sets its part of speech to person name: given name; and replaces every word whose part of speech is number with 「◇」 and sets its part of speech to number. Place names are likewise replaced according to the replacement rules of FIG. 10. In this way, the secret information replacing unit 110 outputs the morphological analysis result shown in FIG. 13.
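As a rough illustration of this step, the sketch below applies placeholder symbols to a morphological analysis result represented as (headword, part of speech) pairs. The pair representation, the `POS_TO_SYMBOL` table, and the function name `replace_secret_words` are assumptions made for the example; only the surname, given-name, and number symbols named in the text are modelled, and place names are omitted.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (headword, part of speech)

# Placeholder symbols per part of speech, following the description of FIG. 10:
# surname -> ○, given name -> black circle (written here as ●), number -> ◇.
POS_TO_SYMBOL: Dict[str, str] = {
    "人名：姓氏": "○",
    "人名：名前": "●",
    "数字": "◇",
}


def replace_secret_words(morphemes: List[Morpheme]) -> List[Morpheme]:
    """Replace each secret word with its placeholder and keep its part of speech."""
    return [(POS_TO_SYMBOL.get(pos, word), pos) for word, pos in morphemes]


# Part of an analysis result in the style of FIG. 12 (hypothetical excerpt).
analysis: List[Morpheme] = [
    ("日本", "人名：姓氏"),
    ("一郎", "人名：名前"),
    ("：", "記号"),
    ("１", "数字"),
    ("−", "記号"),
    ("２", "数字"),
]
replaced = replace_secret_words(analysis)
```

The sequence of parts of speech in `replaced` is identical to that in `analysis`, which is the property that lets later stages keep using part-of-speech information.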

【００６０】図１２と図１３との比較から明らかなよう
に、形態素解析結果のうち置換ルールで規定された単語
については、○、（黒丸）、◇、（黒菱）形の文字に置
換され、或いは、□、（黒四角）と「都」、「道」、
「府」、「県」、「市」、「区」、「町」の文字を付加
した単語に置換される。例えば、図１２中の「日本次
郎」さんの住所である「神奈川県川崎市幸区小向６−７
−８」は「□県（黒四角）市（黒四角）区（黒菱形）◇
−◇−◇◇◇」に置換される。即ち、例えば、病院のカ
ルテ等に含まれる医療関係の情報のうち人名，地名等の
秘匿すべき個人情報については、品詞の情報が付加され
た伏せ字に置き換えられるので、個人情報が形態素解析
結果として出力されることが防止される。また、伏せ字
には置換前の単語と同一の品詞の情報が付加されるの
で、音声認識処理に際して、品詞情報を活用することが
可能である。As is clear from a comparison of FIG. 12 and FIG. 13, each word in the morphological analysis result that is covered by a replacement rule is replaced either with one of the symbols ○, (black circle), ◇, or (black diamond), or with a word formed by appending 「都」, 「道」, 「府」, 「県」, 「市」, 「区」, or 「町」 to □ or (black square). For example, 「神奈川県川崎市幸区小向６−７−８」, the address of Jiro Nihon in FIG. 12, is replaced with 「□県（黒四角）市（黒四角）区（黒菱形）◇−◇−◇◇◇」. Thus, for example, among the medical information contained in hospital medical records and the like, personal information that should be kept secret, such as person names and place names, is replaced with placeholder symbols carrying part-of-speech information, so the personal information is prevented from being output in the morphological analysis result. Moreover, because each placeholder symbol carries the same part-of-speech information as the word it replaced, the part-of-speech information can still be used in speech recognition processing.

【００６１】他の作用は第２の実施の形態と同様であ
る。Other functions are similar to those of the second embodiment.

【００６２】このように、本実施の形態においても、第
２の実施の形態と同様の効果を得ることができる。As described above, also in this embodiment, the same effect as that of the second embodiment can be obtained.

【００６３】図１４は本発明の第４の実施の形態を示す
ブロック図である。図１４において図６と同一の構成要
素には同一符号を付して説明を省略する。FIG. 14 is a block diagram showing a fourth embodiment of the present invention. In FIG. 14, components identical to those in FIG. 6 are given the same reference numerals, and their description is omitted.

【００６４】第２及び図３の実施の形態においては秘匿
情報を予め定められた典型的な語又は伏せ字に置換する
ことで、秘匿情報の漏出を防止した。しかしながら、秘
匿情報の単語が、伏せ字又は典型的な単語に置き換えら
れることから、形態素解析結果には個人名等が含まれ
ず、この形態素解析結果を基に作成した統計的言語モデ
ルを用いた場合には、個人名等を正しく音声認識処理す
ることが困難となることがある。そこで、本実施の形態
は、秘匿情報の単語を同一品詞のランダムな単語に置き
換えることにより、秘匿情報の漏出を防止すると共に、
個人名等であっても正しく音声認識処理することを可能
にしたものである。In the second and third embodiments, leakage of secret information was prevented by replacing it with predetermined typical words or placeholder symbols. However, because the words of secret information are replaced with placeholder symbols or typical words, the morphological analysis result contains no personal names or the like, and a statistical language model built from it may make it difficult to recognize personal names and the like correctly in speech recognition processing. The present embodiment therefore replaces each word of secret information with a random word of the same part of speech, which prevents leakage of the secret information while still allowing personal names and the like to be recognized correctly.

【００６５】本実施の形態は秘匿情報置換部１１０及び
記憶部１１１に夫々代えて秘匿情報置換部１３０及び記
憶部１３１を採用した点が第２の実施の形態と異なる。
記憶部１３１は秘匿情報を同一品詞の他のランダムな単
語に置き換えるための置換ルールを記憶している。This embodiment differs from the second embodiment in that the secret information replacing unit 110 and the storage unit 111 are replaced with a secret information replacing unit 130 and a storage unit 131, respectively.
The storage unit 131 stores a replacement rule for replacing the confidential information with another random word having the same part of speech.

【００６６】次に、このように構成された実施の形態の
作用について図１５乃至図１８の説明図を参照して説明
する。図１５乃至図１８は夫々図２乃至図５に対応した
ものである。Next, the operation of the embodiment configured as described above will be described with reference to the explanatory diagrams of FIGS. 15 to 18. FIGS. 15 to 18 correspond to FIGS. 2 to 5, respectively.

【００６７】図１５は記憶部１３１に記憶されている置
換ルールの一例を示している。図１５の例では、置換ル
ールは品詞、見出し及びランダム単語の組の情報として
蓄積されている。図１５は１１個のレコードが蓄積され
た例を示している。図１５において＊（アスタリスク）
は、全ての単語を意味する。図１５のランダム単語は、
形態素解析結果の単語が品詞及び見出しで規定された単
語である場合に、置き換える単語を示しており、Ｒａｎ
ｄ（）はランダムに抽出する単語であることを示してい
る。FIG. 15 shows an example of the replacement rules stored in the storage unit 131. In the example of FIG. 15, each replacement rule is stored as a set of a part of speech, a headword, and a random word. FIG. 15 shows an example in which 11 records are stored. In FIG. 15, ＊ (asterisk) stands for any word. The random word in FIG. 15 is the word that replaces a word in the morphological analysis result when that word matches the part of speech and headword of the record, and Ｒａｎｄ（） indicates that the replacement word is drawn at random.

【００６８】図１５の例では、例えば、品詞が数字の０
−９は、夫々品詞が数字である０−９のランダムな数字
に置き換えられることを示している。また、図１５にお
いても、形態素の品詞情報に基づいた置換ルールが記述
されているが、見出しあるいは見出しと品詞の組み合わ
せによって表現することも可能である。In the example of FIG. 15, for instance, the digits 0 to 9 whose part of speech is number are each replaced with a random digit from 0 to 9 whose part of speech is number. In FIG. 15 as well, the replacement rules are written in terms of the part-of-speech information of morphemes, but they can also be expressed by the headword alone or by a combination of headword and part of speech.

【００６９】いま、テキスト抽出部１０１によって抽出
されたテキストの一部が図１６に示すものであるものと
して説明する。図１６は抽出されたテキスト中に、日本
一郎さんに関する何らかの数値が「１２３−１２３４」
であること、日本次郎さんに関する何らかの数値が「１
２３−４５６７」であること、日本三郎さんに関する何
らかの数値が「９９９−９９９９」であることを示す表
記が存在したことを示している。なお、これらの数値
は、各種測定値、金額、年齢等の各種情報である。Assume now that part of the text extracted by the text extraction unit 101 is as shown in FIG. 16. FIG. 16 shows that the extracted text contained entries indicating that some numerical value relating to Ichiro Nihon is "123-1234", that some numerical value relating to Jiro Nihon is "123-4567", and that some numerical value relating to Saburo Nihon is "999-9999". These numerical values are various kinds of information such as measured values, amounts of money, and ages.

【００７０】この場合には、テキスト解析部１０２は、
例えば図１７に示す解析結果を得る。即ち、図１７に示
すように、抽出された図１６の各単語列は、夫々単語に
分割されて、その品詞が付される。例えば、図１６の
「日本一郎：１２３−１２３４」は、日本／一郎／：／
１／２／３／−／１／２／３／４／５の１２個の単語か
らなり、「日本」は品詞が人名：姓氏で、「一郎」は品
詞が人名：名前で、「：」は品詞が記号で、「１」〜
「３」は品詞が数字で、「−」は品詞が記号で、「１」
〜「４」は品詞が数字であることが解析結果によって得
られる。In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 17. That is, as shown in FIG. 17, each extracted word string of FIG. 16 is divided into words, and a part of speech is attached to each word. For example, 「日本一郎：１２３−１２３４」 in FIG. 16 consists of the 12 words 日本／一郎／：／１／２／３／−／１／２／３／４／５, and the analysis result shows that 「日本」 is person name: surname, 「一郎」 is person name: given name, 「：」 is symbol, 「１」 to 「３」 are number, 「−」 is symbol, and 「１」 to 「４」 are number.

【００７１】これらの形態素解析結果は秘匿情報置換部
１３０に与えられる。秘匿情報置換部１３０は記憶部１
３１から置換ルールを読出し、置換ルールに従って秘匿
情報の置換を行う。置換ルールが図１５に示すものであ
る場合には、秘匿情報置換部１３０は、テキスト解析部
１０２の解析結果のうち、品詞が人名：姓氏及び人名：
名前の全ての単語をランダムに抽出した同一品詞の他の
単語に置き換えて同一の品詞を付し、品詞が数字の全て
の単語についてはランダムに抽出した品詞が数字の他の
数字に置き換える。こうして、秘匿情報置換部１３０か
らは図１８に示す形態素解析結果が出力される。These morphological analysis results are given to the secret information replacing unit 130. The secret information replacing unit 130 reads the replacement rules from the storage unit 131 and replaces the secret information according to them. When the replacement rules are those shown in FIG. 15, the secret information replacing unit 130 replaces, in the analysis result of the text analysis unit 102, every word whose part of speech is person name: surname or person name: given name with another, randomly drawn word of the same part of speech and attaches that same part of speech to it, and replaces every word whose part of speech is number with another, randomly drawn number. In this way, the secret information replacing unit 130 outputs the morphological analysis result shown in FIG. 18.
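The random replacement can be sketched along the following lines. The candidate pools, the pair representation of morphemes, and the function name `rand_replace` are hypothetical; in particular, excluding the original word from the candidates is one reading of "another word" and is an assumption of this sketch.

```python
import random
from typing import Dict, List, Sequence, Tuple

Morpheme = Tuple[str, str]  # (headword, part of speech)

# Hypothetical candidate pools per part of speech; a real system would draw
# from a database suited to the intended use.
POOLS: Dict[str, Sequence[str]] = {
    "人名：姓氏": ["鈴木", "佐藤", "田中"],
    "人名：名前": ["良子", "太郎", "花子"],
    "数字": list("０１２３４５６７８９"),
}


def rand_replace(morphemes: List[Morpheme], rng: random.Random) -> List[Morpheme]:
    """Replace each secret word with a random other word of the same part of speech."""
    out: List[Morpheme] = []
    for word, pos in morphemes:
        pool = POOLS.get(pos)
        if pool is None:
            out.append((word, pos))  # not secret information: keep as is
            continue
        candidates = [w for w in pool if w != word]  # draw a different word
        out.append((rng.choice(candidates), pos))    # same part of speech
    return out


sample: List[Morpheme] = [("日本", "人名：姓氏"), ("：", "記号"), ("１", "数字")]
result = rand_replace(sample, random.Random(0))
```

Passing in a seeded `random.Random` keeps the sketch reproducible for testing; a deployment would more likely use an unseeded generator.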

【００７２】図１７と図１８との比較から明らかなよう
に、形態素解析結果のうち置換ルールで規定された単語
については、ランダムに抽出した同一品詞の他の単語に
置換される。例えば、図１７中の「日本一郎」さんの
「１２３−１２３４」は「鈴木良子」さんの「３１３−
６９２４」に置換される。即ち、例えば、病院のカルテ
等に含まれる医療関係の情報のうち人名及び数値等の秘
匿すべき個人情報については、ランダムに抽出された同
一品詞の他の単語に置き換えられるので、個人情報が形
態素解析結果として出力されることが防止される。ま
た、置換後の単語としてはランダムに抽出した同一品詞
の単語を用いるので、置換後の形態素解析結果には名前
や数値等の情報が含まれる。従って、この形態素解析結
果を基に作成した統計的言語モデルを用いた場合には、
音声認識処理に際して、名前や数値等を確実に認識する
ことが可能となる。As is clear from a comparison of FIG. 17 and FIG. 18, each word in the morphological analysis result that is covered by a replacement rule is replaced with another, randomly drawn word of the same part of speech. For example, "123-1234" of Ichiro Nihon in FIG. 17 is replaced with "313-6924" of Ryoko Suzuki (鈴木良子). Thus, for example, among the medical information contained in hospital medical records and the like, personal information that should be kept secret, such as person names and numerical values, is replaced with other randomly drawn words of the same part of speech, so the personal information is prevented from being output in the morphological analysis result. Moreover, because randomly drawn words of the same part of speech are used as the replacements, the morphological analysis result after replacement still contains information such as names and numerical values. Therefore, when a statistical language model built from this morphological analysis result is used, names, numerical values, and the like can be recognized reliably in speech recognition processing.

【００７３】他の作用は第２の実施の形態と同様であ
る。Other functions are similar to those of the second embodiment.

【００７４】このように、本実施の形態においても、第
２及び第３の実施の形態と同様の効果を得ることができ
る。更に、置換後の単語としてランダムに抽出した同一
品詞の単語を用いるので、実際に利用される名前や数値
等の情報を含む統計的言語モデルを作成することがで
き、音声認識精度が低下することを防止することができ
る。また、置換後の単語を、利用形態に応じたデータベ
ースから抽出することにより、利用形態に適した統計的
言語モデルを作成することができ、音声認識精度を一層
向上させることができる。As described above, also in this embodiment, the same effects as those of the second and third embodiments can be obtained. Furthermore, since the words of the same part of speech that are randomly extracted are used as the words after replacement, it is possible to create a statistical language model that includes information such as names and numerical values that are actually used, and speech recognition accuracy is reduced. Can be prevented. Further, by extracting the word after the replacement from the database according to the usage pattern, a statistical language model suitable for the usage pattern can be created, and the voice recognition accuracy can be further improved.

【００７５】図１９は本発明の第５の実施の形態を示す
ブロック図である。図１９において図１と同一の構成要
素には同一符号を付して説明を省略する。FIG. 19 is a block diagram showing a fifth embodiment of the present invention. In FIG. 19, components identical to those in FIG. 1 are given the same reference numerals, and their description is omitted.

【００７６】上記第１乃至第４の実施の形態において
は、統計的言語モデル用のテキストを収集する過程で、
統計情報集計に使用するテキスト中の秘匿情報に関する
表記情報を事前に除去した。これに対し、本実施の形態
は既存のコーパス（統計的言語モデル用のテキスト収集
後のテキスト）から統計的言語モデルを作成する過程
で、統計情報集計に使用するテキスト中の秘匿情報に関
する表記情報を事前に除去するようにしたものである。In the first to fourth embodiments described above, notation information relating to secret information was removed in advance from the text used for statistics aggregation, in the course of collecting text for the statistical language model. In contrast, the present embodiment removes such notation information in advance in the course of creating a statistical language model from an existing corpus (text already collected for a statistical language model).

【００７７】本実施の形態はデータベース１００に代え
てコーパス１４０を採用し、テキスト抽出部１０１に代
えてコーパス入力部１４１を採用した点が図１の実施の
形態と異なる。The present embodiment differs from the embodiment shown in FIG. 1 in that a corpus 140 is adopted in place of the database 100 and a corpus input unit 141 is adopted in place of the text extraction unit 101.

【００７８】コーパス１４０は、統計的言語モデル作成
用に収集されたテキストが集積されたものである。コー
パス１４０中には秘匿情報が含まれていることがある。
コーパス入力部１４０は、コーパス１４１を処理対象と
して、コーパス１４１中のテキストをテキスト解析部１
０２に出力する。The corpus 140 is an accumulation of text collected for creating a statistical language model. The corpus 140 may contain secret information. The corpus input unit 141 takes the corpus 140 as its processing target and outputs the text in the corpus 140 to the text analysis unit 102.

【００７９】このように構成された実施の形態において
は、既に作成されている既存のコーパス１４０からテキ
ストを抽出してテキスト解析部１０２に出力する。これ
により、以後、第１の実施の形態と同様の手法によっ
て、秘匿情報をマスクした形態素解析結果を得、秘匿情
報を排除したコーパスを蓄積することができる。In the embodiment configured as described above, the text is extracted from the existing corpus 140 already created and output to the text analysis unit 102. Accordingly, thereafter, by the same method as that of the first embodiment, it is possible to obtain a morphological analysis result in which the secret information is masked and accumulate the corpus from which the secret information is excluded.
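The flow from an existing corpus to aggregated statistics might look like the sketch below. Everything here is illustrative: `toy_analyze` stands in for a real morphological analyzer and simply reads pre-tagged "word/part-of-speech" tokens, the `<MASK>` token and the part-of-speech labels other than the person-name ones are assumptions, and counting adjacent word pairs is just one simple form of statistics aggregation.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Morpheme = Tuple[str, str]  # (headword, part of speech)

SECRET_POS = {"人名：姓氏", "人名：名前"}  # parts of speech treated as secret
MASK = "<MASK>"                            # hypothetical mask token


def toy_analyze(text: str) -> List[Morpheme]:
    """Stand-in for the text analysis unit: parse 'word/pos' tokens."""
    result: List[Morpheme] = []
    for token in text.split():
        word, pos = token.split("/", 1)
        result.append((word, pos))
    return result


def mask(morphemes: List[Morpheme]) -> List[Morpheme]:
    """Stand-in for the secret information mask unit."""
    return [(MASK if pos in SECRET_POS else word, pos) for word, pos in morphemes]


def process_corpus(
    texts: Iterable[str],
    analyze: Callable[[str], List[Morpheme]] = toy_analyze,
) -> Tuple[List[List[Morpheme]], Counter]:
    """Corpus input -> analysis -> masking -> accumulation -> pair counts."""
    accumulated = [mask(analyze(text)) for text in texts]
    pair_counts: Counter = Counter()
    for sentence in accumulated:
        words = [word for word, _ in sentence]
        pair_counts.update(zip(words, words[1:]))  # adjacent word pairs
    return accumulated, pair_counts


existing_corpus = [
    "日本/人名：姓氏 さん/接尾辞 の/助詞 住所/名詞",
    "鈴木/人名：姓氏 さん/接尾辞 の/助詞 年齢/名詞",
]
accumulated, pair_counts = process_corpus(existing_corpus)
```

Since masking happens before accumulation, neither the stored corpus nor the pair counts contain the original surnames.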

【００８０】他の作用は図１の実施の形態と同様であ
る。Other operations are similar to those of the embodiment shown in FIG.

【００８１】このように、本実施の形態においては既存
のコーパスを利用する場合でも、秘匿情報を排除したコ
ーパスに変換することができ、秘匿情報を漏出させるこ
となく、統計的言語モデルを利用した音声認識処理が可
能にすることもできる。As described above, in the present embodiment, even when an existing corpus is used, it can be converted into a corpus from which secret information has been excluded, so that speech recognition processing using a statistical language model is possible without leaking secret information.

【００８２】上記第５の実施の形態は上記第２乃至第４
の実施の形態にも適用することができる。図２０乃至図
２２は第５の実施の形態を図６、図９及び図１４に示す
第２乃至第４の実施の形態に適用した場合のブロック図
である。The fifth embodiment can also be applied to the second to fourth embodiments. FIGS. 20 to 22 are block diagrams for the cases where the fifth embodiment is applied to the second to fourth embodiments shown in FIGS. 6, 9, and 14.

【００８３】これらの図２０乃至図２２においては、デ
ータベース１００に代えてコーパス１４０を用い、テキ
スト抽出部１０１に代えてコーパス入力部１４１を採用
した点が、図６、図９及び図１４と異なる。FIGS. 20 to 22 differ from FIGS. 6, 9, and 14 in that the corpus 140 is used in place of the database 100 and the corpus input unit 141 is used in place of the text extraction unit 101.

【００８４】他の構成及び作用は、夫々第２乃至第４の
実施の形態と同様である。Other configurations and operations are the same as those of the second to fourth embodiments, respectively.

【００８５】このように、これらの図２０乃至図２２に
示す例においても、第１乃至第５の実施の形態と同様の
効果を得ることができる。As described above, also in the examples shown in FIGS. 20 to 22, the same effects as those of the first to fifth embodiments can be obtained.

【００８６】[0086]

【発明の効果】以上説明したように本発明によれば、統
計的言語モデル用のテキストを収集する過程又は収集後
のテキストから統計的言語モデルを作成する過程におい
て、統計情報集計に使用するテキスト中の個人情報に関
する表記情報を除去することにより、プライベートな個
人情報が誤って認識結果中に出力されることを防止する
ことができるという効果を有する。As described above, according to the present invention, notation information relating to personal information is removed from the text used for statistics aggregation, either in the course of collecting text for a statistical language model or in the course of creating a statistical language model from already collected text, which has the effect of preventing private personal information from being erroneously output in a recognition result.

[Brief description of drawings]

【図１】本発明の第１の実施の形態に係る統計的言語モ
デルを作成するためのコーパス処理装置を示すブロック
図。FIG. 1 is a block diagram showing a corpus processing device for creating a statistical language model according to a first embodiment of the present invention.

【図２】図１中の記憶部１０４に記憶されているマスク
ルールを説明するための説明図。FIG. 2 is an explanatory diagram for explaining the mask rules stored in the storage unit 104 in FIG. 1.

【図３】第１の実施の形態の作用を説明するための説明
図。FIG. 3 is an explanatory diagram for explaining the operation of the first embodiment.

【図４】第１の実施の形態の作用を説明するための説明
図。FIG. 4 is an explanatory diagram for explaining the operation of the first embodiment.

【図５】第１の実施の形態の作用を説明するための説明
図。FIG. 5 is an explanatory diagram for explaining the operation of the first embodiment.

【図６】本発明の第２の実施の形態を示すブロック図。FIG. 6 is a block diagram showing a second embodiment of the present invention.

【図７】図６中の記憶部１１１に記憶されている置換ル
ールを説明するための説明図。FIG. 7 is an explanatory diagram for explaining the replacement rules stored in the storage unit 111 in FIG. 6.

【図８】第２の実施の形態の作用を説明するための説明
図。FIG. 8 is an explanatory diagram for explaining the operation of the second embodiment.

【図９】本発明の第３の実施の形態を示すブロック図。FIG. 9 is a block diagram showing a third embodiment of the present invention.

【図１０】図９中の記憶部１２１に記憶されている置換
ルールを説明するための説明図。FIG. 10 is an explanatory diagram illustrating replacement rules stored in a storage unit 121 in FIG. 9.

【図１１】第３の実施の形態の作用を説明するための説
明図。FIG. 11 is an explanatory diagram for explaining the operation of the third embodiment.

【図１２】第３の実施の形態の作用を説明するための説
明図。FIG. 12 is an explanatory diagram for explaining the operation of the third embodiment.

【図１３】第３の実施の形態の作用を説明するための説
明図。FIG. 13 is an explanatory diagram for explaining the operation of the third embodiment.

【図１４】本発明の第４の実施の形態を示すブロック
図。FIG. 14 is a block diagram showing a fourth embodiment of the present invention.

【図１５】図１４中の記憶部１３１に記憶されている置
換ルールを説明するための説明図。FIG. 15 is an explanatory diagram for explaining the replacement rules stored in the storage unit 131 in FIG. 14.

【図１６】第４の実施の形態の作用を説明するための説
明図。FIG. 16 is an explanatory diagram for explaining the operation of the fourth embodiment.

【図１７】第４の実施の形態の作用を説明するための説
明図。FIG. 17 is an explanatory diagram for explaining the operation of the fourth embodiment.

【図１８】第４の実施の形態の作用を説明するための説
明図。FIG. 18 is an explanatory diagram for explaining the operation of the fourth embodiment.

【図１９】本発明の第５の実施の形態を示すブロック
図。FIG. 19 is a block diagram showing a fifth embodiment of the present invention.

【図２０】第５の実施の形態の変形例を示すブロック
図。FIG. 20 is a block diagram showing a modified example of the fifth embodiment.

【図２１】第５の実施の形態の変形例を示すブロック
図。FIG. 21 is a block diagram showing a modification of the fifth embodiment.

【図２２】第５の実施の形態の変形例を示すブロック
図。FIG. 22 is a block diagram showing a modification of the fifth embodiment.

[Explanation of symbols]

１００…データベース、１０１…テキスト抽出部、１０
２…テキスト解析部、１０３…秘匿情報マスク部、１０
４，１０６，１０８…記憶部、１０５…コーパス集積
部、１０７…コーパス統計集計部。100 ... Database, 101 ... Text extraction unit, 102 ... Text analysis unit, 103 ... Secret information mask unit, 104, 106, 108 ... Storage unit, 105 ... Corpus accumulating unit, 107 ... Corpus statistics aggregating unit.

フロントページの続き (51)Int.Cl.⁷ 識別記号 ＦＩ テーマコード(参考) Ｇ１０Ｌ 3/00 ５２１Ｃ Continuation of front page: (51) Int.Cl.⁷, identification symbol, FI, theme code (reference): G10L 3/00 521C

Claims

[Claims]

1. A corpus processing device for creating a statistical language model, comprising: a text analysis unit that morphologically analyzes text data and outputs a morphological analysis result; a secret information mask unit that masks the morphological analysis result according to a predetermined mask rule for masking words included in secret information; a corpus accumulating unit that accumulates, as a corpus, the morphological analysis result masked by the secret information mask unit; and a corpus statistics aggregating unit that aggregates statistical information from the corpus accumulated by the corpus accumulating unit.

2. A corpus processing device for creating a statistical language model, comprising: a text analysis unit that morphologically analyzes text data and outputs a morphological analysis result; a secret information replacing unit that replaces words in the morphological analysis result according to a predetermined replacement rule for replacing words included in secret information with other words; a corpus accumulating unit that accumulates, as a corpus, the morphological analysis result replaced by the secret information replacing unit; and a corpus statistics aggregating unit that aggregates statistical information from the corpus accumulated by the corpus accumulating unit.

3. The corpus processing device for creating a statistical language model according to claim 1, wherein the text data given to the text analysis unit is extracted from a predetermined database.

4. The corpus processing device for creating a statistical language model according to claim 1, wherein the text data given to the text analysis unit is extracted from a predetermined corpus.

5. The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacing unit replaces a word included in the secret information with a typical word.

6. The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacing unit replaces a word included in the secret information with a placeholder symbol to which part-of-speech information is added.

7. The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacing unit replaces a word included in the secret information with a random word.

8. The corpus processing device for creating a statistical language model according to claim 1, wherein the mask rule or the replacement rule is a rule for masking or replacing, as a word included in the secret information, a word whose part of speech is at least one of a person's name, a place name, and a number.

9. A corpus processing method for creating a statistical language model, comprising: a text analysis procedure of morphologically analyzing text data and outputting a morphological analysis result; a secret information mask procedure of masking the morphological analysis result according to a predetermined mask rule for masking words included in secret information; a corpus accumulation procedure of accumulating the masked morphological analysis result as a corpus; and a corpus statistics aggregation procedure of aggregating statistical information from the accumulated corpus.

10. A corpus processing method for creating a statistical language model, comprising: a text analysis procedure of morphologically analyzing text data and outputting a morphological analysis result; a secret information replacement procedure of replacing words in the morphological analysis result according to a predetermined replacement rule for replacing words included in secret information with other words; a corpus accumulation procedure of accumulating the replaced morphological analysis result as a corpus; and a corpus statistics aggregation procedure of aggregating statistical information from the accumulated corpus.

11. A corpus processing program for creating a statistical language model, the program causing a computer to execute: text analysis processing of morphologically analyzing text data and outputting a morphological analysis result; secret information mask processing of masking the morphological analysis result according to a predetermined mask rule for masking words included in secret information; corpus accumulation processing of accumulating the masked morphological analysis result as a corpus; and corpus statistics aggregation processing of aggregating statistical information from the accumulated corpus.

12. A corpus processing program for creating a statistical language model, the program causing a computer to execute: text analysis processing of morphologically analyzing text data and outputting a morphological analysis result; secret information replacement processing of replacing words in the morphological analysis result according to a predetermined replacement rule for replacing words included in secret information with other words; corpus accumulation processing of accumulating the replaced morphological analysis result as a corpus; and corpus statistics aggregation processing of aggregating statistical information from the accumulated corpus.