JP3725470B2 - Corpus processing apparatus, method, and program for creating statistical language model

Info

Publication number: JP3725470B2
Application number: JP2001401616A
Authority: JP
Inventors: 尚義永江
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-12-28
Filing date: 2001-12-28
Publication date: 2005-12-14
Anticipated expiration: 2021-12-28
Also published as: JP2003202893A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識処理に際して秘匿すべき個人情報等の流出を防止するようにした統計的言語モデルを作成するためのコーパス処理装置及び方法並びにプログラムに関する。
【０００２】
【従来の技術】
近年、音声で日本語の文章を入力することができる日本語ディクテーションシステムが実用化されてきた。このシステムでは音響的な正しさだけでなく、統計的言語モデルと呼ばれる単語組の出現確率の情報を使用することで高い認識性能を実現している。ここで、統計的言語モデルを構成する単語組の出現確率は、コーパスと呼ばれる大量のテキストに対する統計的な手法を使用した統計情報の集計によって得られる。
【０００３】
従って、統計的言語モデルにおける単語組の出現確率は、言語モデルを作成するために使用したコーパス（テキスト）中の文章の内容に強く依存することになる。即ち、ディクテーションシステムでは、統計的言語モデル作成に使用したテキスト中の文章に出現する表現については認識しやすく、逆に、出現しない表現については認識しづらい。このため、利用者が統計的言語モデル作成に使用したテキスト中に出現しない文章を音声入力した場合には、テキスト中に出現する単語組に音響的に近い表記列に誤認識してしまうことも少なくない。
【０００４】
なお、これらの技術については、下記文献に詳述されている。
【０００５】
「確率モデルによる音声認識」中川聖一著電子情報通信学会 {ISBN4-88552-072-X}
「音声言語処理」北研二著森北出版 {ISBN4-627-82380-0}
「確率的言語モデル」北研二著東京大学出版会 {ISBN4-13-065404-7}
【０００６】
【発明が解決しようとする課題】
このように、コーパス（テキスト）中に出現した全ての表現は、統計的言語モデルの単語組の出現確率に直接反映される。従って、ディクテーションシステムではコーパス（テキスト）中に出現した表現が認識結果として出現しやすい。
【０００７】
このため、統計的言語モデル作成に使用したコーパス（テキスト）中に、プライベートな個人情報等が含まれている場合には、個人情報の内容が誤認識結果として出現してしまう虞があり、個人情報が他人に露見してしまう可能性がある。
【０００８】
例えば、医療関係の音声認識処理を行う場合には、医療関係のコーパスを用いた方が認識精度を向上させることができる。ところが、例えばカルテ等のように、医療関係のコーパス中には個人情報が含まれることが多く、音声認識処理の結果によって個人情報が露見してしまうことがあるという問題があった。
【０００９】
本発明は、統計的言語モデル用のテキストを収集する過程又は収集後のテキストから統計的言語モデルを作成する過程において、統計情報集計に使用するテキスト中の秘匿情報に関する表記情報を除去することにより、プライベートな個人情報が誤って認識結果中に出力されることを防止することができる統計的言語モデルを作成するためのコーパス処理装置及び方法並びにプログラムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
本発明の請求項１に係る統計的言語モデルを作成するためのコーパス処理装置は、テキストデータを形態素解析して形態素解析結果を出力するテキスト解析部と、秘匿情報に含まれる単語をマスクするための所定のマスクルールに従って、前記形態素解析結果をマスクする秘匿情報マスク部と、前記秘匿情報マスク部によってマスクされた形態素解析結果をコーパスとして集積するコーパス集積部と、前記コーパス集積部によって集積されたコーパスから統計情報を収集するコーパス統計集計部とを具備したものであり、
本発明の請求項２に係る統計的言語モデルを作成するためのコーパス処理装置は、テキストデータを形態素解析して形態素解析結果を出力するテキスト解析部と、秘匿情報に含まれる単語を他の単語に置換するための所定の置換ルールに従って、前記形態素解析結果を置換する秘匿情報置換部と、前記秘匿情報置換部によって置換された形態素解析結果をコーパスとして集積するコーパス集積部と、前記コーパス集積部によって集積されたコーパスから統計情報を収集するコーパス統計集計部とを具備したものである。
【００１１】
本発明の請求項１において、テキスト解析部による形態素解析結果は秘匿情報マスク部に与えられる。秘匿情報マスク部は、秘匿情報に含まれる単語をマスクするための所定のマスクルールに従って、前記形態素解析結果をマスクする。コーパス集積部は、秘匿情報に含まれる単語がマスクされた形態素解析結果をコーパスとして集積する。コーパス統計集計部は、集積されたコーパスに基づいて、統計情報を収集する。統計情報は秘匿情報がマスクされており、統計情報を用いた音声認識処理において秘匿情報が漏出することが防止される。
【００１２】
本発明の請求項２において、テキスト解析部による形態素解析結果は秘匿情報置換部に与えられる。秘匿情報置換部は、秘匿情報に含まれる単語を他の単語に置換するための所定の置換ルールに従って、前記形態素解析結果を置換する。コーパス集積部は、秘匿情報に含まれる単語が置換された形態素解析結果をコーパスとして集積する。コーパス統計集計部は、集積されたコーパスに基づいて、統計情報を収集する。統計情報は秘匿情報が他の単語に置換されており、統計情報を用いた音声認識処理において秘匿情報が漏出することが防止される。
【００１３】
なお、装置に係る本発明は方法に係る発明としても成立する。
【００１４】
また、装置に係る本発明は、コンピュータに当該発明に相当する処理を実行させるためのプログラムとしても成立する。
【００１５】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態について詳細に説明する。図１は本発明の第１の実施の形態に係る統計的言語モデルを作成するためのコーパス処理装置を示すブロック図である。
【００１６】
本実施の形態は、統計的言語モデル用のテキストを収集する過程で、統計情報集計に使用するテキスト中の個人情報等の秘匿すべき情報（以下、秘匿情報という）に関する表記情報を事前に除去することにより、プライベートな個人情報等が誤ってディクテーションシステム等の認識結果中に出力されることを防止するものである。
【００１７】
データベース１００には、各種データが蓄積されている。例えば、病院等の医療関係では、データベース１００に蓄積される情報としては、カルテやレントゲンの撮影結果等の情報が考えられる。データベース１００内の情報中には、統計的言語モデルを作成するために必要なテキストを含む文章データも含まれる。そして、文章データ中には本来秘匿すべき情報、例えば個人情報に関する表記情報が記述されていることがある。
【００１８】
テキスト抽出部１０１は、データが蓄積されているデータベース１００からコーパスとして利用可能なテキスト情報を抽出してテキスト解析部１０２に供給するようになっている。
【００１９】
テキスト解析部１０２は、テキスト抽出部１０１によって抽出されたテキストを形態素解析した結果を出力する。例えば、テキスト解析部１０２は、テキスト中の文章を語に分割し、分割した語（見出し）とその品詞との組で構成される形態素解析結果を得る。なお、ここで使用する形態素解析の手法は、テキストを見出しと品詞の組で構成される形態素に分割する公知の手法のいずれを使用してもよい。
【００２０】
テキスト解析部１０２の解析結果は秘匿情報マスク部１０３に供給される。秘匿情報マスク部１０３は記憶部１０４に記憶されているマスクルールを読出し、読出したマスクルールに従って、テキスト解析部１０２からの形態素解析結果をマスクするようになっている。マスクルールは、個人情報等の秘匿情報をコーパス作成時に削除するための情報を記述したものである。
【００２１】
図２は記憶部１０４に記憶されているマスクルールを説明するための説明図である。
【００２２】
図２の例では、マスクルールは品詞と見出しとの組の情報として蓄積されている。図２は２つのレコードが蓄積された例を示している。図２中の＊（アスタリスク）は全ての見出しについての制約条件がないこと、即ち、その品詞中に含まれる全ての単語をマスクすることを示している。図２の例は品詞が人名：姓氏の全ての見出し（単語）をマスクすると共に、品詞が人名：名前の全ての見出し（単語）をマスクする処理を示している。
【００２３】
例えば、見出しの欄にアスタリスクに代えて「田中」を記述することにより、品詞が人名：姓氏のうち「田中」という単語のみをマスクするようにすることができる。
【００２４】
なお、図２の例では、記憶部１０４に形態素の品詞情報に基づいて記述したマスクルールを記憶させた例を示しているが、マスクルールを見出し又は見出しと品詞の組み合わせによって表現することも可能である。
【００２５】
秘匿情報マスク部１０３はマスクルールに従ってマスクした形態素解析結果をコーパス集積部１０５に出力する。コーパス集積部１０５は、入力されたマスク後の形態素解析結果を記憶部１０６に与えて、統計的言語モデル作成用の秘匿情報排除済みコーパスとして蓄積させる。こうして、記憶部１０６には、マスクルールに従って秘匿情報が除去されたコーパスが蓄積される。なお、記憶部１０６に記憶させるコーパスの形式としては、形態素解析結果であってもよく、また、単なるテキストの形式であってもよい。
【００２６】
コーパス統計集計部１０７は、記憶部１０６に蓄積されたコーパスを読出して、統計的言語モデルを作成して記憶部１０８に出力する。なお、コーパス統計集計部１０７がコーパスから統計的言語モデル（Ｎ−ｇｒａｍ）を作成する手法については、公知の手法のいずれを使用してもよい。記憶部１０８は統計的言語モデルを記憶する。
【００２７】
この記憶部１０８に記憶された統計的言語モデルを利用することで、音声認識処理が行われる。この場合には、統計的言語モデルの作成の元となるコーパスには秘匿情報が含まれていないことから、データベース１００に蓄積されていた秘匿情報が音声認識結果に現れる可能性は低い。
【００２８】
次に、このように構成された実施の形態の作用について図３乃至図５の説明図を参照して説明する。
【００２９】
いま、データベース１００に医療関係の情報が蓄積されているものとして説明する。テキスト抽出部１０１はデータベース１００に蓄積されている各種情報の中からテキストを抽出してテキスト解析部１０２に出力する。テキスト解析部１０２は形態素解析によって、テキストから見出しとその品詞との組で構成される形態素解析結果を得る。
【００３０】
いま、テキスト抽出部１０１によって抽出されたテキストの一部が図３に示すものであるものとする。図３は抽出されたテキスト中に、日本一郎さんが胃癌であること、日本次郎さんが胃潰瘍であること、日本三郎さんが肺癌であることを示す表記が存在したことを示している。
【００３１】
この場合には、テキスト解析部１０２は、例えば図４に示す解析結果を得る。即ち、図４に示すように、抽出された図３の各単語列は、夫々単語に分割されて、その品詞が付される。例えば、図３の「日本一郎：胃癌」は、日本／一郎／：／胃癌の４つの単語からなり、「日本」は品詞が人名：姓氏で、「一郎」は品詞が人名：名前で、「：」は品詞が記号で、「胃癌」は品詞が名詞であることが解析結果によって得られる。
【００３２】
テキスト解析部１０２の解析結果は秘匿情報マスク部１０３に与えられる。秘匿情報マスク部１０３は記憶部１０４からマスクルールを読出す。例えば、マスクルールとして図２に示すルールが記述されているものとする。この場合には、秘匿情報マスク部１０３は、テキスト解析部１０２の解析結果のうち、品詞が人名：姓氏の全ての単語と品詞が人名：名前の全ての単語とをマスクし、マスク後の形態素解析結果をコーパス集積部１０５に出力する。
【００３３】
例えば、秘匿情報マスク部１０３がマスクルールに従ってマスクした単語を○又は（黒丸）等によって表現するものとすると、秘匿情報マスク部１０３からは図５に示す形態素解析結果が出力される。
【００３４】
図５と図３及び図４との比較から明らかなように、形態素解析結果のうち品詞が人名：姓氏と人名：名前の単語については、夫々○又は（黒丸）によってマスクされて秘匿情報マスク部１０３から出力されている。即ち、図３の例では、病院のカルテ等に含まれる医療関係の情報のうち人名等の秘匿すべき個人情報については、形態素解析結果として出力されることが防止される。
【００３５】
コーパス集積部１０５は、秘匿情報マスク部１０３によってマスクされた形態素解析結果を記憶部１０６に記憶させる。こうして、記憶部１０６には、秘匿情報が排除されたコーパスが蓄積される。コーパス統計集計部１０７は、蓄積されたコーパスに基づいて、統計的言語モデルを作成して、記憶部１０８に記憶させる。
【００３６】
記憶部１０８に蓄積された統計的言語モデルを用いて音声認識処理が行われる。記憶部１０８の統計的言語モデルは、秘匿情報が排除されたコーパスに基づいて作成されており、音声認識結果に秘匿情報が現れることが防止される。
【００３７】
このように、本実施の形態においては、データベースから抽出したテキストに基づいてコーパスを作成する際に秘匿情報を排除していることから、作成された統計的言語モデルを用いた場合には、秘匿情報が音声認識結果に現れることを防止することができる。即ち、契約や法律上の理由からセキュリティー的に厳しい管理をしなければならない個人情報等のデータのデータベースを利用する場合でも、秘匿情報の流出を完全に防止することができる。
【００３８】
図６は本発明の第２の実施の形態を示すブロック図である。図６において図１と同一の構成要素には同一符号を付して説明を省略する。
【００３９】
第１の実施の形態においては秘匿情報をマスクすることで、秘匿情報の漏出を防止した。図５に示すように、本来名前の部分はマスクされて、品詞の情報も削除されてしまう。即ち、品詞の情報も含み名前等の秘匿情報の全ての情報が形態素解析結果として用いられないので、音声認識処理に際して名前等を正確に変換することが困難となってしまうことが考えられる。そこで、本実施の形態は、秘匿情報については典型的な他の単語に置き換えることで、形態素解析結果として利用するようにしたものである。
【００４０】
本実施の形態は秘匿情報マスク部１０３及び記憶部１０４に夫々代えて秘匿情報置換部１１０及び記憶部１１１を採用した点が第１の実施の形態と異なる。
【００４１】
記憶部１１１は置換ルールを記憶している。秘匿情報置換部１１０は記憶部１１１の置換ルールに従って、テキスト解析部１０２からの形態素解析結果のうち秘匿情報の単語を典型的な他の単語に置き換えてコーパス集積部１０５に出力するようになっている。
【００４２】
次に、このように構成された実施の形態の作用について図３、図７及び図８の説明図を参照して説明する。図７及び図８は夫々図２及び図４に対応したものである。
【００４３】
図７は記憶部１１１に記憶されている置換ルールの一例を示している。図７の例では、置換ルールは品詞、見出し及び典型的単語の組の情報として蓄積されている。図７は２つのレコードが蓄積された例を示している。図７においても＊（アスタリスク）は全ての見出しについての制約条件がないこと、即ち、その品詞中に含まれる全ての単語をマスクすることを示している。また、図７の典型的単語は、形態素解析結果の単語が品詞及び見出しで規定された単語である場合に、置き換える単語を示している。
【００４４】
図７の例では、例えば、品詞が人名：姓氏の全ての見出し（単語）については、「鈴木」という単語に置き換えその品詞を人名：姓氏にすることを示している。なお、図７では形態素の品詞情報に基づいた置換ルールが記述されているが、見出しあるいは見出しと品詞の組み合わせによって表現することも可能である。
【００４５】
いま、テキスト抽出部１０１によって抽出されたテキストの一部が図３に示すものであるものとして説明する。
【００４６】
この場合の形態素解析結果は例えば図４に示すものとなり、この解析結果が秘匿情報置換部１１０に与えられる。秘匿情報置換部１１０は記憶部１１１から置換ルールを読出す。置換ルールが図７に示すものである場合には、秘匿情報置換部１１０は、テキスト解析部１０２の解析結果のうち、品詞が人名：姓氏の全ての単語を「鈴木」に置き換えその品詞を人名：姓氏とし、品詞が人名：名前の全ての単語を「太郎」に置き換えその品詞を人名：名前とし、置換後の形態素解析結果をコーパス集積部１０５に出力する。
【００４７】
こうして、秘匿情報置換部１１０からは図８に示す形態素解析結果が出力される。図４と図８との比較から明らかなように、形態素解析結果のうち品詞が人名：姓氏と人名：名前の単語については、典型的な単語である「鈴木」、「太郎」に置換されて秘匿情報置換部１１０から出力されている。即ち、例えば、病院のカルテ等に含まれる医療関係の情報のうち人名等の秘匿すべき個人情報については、本来の人名ではなく、典型的な人名である例えば「鈴木太郎」に置き換えられるので、個人情報が形態素解析結果として出力されることが防止される。また、置換後の単語についても置換前の単語と同一の品詞の情報が付加されるので、音声認識処理に際して、品詞情報を活用することが可能である。
【００４８】
他の作用は第１の実施の形態と同様である。
【００４９】
このように、本実施の形態においては、秘匿情報については典型的な単語に置換されて形態素解析結果として出力されることから、作成された統計的言語モデルを用いた場合でも、秘匿情報が音声認識結果に現れることを防止することができる。しかも、単語を置換した場合でも品詞情報が失われないので、統計的言語モデルを用いた音声認識処理において品詞の情報を利用した処理を可能にすることができる。
【００５０】
図９は本発明の第３の実施の形態を示すブロック図である。図９において図６と同一の構成要素には同一符号を付して説明を省略する。
【００５１】
第２の実施の形態においては秘匿情報を典型的な単語に置換することで、秘匿情報の漏出を防止した。同様に、本実施の形態は秘匿情報の単語を適宜の伏せ字に置き換えることにより、秘匿情報の漏出を防止するようにしたものである。なお、置き換えに際して、置き換え前の品詞情報と同一の品詞情報を伏せ字に付加するようになっている。
【００５２】
本実施の形態は記憶部１１１に代えて記憶部１２１を採用した点が第２の実施の形態と異なる。記憶部１２１は品詞情報を有する伏せ字に置き換えるための置換ルールを記憶している。
【００５３】
次に、このように構成された実施の形態の作用について図１０乃至図１３の説明図を参照して説明する。図１０乃至図１３は夫々図２乃至図５に対応したものである。
【００５４】
図１０は記憶部１２１に記憶されている置換ルールの一例を示している。図１０の例では、置換ルールは品詞、見出し及びマスク単語の組の情報として蓄積されている。図１０は１１個のレコードが蓄積された例を示している。図１０において＊（アスタリスク）は、全ての単語を意味し、例えば、＊県は、最後の文字が「県」の全ての見出しを示している。図１０のマスク単語は、形態素解析結果の単語が品詞及び見出しで規定された単語である場合に、伏せ字で置き換える単語を示している。
【００５５】
図１０の例では、例えば、品詞が地名のうち「県」を最後の文字とする全ての見出し（単語）については、「□県」という単語に置き換えその品詞を地名にすることを示している。なお、図１０の置換ルールは例えばレコードの上側のものから順に実行されるものとし、例えば、品詞が地名の＊で表される見出しよりも、＊都，＊道，…等の上側のレコードに記述されたルールが先に実行される。
【００５６】
また、図１０においても、形態素の品詞情報に基づいた置換ルールが記述されているが、見出しあるいは見出しと品詞の組み合わせによって表現することも可能である。
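The ordered, wildcard-based rules described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: FIG. 10 is not reproduced in this text, so the rule table `RULES` below is a hypothetical subset reconstructed from the description, the function name `redact` is invented, and the symbol ■ stands in for the "(black square)" and ◆ for the "(black diamond)" placeholders. Wildcard matching uses the standard-library `fnmatch.fnmatchcase`, and rules are tried from the top so that specific suffix rules take precedence over the catch-all `*` rule.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (part of speech, headword pattern, mask word).
# Order matters: earlier records are applied before later ones.
RULES = [
    ("地名", "*都", "□都"),
    ("地名", "*県", "□県"),
    ("地名", "*市", "■市"),
    ("地名", "*区", "■区"),
    ("地名", "*", "◆"),   # catch-all for remaining place names
    ("数字", "*", "◇"),
]

def redact(morphemes, rules=RULES):
    """Replace each (headword, pos) pair by the mask word of the first matching rule.

    The part of speech is kept unchanged, as the description requires.
    """
    out = []
    for headword, pos in morphemes:
        for rule_pos, pattern, mask_word in rules:
            if pos == rule_pos and fnmatchcase(headword, pattern):
                headword = mask_word
                break  # first matching rule wins
        out.append((headword, pos))
    return out
```

Because the catch-all place-name rule is listed last, a headword such as 神奈川県 is matched by the `*県` rule first and only place names without a listed suffix fall through to ◆.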
【００５７】
いま、テキスト抽出部１０１によって抽出されたテキストの一部が図１１に示すものであるものとして説明する。図１１は抽出されたテキスト中に、日本一郎さんの住所が「東京都港区芝浦１−２−３４５」であること、日本次郎さんの住所が「神奈川県川崎市幸区小向６−７−８」であること、日本三郎さんの住所が「神奈川県川崎市幸区幸町９−１」であることを示す表記が存在したことを示している。
【００５８】
この場合には、テキスト解析部１０２は、例えば図１２に示す解析結果を得る。即ち、図１２に示すように、抽出された図１１の各単語列は、夫々単語に分割されて、その品詞が付される。例えば、図１１の「日本一郎：東京都港区芝浦１−２−３４５」は、日本／一郎／：／東京都／港区／１／−／２／−／３／４／５の１２個の単語からなり、「日本」は品詞が人名：姓氏で、「一郎」は品詞が人名：名前で、「：」は品詞が記号で、「東京都」は品詞が地名で、「港区」は品詞が地名で、「１」は品詞が数字で、「−」は品詞が記号で、「２」は品詞が数字で、「−」は品詞が記号で、「３」は品詞が数字で、「４」は品詞が数字で、「５」は品詞が数字であることが解析結果によって得られる。
【００５９】
これらの形態素解析結果は秘匿情報置換部１１０に与えられる。秘匿情報置換部１１０は記憶部１２１から置換ルールを読出し、置換ルールに従って秘匿情報の置換を行う。置換ルールが図１０に示すものである場合には、秘匿情報置換部１１０は、テキスト解析部１０２の解析結果のうち、品詞が人名：姓氏の全ての単語を「○」に置き換えその品詞を人名：姓氏とし、品詞が人名：名前の全ての単語を「（黒丸）」に置き換えその品詞を人名：名前とし、品詞が数字の全ての単語を「◇」に置き換えその品詞を数字とし、同様に、地名についても図１０の置換ルールに従って置換を行う。こうして、秘匿情報置換部１１０からは図１３に示す形態素解析結果が出力される。
【００６０】
図１２と図１３との比較から明らかなように、形態素解析結果のうち置換ルールで規定された単語については、○、（黒丸）、◇、（黒菱）形の文字に置換され、或いは、□、（黒四角）と「都」、「道」、「府」、「県」、「市」、「区」、「町」の文字を付加した単語に置換される。例えば、図１２中の「日本次郎」さんの住所である「神奈川県川崎市幸区小向６−７−８」は「□県（黒四角）市（黒四角）区（黒菱形）◇−◇−◇◇◇」に置換される。即ち、例えば、病院のカルテ等に含まれる医療関係の情報のうち人名，地名等の秘匿すべき個人情報については、品詞の情報が付加された伏せ字に置き換えられるので、個人情報が形態素解析結果として出力されることが防止される。また、伏せ字には置換前の単語と同一の品詞の情報が付加されるので、音声認識処理に際して、品詞情報を活用することが可能である。
【００６１】
他の作用は第２の実施の形態と同様である。
【００６２】
このように、本実施の形態においても、第２の実施の形態と同様の効果を得ることができる。
【００６３】
図１４は本発明の第４の実施の形態を示すブロック図である。図１４において図６と同一の構成要素には同一符号を付して説明を省略する。
【００６４】
第２及び第３の実施の形態においては秘匿情報を予め定められた典型的な語又は伏せ字に置換することで、秘匿情報の漏出を防止した。しかしながら、秘匿情報の単語が、伏せ字又は典型的な単語に置き換えられることから、形態素解析結果には個人名等が含まれず、この形態素解析結果を基に作成した統計的言語モデルを用いた場合には、個人名等を正しく音声認識処理することが困難となることがある。そこで、本実施の形態は、秘匿情報の単語を同一品詞のランダムな単語に置き換えることにより、秘匿情報の漏出を防止すると共に、個人名等であっても正しく音声認識処理することを可能にしたものである。
【００６５】
本実施の形態は秘匿情報置換部１１０及び記憶部１１１に夫々代えて秘匿情報置換部１３０及び記憶部１３１を採用した点が第２の実施の形態と異なる。記憶部１３１は秘匿情報を同一品詞の他のランダムな単語に置き換えるための置換ルールを記憶している。
【００６６】
次に、このように構成された実施の形態の作用について図１５乃至図１８の説明図を参照して説明する。図１５乃至図１８は夫々図２乃至図５に対応したものである。
【００６７】
図１５は記憶部１３１に記憶されている置換ルールの一例を示している。図１５の例では、置換ルールは品詞、見出し及びランダム単語の組の情報として蓄積されている。図１５は１１個のレコードが蓄積された例を示している。図１５において＊（アスタリスク）は、全ての単語を意味する。図１５のランダム単語は、形態素解析結果の単語が品詞及び見出しで規定された単語である場合に、置き換える単語を示しており、Ｒａｎｄ（）はランダムに抽出する単語であることを示している。
【００６８】
図１５の例では、例えば、品詞が数字の０−９は、夫々品詞が数字である０−９のランダムな数字に置き換えられることを示している。また、図１５においても、形態素の品詞情報に基づいた置換ルールが記述されているが、見出しあるいは見出しと品詞の組み合わせによって表現することも可能である。
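The random replacement described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the candidate pools in `POOLS` and the function name `randomize` are hypothetical, and the text does not say where the random words are drawn from beyond their sharing the same part of speech. The sketch excludes the original headword from the candidates when possible, so that the replacement is a different word.

```python
import random

# Hypothetical candidate pools, keyed by part of speech.
POOLS = {
    "人名：姓氏": ["鈴木", "佐藤", "高橋"],
    "人名：名前": ["良子", "太郎", "花子"],
    "数字": list("０１２３４５６７８９"),
}

def randomize(morphemes, pools=POOLS, rng=None):
    """Replace each (headword, pos) pair whose pos has a pool by a random same-pos word."""
    if rng is None:
        rng = random.Random()
    out = []
    for headword, pos in morphemes:
        if pos in pools:
            # Prefer a word that differs from the original; fall back to the full pool.
            candidates = [w for w in pools[pos] if w != headword] or pools[pos]
            headword = rng.choice(candidates)
        out.append((headword, pos))
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the replacement reproducible, which may be convenient when the same corpus has to be regenerated.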
【００６９】
いま、テキスト抽出部１０１によって抽出されたテキストの一部が図１６に示すものであるものとして説明する。図１６は抽出されたテキスト中に、日本一郎さんに関する何らかの数値が「１２３−１２３４」であること、日本次郎さんに関する何らかの数値が「１２３−４５６７」であること、日本三郎さんに関する何らかの数値が「９９９−９９９９」であることを示す表記が存在したことを示している。なお、これらの数値は、各種測定値、金額、年齢等の各種情報である。
【００７０】
この場合には、テキスト解析部１０２は、例えば図１７に示す解析結果を得る。即ち、図１７に示すように、抽出された図１６の各単語列は、夫々単語に分割されて、その品詞が付される。例えば、図１６の「日本一郎：１２３−１２３４」は、日本／一郎／：／１／２／３／−／１／２／３／４の１１個の単語からなり、「日本」は品詞が人名：姓氏で、「一郎」は品詞が人名：名前で、「：」は品詞が記号で、「１」〜「３」は品詞が数字で、「−」は品詞が記号で、「１」〜「４」は品詞が数字であることが解析結果によって得られる。
【００７１】
これらの形態素解析結果は秘匿情報置換部１３０に与えられる。秘匿情報置換部１３０は記憶部１３１から置換ルールを読出し、置換ルールに従って秘匿情報の置換を行う。置換ルールが図１５に示すものである場合には、秘匿情報置換部１３０は、テキスト解析部１０２の解析結果のうち、品詞が人名：姓氏及び人名：名前の全ての単語をランダムに抽出した同一品詞の他の単語に置き換えて同一の品詞を付し、品詞が数字の全ての単語についてはランダムに抽出した品詞が数字の他の数字に置き換える。こうして、秘匿情報置換部１３０からは図１８に示す形態素解析結果が出力される。
【００７２】
図１７と図１８との比較から明らかなように、形態素解析結果のうち置換ルールで規定された単語については、ランダムに抽出した同一品詞の他の単語に置換される。例えば、図１７中の「日本一郎」さんの「１２３−１２３４」は「鈴木良子」さんの「３１３−６９２４」に置換される。即ち、例えば、病院のカルテ等に含まれる医療関係の情報のうち人名及び数値等の秘匿すべき個人情報については、ランダムに抽出された同一品詞の他の単語に置き換えられるので、個人情報が形態素解析結果として出力されることが防止される。また、置換後の単語としてはランダムに抽出した同一品詞の単語を用いるので、置換後の形態素解析結果には名前や数値等の情報が含まれる。従って、この形態素解析結果を基に作成した統計的言語モデルを用いた場合には、音声認識処理に際して、名前や数値等を確実に認識することが可能となる。
【００７３】
他の作用は第２の実施の形態と同様である。
【００７４】
このように、本実施の形態においても、第２及び第３の実施の形態と同様の効果を得ることができる。更に、置換後の単語としてランダムに抽出した同一品詞の単語を用いるので、実際に利用される名前や数値等の情報を含む統計的言語モデルを作成することができ、音声認識精度が低下することを防止することができる。また、置換後の単語を、利用形態に応じたデータベースから抽出することにより、利用形態に適した統計的言語モデルを作成することができ、音声認識精度を一層向上させることができる。
【００７５】
図１９は本発明の第５の実施の形態を示すブロック図である。図１９において図１と同一の構成要素には同一符号を付して説明を省略する。
【００７６】
上記第１乃至第４の実施の形態においては、統計的言語モデル用のテキストを収集する過程で、統計情報集計に使用するテキスト中の秘匿情報に関する表記情報を事前に除去した。これに対し、本実施の形態は既存のコーパス（統計的言語モデル用のテキスト収集後のテキスト）から統計的言語モデルを作成する過程で、統計情報集計に使用するテキスト中の秘匿情報に関する表記情報を事前に除去するようにしたものである。
【００７７】
本実施の形態はデータベース１００に代えてコーパス１４０を採用し、テキスト抽出部１０１に代えてコーパス入力部１４１を採用した点が図１の実施の形態と異なる。
【００７８】
コーパス１４０は、統計的言語モデル作成用に収集されたテキストが集積されたものである。コーパス１４０中には秘匿情報が含まれていることがある。コーパス入力部１４１は、コーパス１４０を処理対象として、コーパス１４０中のテキストをテキスト解析部１０２に出力する。
【００７９】
このように構成された実施の形態においては、既に作成されている既存のコーパス１４０からテキストを抽出してテキスト解析部１０２に出力する。これにより、以後、第１の実施の形態と同様の手法によって、秘匿情報をマスクした形態素解析結果を得、秘匿情報を排除したコーパスを蓄積することができる。
【００８０】
他の作用は図１の実施の形態と同様である。
【００８１】
このように、本実施の形態においては既存のコーパスを利用する場合でも、秘匿情報を排除したコーパスに変換することができ、秘匿情報を漏出させることなく、統計的言語モデルを利用した音声認識処理を可能にすることができる。
【００８２】
上記第５の実施の形態は上記第２乃至第４の実施の形態にも適用することができる。図２０乃至図２２は第５の実施の形態を図６、図９及び図１４に示す第２乃至第４の実施の形態に適用した場合のブロック図である。
【００８３】
これらの図２０乃至図２２においては、データベース１００に代えてコーパス１４０を用い、テキスト抽出部１０１に代えてコーパス入力部１４１を採用した点が、図６、図９及び図１４と異なる。
【００８４】
他の構成及び作用は、夫々第２乃至第４の実施の形態と同様である。
【００８５】
このように、これらの図２０乃至図２２に示す例においても、第１乃至第５の実施の形態と同様の効果を得ることができる。
【００８６】
【発明の効果】
以上説明したように本発明によれば、統計的言語モデル用のテキストを収集する過程又は収集後のテキストから統計的言語モデルを作成する過程において、統計情報集計に使用するテキスト中の個人情報に関する表記情報を除去することにより、プライベートな個人情報が誤って認識結果中に出力されることを防止することができるという効果を有する。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る統計的言語モデルを作成するためのコーパス処理装置を示すブロック図。
【図２】図１中の記憶部１０４に記憶されているマスクルールを説明するための説明図。
【図３】第１の実施の形態の作用を説明するための説明図。
【図４】第１の実施の形態の作用を説明するための説明図。
【図５】第１の実施の形態の作用を説明するための説明図。
【図６】本発明の第２の実施の形態を示すブロック図。
【図７】図６中の記憶部１１１に記憶されている置換ルールを説明するための説明図。
【図８】第２の実施の形態の作用を説明するための説明図。
【図９】本発明の第３の実施の形態を示すブロック図。
【図１０】図９中の記憶部１２１に記憶されている置換ルールを説明するための説明図。
【図１１】第３の実施の形態の作用を説明するための説明図。
【図１２】第３の実施の形態の作用を説明するための説明図。
【図１３】第３の実施の形態の作用を説明するための説明図。
【図１４】本発明の第４の実施の形態を示すブロック図。
【図１５】図１４中の記憶部１３１に記憶されている置換ルールを説明するための説明図。
【図１６】第４の実施の形態の作用を説明するための説明図。
【図１７】第４の実施の形態の作用を説明するための説明図。
【図１８】第４の実施の形態の作用を説明するための説明図。
【図１９】本発明の第５の実施の形態を示すブロック図。
【図２０】第５の実施の形態の変形例を示すブロック図。
【図２１】第５の実施の形態の変形例を示すブロック図。
【図２２】第５の実施の形態の変形例を示すブロック図。
【符号の説明】
１００…データベース、１０１…テキスト抽出部、１０２…テキスト解析部、１０３…秘匿情報マスク部、１０４，１０６，１０８…記憶部、１０５…コーパス集積部、１０７…コーパス統計集計部。
[0001]
[Technical Field of the Invention]
The present invention relates to a corpus processing apparatus, method, and program for creating a statistical language model in such a way that personal information and other information that should be kept confidential does not leak out during speech recognition processing.
[0002]
[Prior art]
In recent years, Japanese dictation systems that allow Japanese text to be entered by voice have been put into practical use. Such a system achieves high recognition performance by using not only acoustic correctness but also information on the occurrence probabilities of word sequences, which is called a statistical language model. The occurrence probabilities of the word sequences that make up the statistical language model are obtained by applying statistical techniques to a large amount of text, called a corpus, and aggregating the resulting statistics.
[0003]
Therefore, the occurrence probabilities of word sequences in the statistical language model depend strongly on the content of the sentences in the corpus (text) used to create the language model. That is, a dictation system easily recognizes expressions that appear in the text used to create the statistical language model and, conversely, has difficulty recognizing expressions that do not appear there. For this reason, when a user speaks a sentence that does not appear in that text, the system quite often misrecognizes it as a character string that is acoustically close to a word sequence that does appear in the text.
[0004]
These techniques are described in detail in the following documents.
[0005]
"Speech Recognition Using Probabilistic Models" by Seiichi Nakagawa, IEICE {ISBN4-88552-072-X}
"Spoken Language Processing" by Kenji Kita, Morikita Publishing {ISBN4-627-82380-0}
"Probabilistic Language Models" by Kenji Kita, University of Tokyo Press {ISBN4-13-065404-7}
[0006]
[Problems to be solved by the invention]
In this way, every expression that appears in the corpus (text) is directly reflected in the occurrence probabilities of word sequences in the statistical language model. Therefore, in a dictation system, expressions that appear in the corpus (text) are likely to appear as recognition results.
[0007]
For this reason, if the corpus (text) used to create the statistical language model contains private personal information or the like, the content of that personal information may appear as a misrecognition result, and the personal information may be revealed to others.
[0008]
For example, when performing a medical speech recognition process, recognition accuracy can be improved by using a medical corpus. However, personal information is often included in a medical corpus, such as a medical record, and there is a problem that personal information may be revealed depending on the result of voice recognition processing.
[0009]
An object of the present invention is to provide a corpus processing apparatus, method, and program for creating a statistical language model that can prevent private personal information from being erroneously output in a recognition result, by removing notation information relating to confidential information from the text used for statistical information aggregation, either in the process of collecting text for the statistical language model or in the process of creating the statistical language model from already collected text.
[0010]
[Means for Solving the Problems]
A corpus processing apparatus for creating a statistical language model according to claim 1 of the present invention comprises: a text analysis unit that performs morphological analysis on text data and outputs a morphological analysis result; a confidential information mask unit that masks the morphological analysis result according to a predetermined mask rule for masking words included in confidential information; a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result masked by the confidential information mask unit; and a corpus statistics totaling unit that collects statistical information from the corpus accumulated by the corpus accumulation unit.
A corpus processing apparatus for creating a statistical language model according to claim 2 of the present invention comprises: a text analysis unit that performs morphological analysis on text data and outputs a morphological analysis result; a confidential information replacement unit that replaces the morphological analysis result according to a predetermined replacement rule for replacing words included in confidential information with other words; a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result replaced by the confidential information replacement unit; and a corpus statistics totaling unit that collects statistical information from the corpus accumulated by the corpus accumulation unit.
[0011]
In claim 1 of the present invention, the morphological analysis result from the text analysis unit is given to the confidential information mask unit. The confidential information mask unit masks the morphological analysis result according to a predetermined mask rule for masking words included in the confidential information. The corpus accumulation unit accumulates, as a corpus, the morphological analysis result in which the words included in the confidential information are masked. The corpus statistics totaling unit collects statistical information based on the accumulated corpus. The confidential information is masked in the statistical information, so the confidential information is prevented from leaking out in speech recognition processing that uses the statistical information.
[0012]
In claim 2 of the present invention, the morphological analysis result from the text analysis unit is given to the confidential information replacement unit. The confidential information replacement unit replaces the morphological analysis result according to a predetermined replacement rule for replacing words included in the confidential information with other words. The corpus accumulation unit accumulates, as a corpus, the morphological analysis result in which the words included in the confidential information are replaced. The corpus statistics totaling unit collects statistical information based on the accumulated corpus. In the statistical information, the confidential information is replaced with other words, so the confidential information is prevented from leaking out in speech recognition processing that uses the statistical information.
[0013]
Note that the present invention relating to an apparatus is also established as an invention relating to a method.
[0014]
Further, the present invention relating to the apparatus is also realized as a program for causing a computer to execute processing corresponding to the present invention.
[0015]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a corpus processing apparatus for creating a statistical language model according to the first embodiment of the present invention.
[0016]
In this embodiment, in the process of collecting text for a statistical language model, notation information relating to information that should be kept confidential, such as personal information (hereinafter referred to as confidential information), is removed in advance from the text used for statistical information aggregation. This prevents private personal information or the like from being erroneously output in the recognition result of a dictation system or the like.
[0017]
Various data are stored in the database 100. For example, in a medical setting such as a hospital, the information stored in the database 100 may include medical records and X-ray imaging results. The information in the database 100 also includes text data containing the text needed to create a statistical language model. This text data may contain information that should be kept confidential, for example notation information relating to personal information.
[0018]
The text extraction unit 101 extracts text information that can be used as a corpus from the database 100 in which data is stored, and supplies the text information to the text analysis unit 102.
[0019]
The text analysis unit 102 outputs the result of morphological analysis of the text extracted by the text extraction unit 101. For example, the text analysis unit 102 divides the sentences in the text into words and obtains a morphological analysis result consisting of pairs of each word (headword) and its part of speech. Any known morphological analysis method that divides text into morphemes, each consisting of a headword and a part of speech, may be used here.
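The output of the text analysis unit can be pictured as a list of (headword, part of speech) pairs. The following Python sketch illustrates that data shape only. It assumes a tiny hypothetical lexicon (`LEXICON`) and a greedy longest-match segmentation, neither of which is specified in this document; the names `Morpheme` and `analyze` are likewise invented for illustration, and a practical system would use an established morphological analyzer instead.

```python
from typing import NamedTuple

class Morpheme(NamedTuple):
    headword: str  # the word as it appears in the text
    pos: str       # part-of-speech label

# Hypothetical toy lexicon; a real analyzer would use a full dictionary.
LEXICON = {
    "日本": "person name: surname",
    "一郎": "person name: given name",
    "：": "symbol",
    "胃癌": "noun",
}

def analyze(text: str) -> list[Morpheme]:
    """Greedy longest-match segmentation against LEXICON (illustration only)."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                result.append(Morpheme(text[i:j], LEXICON[text[i:j]]))
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character as unknown.
            result.append(Morpheme(text[i], "unknown"))
            i += 1
    return result
```

With this lexicon, `analyze("日本一郎：胃癌")` yields four morphemes, one for each of the surname, the given name, the symbol, and the noun.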
[0020]
The analysis result of the text analysis unit 102 is supplied to the confidential information mask unit 103. The confidential information mask unit 103 reads the mask rule stored in the storage unit 104, and masks the morphological analysis result from the text analysis unit 102 according to the read mask rule. The mask rule describes information for deleting confidential information such as personal information when creating a corpus.
[0021]
FIG. 2 is an explanatory diagram for explaining the mask rules stored in the storage unit 104.
[0022]
In the example of FIG. 2, the mask rules are stored as pairs of a part of speech and a headword. FIG. 2 shows an example in which two records are stored. The * (asterisk) in FIG. 2 indicates that no constraint is placed on the headword, that is, that all words with that part of speech are masked. The example of FIG. 2 masks all headwords (words) whose part of speech is person name: surname, and also masks all headwords (words) whose part of speech is person name: given name.
[0023]
For example, by writing “Tanaka” in the headword column instead of an asterisk, it is possible to mask only the word “Tanaka” among the words whose part of speech is person name: surname.
[0024]
In the example of FIG. 2, the storage unit 104 stores mask rules described on the basis of the part-of-speech information of morphemes. However, a mask rule can also be expressed by a headword alone or by a combination of a headword and a part of speech.
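A mask rule of the kind described for FIG. 2 can be sketched in Python as a (part of speech, headword) pair in which `*` matches any headword. The sketch below is illustrative only: the names `MASK_RULES`, `MASK_SYMBOL`, `is_confidential`, and `mask` are hypothetical, the use of ○ as the single mask symbol is an assumption, and representing a masked word with no part of speech (`None`) is an assumption of this sketch about how a masked entry is stored.

```python
# Hypothetical rule table modelled on the two records described for FIG. 2.
MASK_RULES = [
    ("person name: surname", "*"),
    ("person name: given name", "*"),
]
MASK_SYMBOL = "○"  # assumed placeholder for a masked word

def is_confidential(headword, pos, rules):
    """True if some rule has the same part of speech and a matching headword."""
    return any(
        pos == rule_pos and rule_headword in ("*", headword)
        for rule_pos, rule_headword in rules
    )

def mask(morphemes, rules=MASK_RULES):
    """Replace every matching (headword, pos) pair with a masked entry."""
    return [
        (MASK_SYMBOL, None) if is_confidential(headword, pos, rules) else (headword, pos)
        for headword, pos in morphemes
    ]
```

Replacing the `*` in a rule with a specific headword, for example `("person name: surname", "田中")`, restricts masking to that one word, which corresponds to the “Tanaka” variation mentioned above.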
[0025]
The confidential information mask unit 103 outputs the morphological analysis result masked according to the mask rule to the corpus accumulating unit 105. The corpus accumulating unit 105 gives the input post-mask morpheme analysis result to the storage unit 106 and accumulates it as a confidential information excluded corpus for creating a statistical language model. In this way, the corpus from which the confidential information is removed according to the mask rule is stored in the storage unit 106. The corpus format stored in the storage unit 106 may be a morphological analysis result or may be a simple text format.
[0026]
The corpus statistics totaling unit 107 reads out the corpus accumulated in the storage unit 106, creates a statistical language model, and outputs the statistical language model to the storage unit 108. Note that any of the known methods may be used as the method by which the corpus statistics totaling unit 107 creates a statistical language model (N-gram) from the corpus. The storage unit 108 stores a statistical language model.
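As one concrete illustration of the aggregation step, the following Python sketch builds a bigram (N = 2) model from a masked corpus. The document leaves the method open ("any of the known methods"), so the choice of unsmoothed maximum-likelihood estimates, the sentence boundary markers `<s>` and `</s>`, and the function name `bigram_model` are all assumptions made for this example.

```python
from collections import Counter

def bigram_model(sentences):
    """Return P(next | previous) estimated by relative frequency (no smoothing)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        context_counts.update(padded[:-1])            # every word that has a successor
        bigram_counts.update(zip(padded, padded[1:]))  # adjacent word pairs
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }
```

Because the input sentences contain only the mask symbol where names used to be, no probability in the resulting table is tied to an actual personal name.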
[0027]
By using the statistical language model stored in the storage unit 108, voice recognition processing is performed. In this case, since the confidential information is not included in the corpus from which the statistical language model is created, it is unlikely that the confidential information stored in the database 100 will appear in the speech recognition result.
[0028]
Next, the operation of the embodiment configured as described above will be described with reference to FIGS. 3 to 5.
[0029]
Now, the description assumes that medical-related information is stored in the database 100. The text extraction unit 101 extracts text from the various information stored in the database 100 and outputs it to the text analysis unit 102. The text analysis unit 102 performs morphological analysis on the text and obtains a morphological analysis result consisting of pairs of a headword and its part of speech.
[0030]
Now, assume that a part of the text extracted by the text extraction unit 101 is as shown in FIG. 3. FIG. 3 shows that the extracted text contains notations indicating that Ichiro Nihon has stomach cancer, Jiro Nihon has a gastric ulcer, and Saburo Nihon has lung cancer.
[0031]
In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 4. That is, as shown in FIG. 4, each extracted word string in FIG. 3 is divided into words, and each word is given its part of speech. For example, the analysis shows that “Nihon Ichiro: stomach cancer” in FIG. 3 consists of the four words Nihon / Ichiro / : / stomach cancer, where the part of speech of “Nihon” is person name: surname, that of “Ichiro” is person name: given name, that of “:” is symbol, and that of “stomach cancer” is noun.
[0032]
The analysis result of the text analysis unit 102 is given to the confidential information mask unit 103. The confidential information mask unit 103 reads the mask rule from the storage unit 104. For example, assume that the rule shown in FIG. 2 is described as the mask rule. In this case, the confidential information mask unit 103 masks, in the analysis result of the text analysis unit 102, all words whose part of speech is person name: surname and all words whose part of speech is person name: given name, and outputs the masked morphological analysis result to the corpus accumulating unit 105.
[0033]
For example, if the words that the confidential information mask unit 103 masks according to the mask rule are represented by ○ or (black circle), the confidential information mask unit 103 outputs the morphological analysis result shown in FIG. 5.
[0034]
As is clear from a comparison of FIG. 5 with FIGS. 3 and 4, the words in the morphological analysis result whose part of speech is person name: surname or person name: given name are masked with ○ or (black circle), respectively, and output from the confidential information mask unit 103. That is, in the example of FIG. 3, personal information that should be kept confidential, such as a person's name, among the medical-related information contained in hospital medical records and the like is prevented from being output in the morphological analysis result.
[0035]
The corpus accumulating unit 105 causes the storage unit 106 to store the morphological analysis result masked by the confidential information masking unit 103. In this way, the corpus from which the confidential information is excluded is stored in the storage unit 106. The corpus statistics totaling unit 107 creates a statistical language model based on the accumulated corpus and stores it in the storage unit 108.
[0036]
Speech recognition processing is performed using the statistical language model stored in the storage unit 108. The statistical language model of the storage unit 108 is created based on a corpus from which confidential information is excluded, and the confidential information is prevented from appearing in the speech recognition result.
[0037]
As described above, in the present embodiment, confidential information is excluded when a corpus is created from text extracted from a database, so that when the resulting statistical language model is used, confidential information can be prevented from appearing in the speech recognition result. That is, even when using a database of data such as personal information that must be managed under strict security for contractual or legal reasons, leakage of confidential information can be completely prevented.
[0038]
FIG. 6 is a block diagram showing a second embodiment of the present invention. In FIG. 6, components identical to those in FIG. 1 are given the same reference numerals, and their description is omitted.
[0039]
In the first embodiment, leakage of confidential information is prevented by masking it. However, as shown in FIG. 5, the name portions are masked and their part-of-speech information is deleted as well. That is, none of the information on confidential items such as names, including their part-of-speech information, is used in the morphological analysis result, so it may become difficult to transcribe names and the like correctly during speech recognition processing. Therefore, in the present embodiment, confidential information is replaced with other, typical words so that it can still be used in the morphological analysis result.
[0040]
This embodiment differs from the first embodiment in that a confidential information replacement unit 110 and a storage unit 111 are used in place of the confidential information mask unit 103 and the storage unit 104, respectively.
[0041]
The storage unit 111 stores replacement rules. In accordance with the replacement rules in the storage unit 111, the confidential information replacement unit 110 replaces the confidential words in the morphological analysis result from the text analysis unit 102 with other, typical words and outputs the result to the corpus accumulating unit 105.
[0042]
Next, the operation of the embodiment configured in this way will be described with reference to FIGS. 3, 7, and 8. FIGS. 7 and 8 correspond to FIGS. 2 and 4, respectively.
[0043]
FIG. 7 shows an example of the replacement rules stored in the storage unit 111. In the example of FIG. 7, each replacement rule is stored as a set of a part of speech, a heading, and a typical word. FIG. 7 shows an example in which two records are stored. In FIG. 7 as well, * (asterisk) indicates that the heading is unrestricted, that is, every word having the given part of speech is subject to the rule. The typical word in FIG. 7 is the word that replaces a word in the morphological analysis result when that word matches the part of speech and heading specified by the rule.
[0044]
In the example of FIG. 7, for example, all headings (words) whose part of speech is personal name: surname are replaced with the word “Suzuki”, and the part of speech remains personal name: surname. FIG. 7 describes replacement rules based on the part-of-speech information of morphemes.
[0045]
Now, description will be made assuming that a part of the text extracted by the text extraction unit 101 is as shown in FIG.
[0046]
The morphological analysis result in this case is, for example, as shown in FIG. 4. The secret information replacement unit 110 reads the replacement rules from the storage unit 111. When the replacement rules are as shown in FIG. 7, the secret information replacement unit 110 replaces every word in the analysis result of the text analysis unit 102 whose part of speech is personal name: surname with “Suzuki” and keeps the part of speech as personal name: surname, replaces every word whose part of speech is personal name: name with “Taro” and keeps the part of speech as personal name: name, and outputs the morphological analysis result after replacement to the corpus accumulation unit 105.
[0047]
Thus, the secret information replacement unit 110 outputs the morphological analysis result shown in FIG. 8. As is clear from a comparison between FIG. 4 and FIG. 8, the words in the morphological analysis result whose part of speech is personal name: surname or personal name: name are replaced with the typical words “Suzuki” and “Taro”, and the secret information replacement unit 110 outputs the result. That is, personal information to be concealed, such as a person's name, among the medical information contained in hospital charts and the like, is replaced with a typical person name such as “Taro Suzuki” instead of the original name, so the personal information is prevented from being output as a morphological analysis result. In addition, because the word after replacement carries the same part-of-speech information as the word before replacement, the part-of-speech information can be used in speech recognition processing.
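The replacement described for the second embodiment can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Morpheme` class, the `REPLACEMENT_RULES` table layout, the part-of-speech labels, and the function name are hypothetical and are not taken from FIG. 7 or FIG. 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str  # heading (the word as written)
    pos: str      # part of speech


# Hypothetical encoding of a FIG. 7 style table:
# (part of speech, heading pattern, typical word). "*" means any heading.
REPLACEMENT_RULES = [
    ("person-name:surname", "*", "Suzuki"),
    ("person-name:given", "*", "Taro"),
]


def replace_with_typical_words(morphemes, rules=REPLACEMENT_RULES):
    """Replace secret words with typical words, keeping the part of speech."""
    out = []
    for m in morphemes:
        for pos, heading, typical in rules:
            if m.pos == pos and heading in ("*", m.surface):
                # Only the surface form changes; the part of speech is kept.
                out.append(Morpheme(typical, m.pos))
                break
        else:
            # No rule matched, so the morpheme passes through unchanged.
            out.append(m)
    return out


# Assumed toy analysis result (not the actual content of FIG. 4).
analysis = [
    Morpheme("Nihon", "person-name:surname"),
    Morpheme("Ichiro", "person-name:given"),
    Morpheme("visited", "verb"),
]
replaced = replace_with_typical_words(analysis)
print([(m.surface, m.pos) for m in replaced])
```

Because only the surface form is swapped, any later step that relies on part-of-speech information sees the same tags as before replacement.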
[0048]
Other operations are the same as those in the first embodiment.
[0049]
As described above, in the present embodiment, the confidential information is replaced with typical words and output as a morphological analysis result. Therefore, even when the created statistical language model is used, the confidential information can be prevented from appearing in the speech recognition result. Moreover, since the part-of-speech information is not lost even when a word is replaced, speech recognition processing using the statistical language model can make use of the part-of-speech information.
[0050]
FIG. 9 is a block diagram showing a third embodiment of the present invention. In FIG. 9, components identical to those of the second embodiment are given the same reference numerals, and their description is omitted.
[0051]
In the second embodiment, leakage of confidential information is prevented by replacing the confidential information with typical words. Similarly, in the present embodiment, leakage of confidential information is prevented by replacing words of the confidential information with appropriate hidden characters. When replacing, the same part-of-speech information as the part-of-speech information before replacement is added to the hidden character.
[0052]
This embodiment is different from the second embodiment in that a storage unit 121 is used instead of the storage unit 111. The storage unit 121 stores a replacement rule for replacing with a hidden character having part-of-speech information.
[0053]
Next, the operation of the embodiment configured as described above will be described with reference to the explanatory diagrams of FIGS. 10 to 13. FIGS. 10 to 13 correspond to FIGS. 2 to 5, respectively.
[0054]
FIG. 10 shows an example of the replacement rules stored in the storage unit 121. In the example of FIG. 10, each replacement rule is stored as a set of a part of speech, a heading, and a mask word. FIG. 10 shows an example in which 11 records are stored. In FIG. 10, * (asterisk) stands for any word; for example, “* prefecture” denotes all headings whose last character is “prefecture”. The mask word in FIG. 10 is the hidden-character word that replaces a word in the morphological analysis result when that word matches the part of speech and heading specified by the rule.
[0055]
In the example of FIG. 10, for example, all headings (words) whose part of speech is place name and whose last character is “prefecture” are replaced with the word “□ prefecture”, and the part of speech remains place name. The replacement rules in FIG. 10 are executed, for example, in order from the top record; among records such as “* city” and “* road”, the rule described in the upper record is executed first.
[0056]
In FIG. 10, the replacement rule based on the morpheme part-of-speech information is described, but it can be expressed by a heading or a combination of a heading and a part-of-speech.
[0057]
Now, assume that part of the text extracted by the text extraction unit 101 is as shown in FIG. 11. FIG. 11 shows that the extracted text contains notations indicating that the address of Ichiro Nihon is “1-2-345 Shibaura, Minato-ku, Tokyo”, the address of Jiro Nihon is “6-7 Komukai, Saiwai-ku, Kawasaki City, Kanagawa Prefecture”, and the address of Saburo Nihon is “9-1, Yukimachi, Saiwai-ku, Kawasaki-shi, Kanagawa”.
[0058]
In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 12. That is, as shown in FIG. 12, each extracted character string in FIG. 11 is divided into words, and a part of speech is attached to each word. For example, “Nihon Ichiro: Shibaura 1-2-345 Minato-ku, Tokyo” in FIG. 11 is divided into the 12 words Nihon / Ichiro / : / Tokyo / Minato-ku / 1 / − / 2 / − / 3 / 4 / 5. The analysis result indicates that “Nihon” has the part of speech personal name: surname, “Ichiro” has the part of speech personal name: name, “:” is a symbol, “Tokyo” and “Minato-ku” are place names, each “−” is a symbol, and “1”, “2”, “3”, “4”, and “5” are numbers.
[0059]
These morphological analysis results are given to the confidential information replacement unit 110. The secret information replacement unit 110 reads the replacement rules from the storage unit 121 and replaces the secret information according to them. When the replacement rules are as shown in FIG. 10, the secret information replacement unit 110 replaces every word in the analysis result of the text analysis unit 102 whose part of speech is personal name: surname with “○” and keeps the part of speech as personal name: surname, replaces every word whose part of speech is personal name: name with “●” (black circle) and keeps the part of speech as personal name: name, and replaces every word whose part of speech is number with “◇” and keeps the part of speech as number. Place names are also replaced according to the replacement rules of FIG. 10. Thus, the secret information replacement unit 110 outputs the morphological analysis result shown in FIG. 13.
[0060]
As is clear from a comparison between FIG. 12 and FIG. 13, the words in the morphological analysis result that are specified by the replacement rules are replaced either with the characters ○, ● (black circle), ◇, or ◆ (black diamond), or with words that combine □ or ■ (black square) with the characters for “to” (metropolis), “do”, “fu”, “prefecture”, “city”, “ward”, or “town”. For example, in the address of Jiro Nihon in FIG. 12, the prefecture, city, and ward names are replaced with “□ prefecture”, “■ city”, and “■ ward”, the following place name is replaced with “◆”, and each numeral is replaced with “◇”. That is, personal information to be concealed, such as personal names and place names, among the medical information contained in hospital charts and the like, is replaced with hidden characters to which part-of-speech information is attached, so the personal information is prevented from being output as a morphological analysis result. Moreover, since the hidden character carries the same part-of-speech information as the word before replacement, the part-of-speech information can be used in speech recognition processing.
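The ordered, wildcard-based rules of the third embodiment can be sketched in Python as follows. This is a hedged illustration only: the rule list, the part-of-speech labels, the Japanese surface forms, and the assignment of particular hidden characters to particular patterns are assumptions made for the example, not the actual 11 records of FIG. 10. The standard-library `fnmatch.fnmatchcase` function is used to evaluate the `*` wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (part of speech, heading pattern, mask word).
# Rules are tried from the top, so specific suffix rules precede the catch-all.
RULES = [
    ("person-name:surname", "*", "○"),
    ("person-name:given", "*", "●"),
    ("number", "*", "◇"),
    ("place-name", "*県", "□県"),  # headings ending in "prefecture"
    ("place-name", "*市", "■市"),  # headings ending in "city"
    ("place-name", "*区", "■区"),  # headings ending in "ward"
    ("place-name", "*", "◆"),      # any other place name
]


def mask(morphemes, rules=RULES):
    """Replace matching words with hidden characters; keep the part of speech."""
    result = []
    for surface, pos in morphemes:
        for rule_pos, pattern, mask_word in rules:
            if pos == rule_pos and fnmatchcase(surface, pattern):
                surface = mask_word
                break  # the first matching (uppermost) rule wins
        result.append((surface, pos))
    return result


# Assumed toy analysis result for one address entry.
analysis = [
    ("日本", "person-name:surname"),
    ("二郎", "person-name:given"),
    (":", "symbol"),
    ("神奈川県", "place-name"),
    ("川崎市", "place-name"),
    ("幸区", "place-name"),
    ("小向", "place-name"),
    ("6", "number"),
    ("-", "symbol"),
    ("7", "number"),
]
masked = mask(analysis)
print("".join(surface for surface, _ in masked))
```

Keeping the administrative suffix in the mask word preserves the shape of an address in the corpus while removing the identifying part of each name.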
[0061]
Other operations are the same as those of the second embodiment.
[0062]
Thus, also in this embodiment, the same effect as that of the second embodiment can be obtained.
[0063]
FIG. 14 is a block diagram showing a fourth embodiment of the present invention. In FIG. 14, components identical to those of the second embodiment are given the same reference numerals, and their description is omitted.
[0064]
In the second and third embodiments, leakage of confidential information is prevented by replacing the confidential information with predetermined typical words or hidden characters. However, because the words of confidential information are replaced with hidden characters or typical words, the morphological analysis result does not contain actual personal names and the like, and when a statistical language model created from this morphological analysis result is used, it may be difficult to correctly recognize personal names and the like in speech. Therefore, in this embodiment, each word of confidential information is replaced with a random word of the same part of speech, which prevents leakage of the confidential information while still allowing personal names and the like to be recognized correctly.
[0065]
This embodiment is different from the second embodiment in that a secret information replacement unit 130 and a storage unit 131 are used instead of the secret information replacement unit 110 and the storage unit 111, respectively. The storage unit 131 stores a replacement rule for replacing confidential information with another random word of the same part of speech.
[0066]
Next, the operation of the embodiment configured as described above will be described with reference to FIGS. 15 to 18. FIGS. 15 to 18 correspond to FIGS. 2 to 5, respectively.
[0067]
FIG. 15 shows an example of the replacement rules stored in the storage unit 131. In the example of FIG. 15, each replacement rule is stored as a set of a part of speech, a heading, and a random word. FIG. 15 shows an example in which 11 records are stored. In FIG. 15, * (asterisk) means all words. The random word in FIG. 15 is the word that replaces a word in the morphological analysis result when that word matches the part of speech and heading specified by the rule, and Rand () denotes a word extracted at random.
[0068]
In the example of FIG. 15, for example, a word whose part of speech is number is replaced with a randomly chosen digit from 0 to 9, and its part of speech remains number. FIG. 15 describes replacement rules based on the part-of-speech information of morphemes, but a rule can also be expressed by a heading alone or by a combination of a heading and a part of speech.
[0069]
Now, assume that part of the text extracted by the text extraction unit 101 is as shown in FIG. 16. FIG. 16 shows that the extracted text contains notations indicating that some numerical value related to Ichiro Nihon is “1233-1234”, some numerical value related to Jiro Nihon is “123-4567”, and some numerical value related to Saburo Nihon is “999-9999”. These numerical values may be any of various kinds of information, such as measurement values, amounts, and ages.
[0070]
In this case, the text analysis unit 102 obtains, for example, the analysis result shown in FIG. 17. That is, as shown in FIG. 17, each extracted character string in FIG. 16 is divided into words, and a part of speech is attached to each word. For example, “Nihon Ichiro: 123-1234” in FIG. 16 is divided into the 12 words Nihon / Ichiro / : / 1 / 2 / 3 / - / 1 / 2 / 3 / 4 / 5. The analysis result indicates that “Nihon” has the part of speech personal name: surname, “Ichiro” has the part of speech personal name: name, “:” is a symbol, “1” to “3” are numbers, “-” is a symbol, and “1” to “4” are numbers.
[0071]
These morphological analysis results are given to the confidential information replacement unit 130. The secret information replacement unit 130 reads the replacement rules from the storage unit 131 and replaces the secret information according to them. When the replacement rules are as shown in FIG. 15, the confidential information replacement unit 130 replaces every word in the analysis result of the text analysis unit 102 whose part of speech is personal name: surname or personal name: name with another, randomly extracted word of the same part of speech and attaches the same part of speech to it, and replaces every word whose part of speech is number with another, randomly extracted number and keeps the part of speech as number. Thus, the secret information replacement unit 130 outputs the morphological analysis result shown in FIG. 18.
[0072]
As is clear from a comparison between FIG. 17 and FIG. 18, the words in the morphological analysis result that are specified by the replacement rules are replaced with other, randomly extracted words of the same part of speech. For example, “1233-1234” of “Nihon Ichiro” in FIG. 17 is replaced with “313-6924” of “Ryoko Suzuki”. That is, personal information to be concealed, such as personal names and numerical values, among the medical information contained in hospital charts and the like, is replaced with other, randomly extracted words of the same part of speech, so the personal information is prevented from being output as a morphological analysis result. Moreover, since randomly extracted words of the same part of speech are used as the replacement words, the morphological analysis result after replacement still contains information such as names and numerical values. Therefore, when a statistical language model created from this morphological analysis result is used, names, numerical values, and the like can be recognized reliably during speech recognition processing.
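A minimal Python sketch of random same-part-of-speech replacement is given below. The `VOCAB` dictionary, the part-of-speech labels, and the function name are assumptions made for illustration; the source does not specify where the random words are drawn from beyond a database matching the usage pattern. Note that a random draw can coincide with the original word, which a real system might want to exclude.

```python
import random

# Hypothetical per-part-of-speech vocabularies from which replacements are drawn.
VOCAB = {
    "person-name:surname": ["Suzuki", "Sato", "Tanaka"],
    "person-name:given": ["Ryoko", "Taro", "Hanako"],
    "number": list("0123456789"),
}


def replace_randomly(morphemes, vocab=VOCAB, rng=None):
    """Replace each word whose part of speech is listed in vocab with a
    randomly chosen word of the same part of speech; the tag is unchanged."""
    rng = rng or random.Random()
    return [
        (rng.choice(vocab[pos]) if pos in vocab else surface, pos)
        for surface, pos in morphemes
    ]


# Assumed toy analysis result (not the actual content of FIG. 17).
analysis = [
    ("Nihon", "person-name:surname"),
    ("Ichiro", "person-name:given"),
    (":", "symbol"),
    ("1", "number"),
    ("2", "number"),
    ("3", "number"),
    ("-", "symbol"),
    ("4", "number"),
]
# A seeded generator makes the run repeatable; the specific output is not fixed here.
randomized = replace_randomly(analysis, rng=random.Random(0))
print(randomized)
```

Drawing from real names and digits keeps name-like and number-like word sequences in the corpus, which is the stated reason this embodiment can preserve recognition accuracy for such words.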
[0073]
Other operations are the same as those of the second embodiment.
[0074]
As described above, this embodiment also provides the same effects as the second and third embodiments. Furthermore, since randomly extracted words of the same part of speech are used as replacement words, a statistical language model that includes information such as names and numerical values in actual use can be created, and a reduction in speech recognition accuracy can be prevented. Further, by extracting the replacement words from a database that corresponds to the usage pattern, a statistical language model suited to that usage pattern can be created, and speech recognition accuracy can be improved further.
[0075]
FIG. 19 is a block diagram showing a fifth embodiment of the present invention. In FIG. 19, components identical to those of FIG. 1 are given the same reference numerals, and their description is omitted.
[0076]
In the first to fourth embodiments, the notation information related to confidential information in the text used for statistical information aggregation is removed in advance, during the process of collecting the text for the statistical language model. In contrast, in the present embodiment, the notation information related to confidential information in the text used for statistical information aggregation is removed in advance during the process of creating a statistical language model from an existing corpus (text that has already been collected for a statistical language model).
[0077]
This embodiment is different from the embodiment of FIG. 1 in that a corpus 140 is used instead of the database 100 and a corpus input unit 141 is used instead of the text extraction unit 101.
[0078]
The corpus 140 is a collection of text collected for statistical language model creation. The corpus 140 may contain confidential information. The corpus input unit 141 takes the corpus 140 as its processing target and outputs the text in the corpus 140 to the text analysis unit 102.
[0079]
In the embodiment configured as described above, text is extracted from an existing corpus 140 that has already been created and output to the text analysis unit 102. Thereby, the morphological analysis result obtained by masking the confidential information can be obtained and the corpus excluding the confidential information can be accumulated by the same method as in the first embodiment.
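The flow from an existing corpus to statistics can be sketched in Python as below. This is an assumption-laden illustration: `build_statistics`, `toy_analyze`, and `toy_mask` are hypothetical names, the whitespace tokenizer with a capitalization heuristic merely stands in for a real morphological analyzer, and word bigram counting is used as one simple example of the statistical information a corpus statistics totaling step might collect.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Morpheme = Tuple[str, str]  # (heading, part of speech)


def build_statistics(
    texts: Iterable[str],
    analyze: Callable[[str], List[Morpheme]],
    conceal: Callable[[List[Morpheme]], List[Morpheme]],
    n: int = 2,
) -> Counter:
    """Analyze each text, conceal confidential words, then count word n-grams.
    The text source may be a database extract or an existing corpus."""
    counts: Counter = Counter()
    for text in texts:
        words = [surface for surface, _ in conceal(analyze(text))]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def toy_analyze(text: str) -> List[Morpheme]:
    # Stand-in for a morphological analyzer: capitalized tokens are treated
    # as person names purely for the sake of the example.
    return [(w, "person-name" if w.istitle() else "other") for w in text.split()]


def toy_mask(morphemes: List[Morpheme]) -> List[Morpheme]:
    # Stand-in for the concealment step: person names become "○".
    return [("○" if pos == "person-name" else s, pos) for s, pos in morphemes]


# Assumed contents of an existing corpus that still contains names.
existing_corpus = ["Nihon has a fever", "Suzuki has a cough"]
stats = build_statistics(existing_corpus, toy_analyze, toy_mask)
print(stats.most_common(3))
```

Since the text source is just an iterable of strings, the same function serves both the database-extraction case and the existing-corpus case; only the supplier of `texts` changes.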
[0080]
Other operations are the same as those of the embodiment of FIG. 1.
[0081]
As described above, in this embodiment, even an existing corpus can be converted into a corpus from which confidential information is excluded, which makes possible speech recognition processing that uses a statistical language model without leaking confidential information.
[0082]
The fifth embodiment can also be applied to the second to fourth embodiments. FIGS. 20 to 22 are block diagrams for the cases in which the fifth embodiment is applied to the second to fourth embodiments shown in FIGS. 6, 9, and 14, respectively.
[0083]
FIGS. 20 to 22 differ from FIGS. 6, 9, and 14 in that a corpus 140 is used instead of the database 100 and a corpus input unit 141 is used instead of the text extraction unit 101.
[0084]
Other configurations and operations are the same as those of the second to fourth embodiments, respectively.
[0085]
As described above, also in the examples shown in FIGS. 20 to 22, the same effects as those of the first to fifth embodiments can be obtained.
[0086]
[Effect of the invention]
As described above, according to the present invention, the notation information related to personal information in the text used for statistical information aggregation is removed, either in the process of collecting the text for the statistical language model or in the process of creating the statistical language model from the collected text, so private personal information can be prevented from being erroneously output in the recognition result.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a corpus processing device for creating a statistical language model according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram for explaining mask rules stored in a storage unit 104 in FIG. 1.
FIG. 3 is an explanatory diagram for explaining the operation of the first embodiment.
FIG. 4 is an explanatory diagram for explaining the operation of the first embodiment.
FIG. 5 is an explanatory diagram for explaining the operation of the first embodiment.
FIG. 6 is a block diagram showing a second embodiment of the present invention.
FIG. 7 is an explanatory diagram for explaining a replacement rule stored in a storage unit 111 in FIG. 6.
FIG. 8 is an explanatory diagram for explaining the operation of the second embodiment.
FIG. 9 is a block diagram showing a third embodiment of the present invention.
FIG. 10 is an explanatory diagram for explaining a replacement rule stored in a storage unit 121 in FIG. 9.
FIG. 11 is an explanatory diagram for explaining the operation of the third embodiment.
FIG. 12 is an explanatory diagram for explaining the operation of the third embodiment.
FIG. 13 is an explanatory diagram for explaining the operation of the third embodiment.
FIG. 14 is a block diagram showing a fourth embodiment of the present invention.
FIG. 15 is an explanatory diagram for explaining a replacement rule stored in a storage unit 131 in FIG. 14.
FIG. 16 is an explanatory diagram for explaining the operation of the fourth embodiment.
FIG. 17 is an explanatory diagram for explaining the operation of the fourth embodiment.
FIG. 18 is an explanatory diagram for explaining the operation of the fourth embodiment.
FIG. 19 is a block diagram showing a fifth embodiment of the present invention.
FIG. 20 is a block diagram showing a modification of the fifth embodiment.
FIG. 21 is a block diagram showing a modification of the fifth embodiment.
FIG. 22 is a block diagram showing a modification of the fifth embodiment.
[Explanation of symbols]
100 ... database, 101 ... text extraction unit, 102 ... text analysis unit, 103 ... confidential information mask unit, 104, 106, 108 ... storage units, 105 ... corpus accumulation unit, 107 ... corpus statistics totaling unit.

Claims

A corpus processing apparatus for creating a statistical language model, comprising:
a text analysis unit that morphologically analyzes text data and outputs a morphological analysis result;
a confidential information mask unit that masks the morphological analysis result according to a predetermined mask rule for masking a word included in confidential information;
a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result masked by the confidential information mask unit; and
a corpus statistics totaling unit that collects statistical information from the corpus accumulated by the corpus accumulation unit.

A corpus processing apparatus for creating a statistical language model, comprising:
a text analysis unit that morphologically analyzes text data and outputs a morphological analysis result;
a secret information replacement unit that replaces the morphological analysis result according to a predetermined replacement rule for replacing a word included in secret information with another word;
a corpus accumulation unit that accumulates, as a corpus, the morphological analysis result replaced by the secret information replacement unit; and
a corpus statistics totaling unit that collects statistical information from the corpus accumulated by the corpus accumulation unit.

The corpus processing apparatus for creating a statistical language model according to claim 1, wherein the text data given to the text analysis unit is extracted from a predetermined database.

The corpus processing device for creating a statistical language model according to claim 1, wherein the text data given to the text analysis unit is extracted from a predetermined corpus.

The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacement unit replaces a word included in the secret information with a typical word.

The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacement unit replaces a word included in the secret information with a hidden character to which part-of-speech information is added.

The corpus processing device for creating a statistical language model according to claim 2, wherein the secret information replacement unit replaces a word included in the secret information with a random word.

The corpus processing device for creating a statistical language model according to either claim 1 or claim 2, wherein the mask rule or the replacement rule is a rule for masking or replacing, as a word included in the confidential information, a word whose part of speech is at least one of a person name, a place name, and a number.

A corpus processing method for creating a statistical language model, performed by a computer comprising a text analysis unit, a confidential information mask unit, a corpus accumulation unit, and a corpus statistics totaling unit, the method comprising:
a text analysis procedure in which the text analysis unit morphologically analyzes text data and outputs a morphological analysis result;
a confidential information masking procedure in which the confidential information mask unit masks the morphological analysis result according to a predetermined mask rule for masking a word included in confidential information;
a corpus accumulation procedure in which the corpus accumulation unit accumulates, as a corpus, the morphological analysis result masked by the confidential information mask unit; and
a corpus statistics totaling procedure in which the corpus statistics totaling unit collects statistical information from the corpus accumulated by the corpus accumulation unit.

A corpus processing method for creating a statistical language model, performed by a computer comprising a text analysis unit, a secret information replacement unit, a corpus accumulation unit, and a corpus statistics totaling unit, the method comprising:
a text analysis procedure in which the text analysis unit morphologically analyzes text data and outputs a morphological analysis result;
a secret information replacement procedure in which the secret information replacement unit replaces the morphological analysis result according to a predetermined replacement rule for replacing a word included in secret information with another word;
a corpus accumulation procedure in which the corpus accumulation unit accumulates, as a corpus, the morphological analysis result replaced by the secret information replacement unit; and
a corpus statistics totaling procedure in which the corpus statistics totaling unit collects statistical information from the corpus accumulated by the corpus accumulation unit.

A corpus processing program for creating a statistical language model, the program causing a computer having a text analysis unit, a confidential information mask unit, a corpus accumulation unit, and a corpus statistics totaling unit to execute:
a text analysis processing procedure in which the text analysis unit morphologically analyzes text data and outputs a morphological analysis result;
a confidential information mask processing procedure in which the confidential information mask unit masks the morphological analysis result according to a predetermined mask rule for masking a word included in confidential information;
a corpus accumulation processing procedure in which the corpus accumulation unit accumulates, as a corpus, the morphological analysis result masked by the confidential information mask unit; and
a corpus statistics totaling processing procedure in which the corpus statistics totaling unit collects statistical information from the corpus accumulated by the corpus accumulation unit.

A corpus processing program for creating a statistical language model, the program causing a computer having a text analysis unit, a secret information replacement unit, a corpus accumulation unit, and a corpus statistics totaling unit to execute:
a text analysis processing procedure in which the text analysis unit morphologically analyzes text data and outputs a morphological analysis result;
a secret information replacement processing procedure in which the secret information replacement unit replaces the morphological analysis result according to a predetermined replacement rule for replacing a word included in secret information with another word;
a corpus accumulation processing procedure in which the corpus accumulation unit accumulates, as a corpus, the morphological analysis result replaced by the secret information replacement unit; and
a corpus statistics totaling processing procedure in which the corpus statistics totaling unit collects statistical information from the corpus accumulated by the corpus accumulation unit.