JP2010256498A

JP2010256498A - Conversion model generating apparatus, voice recognition result conversion system, method and program

Info

Publication number: JP2010256498A
Application number: JP2009104494A
Authority: JP
Inventors: Masahiro Nishimitsu; 雅弘西光; Kentaro Nagatomo; 健太郎長友; Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-04-22
Filing date: 2009-04-22
Publication date: 2010-11-11

Abstract

PROBLEM TO BE SOLVED: To make it easy to generate a conversion model for each of a plurality of uses determined by the purpose for which a speech recognition result is converted.
SOLUTION: A conversion model generation apparatus has use-specific learning data selection means 501 and conversion model generation means 502. The selection means 501 uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the parts where the speech recognition result text indicated by the speech recognition result data for learning differs from the formatted text for learning. It then selects learning data for generating a conversion model for a predetermined individual use on the basis of the appearance tendency of the extracted differing parts. The generation means 502 generates the conversion model for the individual use using the learning data selected by the selection means 501.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a conversion model generation apparatus that generates a conversion model for converting a speech recognition result for purposes such as correcting speech recognition errors or formatting the text to improve readability, to a speech recognition result conversion system that converts a speech recognition result using the conversion model, and to a method and a program therefor.

Speech recognition technology automatically transcribes an utterance into text, word for word, as accurately as possible. One application of speech recognition technology is an apparatus that creates documents such as meeting minutes and subtitles. When creating meeting minutes, for example, speech recognition is applied to the speech data of a speaker's utterance, so that the utterance is transcribed word for word and the text of the minutes is produced.

However, text transcribed with such speech recognition technology usually contains a considerable number of speech recognition errors. Moreover, text that has been transcribed accurately word for word contains many colloquial expressions and the like, so it tends to be verbose and hard to read.

To produce appropriate documents such as meeting minutes, speech recognition errors must be corrected and the text must be formatted to improve readability. Prior art documents concerning methods that perform such correction or formatting automatically include, for example, Patent Document 1, Patent Document 2, and Non-Patent Document 1.

Patent Document 1 describes a method of generating a transformation model using a speech recognition result and formatted text for learning obtained by correcting and formatting that result.

Patent Document 2 describes a method that uses a learning database of paired sentences, each pair consisting of a sentence to be language-converted (input as speech or text) and its language-converted counterpart, to create the phrase dictionary and inter-phrase rules needed for the conversion automatically, with as little manual work as possible. As one example of the conversion rules so created, it cites a language conversion apparatus that converts a casual spoken sentence (colloquial expression) into a text sentence resembling written language (literary expression).

Non-Patent Document 1 describes a method of generating a conversion model using a speech transcription that contains no speech recognition errors and formatted text for learning obtained by formatting that transcription.

Patent Document 1: JP 2008-216341 A
Patent Document 2: JP 2000-305930 A

Non-Patent Document 1: Yuya Akita and Tatsuya Kawahara, "Automatic formatting of spoken-language speech recognition results for creating meeting minutes" (in Japanese), Proceedings of the Meeting of the Acoustical Society of Japan, September 2008, pp. 103-104

However, a method such as that of Patent Document 1 generates a conversion model using, as learning data, correct text, that is, formatted text obtained by processing for given uses such as "correction of speech recognition errors contained in the speech recognition result" and "text formatting, for example to improve readability". When a plurality of uses are defined, such a method cannot generate a conversion model for each individual use.

In the method of Patent Document 2, creating conversion rules for a desired use requires a learning database for that use in which each sentence to be language-converted is paired with its language-converted counterpart. Preparing such a use-specific learning database takes an enormous amount of labor.

The method of Non-Patent Document 1 uses a speech transcription free of speech recognition errors rather than a speech recognition result, and can therefore generate a conversion model for a use such as "text formatting, for example to improve readability". However, preparing a speech transcription free of speech recognition errors takes an enormous amount of labor.

Accordingly, an object of the present invention is to provide a conversion model generation apparatus that can easily generate a conversion model for each of a plurality of uses determined by the purpose for which a speech recognition result is converted, a speech recognition result conversion system that converts a speech recognition result using the conversion model, and a method and a program therefor.

A conversion model generation apparatus according to the present invention comprises: use-specific learning data selection means that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; and conversion model generation means that generates the conversion model for each individual use using the learning data selected by the use-specific learning data selection means.

A speech recognition result conversion system according to the present invention is one in which text formatting for a certain purpose is defined in advance as one of the individual uses for which conversion models are generated. The system comprises: use-specific learning data selection means that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; conversion model generation means that generates the conversion model for each individual use using the learning data selected by the use-specific learning data selection means; and speech recognition result formatting means that formats given speech recognition result data into text suited to the certain purpose by converting it with the conversion model for the text formatting use generated by the conversion model generation means.

The speech recognition result conversion system may also be one in which text formatting for a certain purpose is defined in advance as one of the individual uses for which conversion models are generated. In that case the system may comprise: use-specific learning data selection means that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; conversion model generation means that generates the conversion model for each individual use using the learning data selected by the use-specific learning data selection means; and speech recognition result formatting means that formats given speech recognition result data into text suited to the certain purpose by converting it with the conversion model for the text formatting use generated by the conversion model generation means.

The speech recognition result conversion system may also be one in which correction of speech recognition errors and text formatting for a certain purpose are defined in advance as the individual uses for which conversion models are generated. In that case the system may comprise: use-specific learning data selection means that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; speech recognition result correction means that corrects speech recognition errors contained in a speech recognition result by converting given speech recognition result data with the conversion model for the error correction use generated by the conversion model generation means; and speech recognition result formatting means that formats given speech recognition result data into text suited to the certain purpose by converting it with the conversion model for the text formatting use generated by the conversion model generation means.

A conversion model generation method according to the present invention includes: a use-specific learning data selection step of using speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and of selecting, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; and a conversion model generation step of generating the conversion model for each individual use using the learning data selected in the use-specific learning data selection step.

A speech recognition result conversion method according to the present invention includes: a use-specific learning data selection step of using speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and of selecting, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; a conversion model generation step of generating the conversion model for each individual use using the learning data selected in the use-specific learning data selection step; and a speech recognition result conversion step of converting a given speech recognition result using the conversion model for the individual use generated in the conversion model generation step.

A conversion model generation program according to the present invention causes a computer to execute: use-specific learning data selection processing that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; and conversion model generation processing that generates the conversion model for each individual use using the learning data selected in the use-specific learning data selection processing.

A speech recognition result conversion program according to the present invention causes a computer to execute: use-specific learning data selection processing that uses speech recognition result data for learning, which indicates the result of speech recognition processing on certain speech data, and formatted text for learning, which indicates the result of converting the speech data into text for a predetermined purpose, to extract the differing parts between the speech recognition result text indicated by the speech recognition result data for learning and the formatted text for learning, and that selects, on the basis of the appearance tendency of the extracted differing parts, learning data for generating a conversion model for each predetermined individual use; conversion model generation processing that generates the conversion model for each individual use using the learning data selected in the use-specific learning data selection processing; and speech recognition result conversion processing that converts a given speech recognition result using the conversion model for the individual use generated in the conversion model generation processing.

According to the present invention, a conversion model can easily be generated for each of a plurality of uses determined by the purpose for which a speech recognition result is converted. Furthermore, by using these conversion models, text that has undergone the conversion processing appropriate to each use can be obtained.

A block diagram showing a configuration example of the speech recognition result conversion model generation apparatus of the first embodiment of the present invention.
A flowchart showing an example of the operation of the speech recognition result conversion model generation apparatus 100 of the first embodiment.
An explanatory diagram showing an example of the speech recognition result text contained in the speech recognition result data for learning D14 and of the formatted text for learning D15.
An explanatory diagram showing an example of a word graph.
An explanatory diagram showing examples of various data when the use "text formatting for obtaining better speech translation results" is applied.
An explanatory diagram showing examples of various data when the use "text formatting for speech understanding" is applied.
A block diagram showing a configuration example of the speech recognition result correction system of the second embodiment.
An explanatory diagram showing an example of speech recognition result data for correction and of corrected text.
A block diagram showing a configuration example of the speech recognition result formatting system of the third embodiment.
An explanatory diagram showing an example of speech recognition result data for formatting and of formatted text.
A block diagram showing a configuration example of the speech recognition result conversion system of the fourth embodiment.
An explanatory diagram showing an example of speech recognition result data for correction/formatting and of corrected/formatted text.
A block diagram showing an overview of the present invention.
A block diagram showing an overview of the speech recognition result conversion system of the present invention.
A block diagram showing an overview of the speech recognition result conversion system of the present invention.
A block diagram showing an overview of the speech recognition result conversion system of the present invention.

Embodiment 1.
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the speech recognition result conversion model generation apparatus of the first embodiment of the present invention. The speech recognition result conversion model generation apparatus 100 shown in FIG. 1 comprises conversion model learning data selection means 11, conversion model generation means 12, a conversion model storage unit 13, a learning speech recognition result storage unit 14, and a learning formatted text storage unit 15.

The conversion model learning data selection means 11 and the conversion model generation means 12 are realized, for example, by a CPU of the speech recognition result conversion model generation apparatus 100 that operates according to a program. The conversion model storage unit 13, the learning speech recognition result storage unit 14, and the learning formatted text storage unit 15 are realized, for example, by a storage device of the speech recognition result conversion model generation apparatus 100.

The conversion model storage unit 13 stores the conversion models generated by the conversion model generation means 12 described later. In this embodiment, a "conversion model" is information indicating the rules or method of a conversion applied to a speech recognition result in order to achieve a given purpose, such as correcting speech recognition errors or formatting the text to improve readability. Each conversion model stored in the conversion model storage unit 13 is hereinafter referred to as a conversion model D13. A conversion model D13 may be, for example, a set of pieces of information each indicating a conversion rule.

The learning speech recognition result storage unit 14 stores speech recognition result data for learning. Speech recognition result data for learning is information that indicates the speech recognition result for certain speech data and is used as learning data in this apparatus. Each item of speech recognition result data for learning stored in the learning speech recognition result storage unit 14 is hereinafter referred to as speech recognition result data for learning D14. The speech recognition result data for learning D14 may be, for example, the speech recognition result text, the N-best list, the word graph, the part of speech or semantic information of the words contained in the speech recognition result, the acoustic likelihood, or the confidence score obtained by running a speech recognition engine on the speech data of a meeting.
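As a rough illustration of what one item of D14 might carry, the following Python sketch groups the kinds of information listed above into a single record. The class and field names (`RecognizedWord`, `LearningRecognitionResult`, `word_graph_edges`, and so on) are hypothetical and do not come from the patent, and the numeric values are placeholders rather than real recognizer output.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RecognizedWord:
    """One word of the 1-best recognition result (hypothetical layout)."""
    surface: str
    part_of_speech: Optional[str] = None
    acoustic_likelihood: Optional[float] = None
    confidence: Optional[float] = None


@dataclass
class LearningRecognitionResult:
    """One item of speech recognition result data for learning (D14)."""
    words: List[RecognizedWord]
    # Alternative hypotheses, each a list of word surfaces (N-best).
    n_best: List[List[str]] = field(default_factory=list)
    # Word graph as (start_node, end_node, word) edges; an assumed encoding.
    word_graph_edges: List[Tuple[int, int, str]] = field(default_factory=list)

    @property
    def text(self) -> List[str]:
        """The recognition result text as a list of word surfaces."""
        return [w.surface for w in self.words]


# Placeholder values only; a real engine would supply these.
d14 = LearningRecognitionResult(
    words=[
        RecognizedWord("えー", part_of_speech="filler", confidence=0.4),
        RecognizedWord("血痕", part_of_speech="noun", confidence=0.5),
        RecognizedWord("を", part_of_speech="particle", confidence=0.9),
        RecognizedWord("してます", part_of_speech="verb", confidence=0.8),
    ]
)
```

Only the `text` view is needed for the difference extraction described later; the other fields are optional because the description says D14 "may be" any of these kinds of information.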

The learning formatted text storage unit 15 stores formatted text for learning. Formatted text for learning is transcribed text obtained by correcting and/or formatting (hereinafter written as "correcting/formatting") the speech recognition result indicated by the speech recognition result data for learning D14. Each item of formatted text for learning stored in the learning formatted text storage unit 15 is hereinafter referred to as formatted text for learning D15. The formatted text for learning D15 may be, for example, text such as meeting minutes in which a speaker's remarks have been appropriately corrected/formatted. The formatted text for learning D15 need not have been prepared with individual uses in mind; data created for some final purpose suffices. In this embodiment, it is assumed to contain a plurality of text sentences created under the same conditions, as in the conversion result for an entire set of meeting minutes.

Strictly speaking, the formatted text for learning D15 need not be the result of correcting and formatting a speech recognition result. For example, it may be text transcribed for some final purpose directly from the input speech on which the speech recognition result is based, without going through speech recognition processing. In other words, it need only be information equivalent to what would be obtained by applying, to the speech recognition result, conversion processing for a predetermined final purpose.

The conversion model learning data selection means 11 uses the speech recognition result text contained in the speech recognition result data for learning D14 and the formatted text for learning D15 to extract the parts where the two texts differ. It uses the tendency of those differences to select learning data for generating a conversion model, and provides the selected learning data to the conversion model generation means 12.

Here, extracting the differing parts of the two texts (the speech recognition result text and the formatted text for learning) may be, for example, a process of aligning the two texts (placing them side by side and comparing them) to obtain the substrings of the parts where they differ. The tendency of the differences may be, for example, the number of occurrences of each pair of differing substrings in the speech recognition result text contained in given speech recognition result data for learning D14 and the formatted text for learning D15 (that is, the probability of occurrence over the whole of the formatted text for learning D15).
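The alignment and counting described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes word-level alignment with the standard library's `difflib.SequenceMatcher`, and it assumes that the "probability of occurrence" is the count of a differing pair divided by the total number of differing pairs found in the corpus. The function names are hypothetical. The first sentence pair is the one from FIG. 3; the second is an invented example.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_differences(recognized, formatted):
    """Align two token lists and return the differing spans as
    (recognized_span, formatted_span) pairs of tuples."""
    matcher = SequenceMatcher(a=recognized, b=formatted, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            diffs.append((tuple(recognized[i1:i2]), tuple(formatted[j1:j2])))
    return diffs


def appearance_tendency(sentence_pairs):
    """Count each differing pair over the corpus and attach its relative
    frequency (assumed normalization: count / total number of pairs)."""
    counts = Counter()
    for recognized, formatted in sentence_pairs:
        counts.update(extract_differences(recognized, formatted))
    total = sum(counts.values())
    return {pair: (n, n / total) for pair, n in counts.items()} if total else {}


corpus = [
    # Token sequences from FIG. 3.
    (["えー", "血痕", "を", "してます"], ["結婚", "を", "しています"]),
    # Hypothetical second sentence: only the filler is removed.
    (["えー", "今日", "は", "晴れ", "です"], ["今日", "は", "晴れ", "です"]),
]
tendency = appearance_tendency(corpus)
```

Note that a generic sequence matcher aligns coarsely: for the FIG. 3 pair it reports the filler and the misrecognized word together as one replaced span, ("えー", "血痕") to ("結婚",), rather than as a deletion followed by a substitution. A dedicated aligner could separate the two.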

The conversion model generation means 12 generates a conversion model D13 using the conversion model learning data selected by the conversion model learning data selection means 11. For example, the conversion model generation means 12 may generate a model whose conversion rules are the pairs of differing substrings, selected by the conversion model learning data selection means 11, between the speech recognition result text contained in the speech recognition result data for learning D14 and the formatted text for learning D15. In that case, it suffices to generate, as the conversion model D13, a set of conversion rules each expressed by one such pair of differing substrings.
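A conversion model of this rule-set kind could look like the following Python sketch. It is illustrative only: the description says the model may be a set of rules built from the differing substring pairs, but it does not say how conflicting rules are resolved or how rules are applied. Picking the most frequent target for each source span and applying rules by greedy longest match are assumptions made here, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_conversion_model(selected_pairs):
    """Turn selected (recognized_span, formatted_span) pairs into rules.
    When one source span has several targets, keep the most frequent
    one (an assumed tie-breaking policy)."""
    votes = defaultdict(Counter)
    for source, target in selected_pairs:
        if source:  # a rule needs a non-empty left-hand side
            votes[source][target] += 1
    return {src: c.most_common(1)[0][0] for src, c in votes.items()}


def apply_conversion_model(model, tokens):
    """Rewrite a token list by greedy longest-match rule application."""
    max_len = max((len(src) for src in model), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in model:
                out.extend(model[key])  # replace the matched span
                i += n
                break
        else:
            out.append(tokens[i])  # no rule matched; keep the token
            i += 1
    return out


# Differing pairs for the FIG. 3 example, given directly as input.
pairs = [
    (("えー", "血痕"), ("結婚",)),
    (("してます",), ("しています",)),
]
model_d13 = build_conversion_model(pairs)
converted = apply_conversion_model(model_d13, ["えー", "血痕", "を", "してます"])
```

With these two rules the FIG. 3 recognition result is rewritten to 結婚／を／しています. In the apparatus, the pairs passed to `build_conversion_model` would be only those selected for one use, so that a separate rule set is produced per use.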

Each storage unit may instead be realized by an external storage device. In that case, the apparatus need only be provided with data input/output means.

Next, the operation of this embodiment is described. FIG. 2 is a flowchart showing an example of the operation of the speech recognition result conversion model generation apparatus 100 of this embodiment. In this embodiment, two uses are assumed to be given for conversion of the speech recognition result: "correction of speech recognition errors" (use 1) and "text formatting to improve readability" (use 2). This means that the formatted text for learning D15 is the speech recognition result after both correction-related conversion and formatting-related conversion have been applied (or text equivalent to the result of that conversion). The purpose of the text formatting is not particularly limited. It may be to improve readability, as in this example, or to convert colloquial expressions into literary expressions. It may also be, for example, text formatting for obtaining better speech translation results or text formatting for speech understanding.

As shown in FIG. 2, in the speech recognition result conversion model generation device 100 of this embodiment, the learning speech recognition result data D14 and the learning formatted text D15 are first input to the conversion model learning data selection means 11 (step S11). Alternatively, the conversion model learning data selection means 11 may, on receiving a notification to start processing, read the necessary data from the learning speech recognition result storage unit 14 and the learning formatted text storage unit 15.

FIG. 3 is an explanatory diagram showing an example of the speech recognition result text contained in the learning speech recognition result data D14 and the learning formatted text D15 corresponding to that data (that is, for the same phrase). In FIG. 3, the learning speech recognition result text is 「えー／血痕／を／してます」 ("／" denotes a word boundary), whereas the learning formatted text is 「結婚／を／しています」. Here 『えー』 is a filler ("uh"), 『血痕』 ("bloodstain") is a misrecognition of its homophone 『結婚』 ("marriage"), and 『してます』 is the colloquial form of 『しています』.

When the learning speech recognition result data D14 and the learning formatted text D15 to be processed are input, the conversion model learning data selection means 11 uses them to select learning data for generating conversion models (step S12). Here, the differing parts between the learning speech recognition result data D14 and the learning formatted text D15 are selected as the learning data for generating a conversion model for each use.

Specifically, the conversion model learning data selection means 11 first extracts the differing parts between the learning speech recognition result text 「えー／血痕／を／してます」 and the learning formatted text 「結婚／を／しています」. Here, three differing parts are extracted: (1) deletion of 『えー』, (2) replacement of 『血痕』 with 『結婚』, and (3) replacement of 『してます』 with 『しています』. Then, based on these differing-part pairs, the learning data is sorted by use, as learning data for generating a conversion model for each use.
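The extraction of differing parts described above can be sketched as a word-level alignment. The following is a minimal illustration only, assuming a plain edit-distance (Levenshtein) alignment over word lists; the function name `extract_differences` and the representation of a deletion as a pair with an empty target are hypothetical and are not taken from the specification, which does not prescribe a particular alignment algorithm.

```python
def extract_differences(asr_words, formatted_words):
    """Align two word lists by edit distance and return the differing parts
    as (source, target) pairs; target "" means deletion, source "" insertion."""
    n, m = len(asr_words), len(formatted_words)
    # dist[i][j] = edit distance between asr_words[:i] and formatted_words[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if asr_words[i - 1] == formatted_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Trace back from the end, preferring match/substitution over deletion.
    diffs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if asr_words[i - 1] == formatted_words[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    diffs.append((asr_words[i - 1], formatted_words[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            diffs.append((asr_words[i - 1], ""))
            i -= 1
        else:
            diffs.append(("", formatted_words[j - 1]))
            j -= 1
    diffs.reverse()
    return diffs


# The example of FIG. 3.
print(extract_differences(["えー", "血痕", "を", "してます"],
                          ["結婚", "を", "しています"]))
```

For the FIG. 3 example this yields the three differing parts (1) to (3). When several alignments have the same cost, the result depends on the tie-breaking order chosen in the traceback.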

As an example of learning data selection, the following describes the case where the learning formatted text D15 was produced with the ultimate aim of creating meeting minutes.

Minutes created with speech recognition technology usually contain a fair number of speech recognition errors. In addition, text transcribed verbatim contains many colloquial expressions and the like, which makes it redundant and hard to read. Therefore, to produce appropriate minutes or similar documents, speech recognition errors are corrected and the text is formatted to improve readability.

In the example shown in FIG. 3, replacing the misrecognized 『血痕』 with 『結婚』 (differing part (2) above) is a speech recognition error correction. Deleting the filler 『えー』 and replacing the colloquial 『してます』 with the written-style 『しています』 (differing parts (1) and (3) above) are formatting operations. Here, a conversion model is generated for each of two uses: "correction of speech recognition errors" and "formatting to improve readability". To this end, the conversion model learning data selection means 11 sorts each differing-part pair into those arising from correction and those arising from formatting. As one such selection method, an example using the occurrence count of differing parts is described below.

For example, because formatting is applied throughout the entire minutes, differing parts such as "deletion of 『えー』" and "replacement of 『してます』 with 『しています』" inevitably occur many times. In contrast, because speech recognition errors have a wide variety of causes, a differing part caused by a recognition error, such as "replacement of 『血痕』 with 『結婚』", occurs fewer times than a differing part that reformats a colloquial expression into written style, such as "replacement of 『してます』 with 『しています』". The learning data for generating a conversion model for each use is selected on the basis of this occurrence count, which differs by type of conversion.

For example, the conversion model learning data selection means 11 may extract the differing parts sentence by sentence from the conversion result for the entire minutes and, for each conversion operation applied to the extracted differing parts, count its occurrences across the entire minutes. Each count is then compared with a predetermined threshold provided for each use. If the count is judged to fall within the range indicated by that threshold, the learning formatted text in which that conversion operation was performed (that is, the learning formatted text D15 corresponding to a sentence containing a differing part to which the counted conversion operation was applied) may be selected as learning data for generating the conversion model for that use.

For example, when selecting learning data for the "correction of speech recognition errors" use, if the occurrence count of a conversion operation is smaller than the threshold defined for that use, the learning formatted text in which that conversion operation was performed may be selected as learning data for that use. Likewise, when selecting learning data for the "formatting to improve readability" use, if the occurrence count of a conversion operation is larger than the threshold defined for that use, the learning formatted text in which that conversion operation was performed may be selected as learning data for that use. It is also possible to use a single threshold: if the occurrence count (or occurrence probability) is smaller than the threshold, the data is selected for "correction of speech recognition errors", and if it is equal to or greater than the threshold, the data is selected for "formatting to improve readability".
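The single-threshold variant described above can be illustrated as follows. This is a sketch under stated assumptions: the function name `select_by_occurrence`, the sample counts, and the threshold value 3 are all hypothetical, and for brevity the sketch sorts the differing-part pairs themselves rather than the learning formatted texts that contain them.

```python
from collections import Counter


def select_by_occurrence(diffs, threshold):
    """Split differing-part pairs by how often they occur in the whole minutes.

    diffs: list of (source, target) pairs collected over all sentences.
    Pairs seen fewer than `threshold` times go to "correction";
    pairs seen `threshold` times or more go to "formatting".
    """
    counts = Counter(diffs)
    correction = sorted(pair for pair, n in counts.items() if n < threshold)
    formatting = sorted(pair for pair, n in counts.items() if n >= threshold)
    return correction, formatting


# Hypothetical counts: the filler deletion and the colloquial-to-written
# replacement recur throughout the minutes; the misrecognition occurs once.
diffs = ([("えー", "")] * 5
         + [("してます", "しています")] * 4
         + [("血痕", "結婚")])
correction, formatting = select_by_occurrence(diffs, threshold=3)
print(correction)   # pairs selected for "correction of speech recognition errors"
print(formatting)   # pairs selected for "formatting to improve readability"
```

Using one threshold per use instead, as also described above, would only change the two comparisons so that each refers to its own threshold.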

Next, the conversion model generation means 12 generates the conversion model D13 using the learning data selected by the conversion model learning data selection means 11 (step S13). For example, the conversion model generation means 12 may generate, as the conversion model D13, information that represents the differing parts contained in the selected learning data as conversion rules. In the example above, the conversion model D13 for "correction of speech recognition errors" may be generated as information that includes, as a conversion rule, at least differing part (2) "replacement of 『血痕』 with 『結婚』", which is contained in the learning formatted text D15 selected as learning data for that use. Similarly, the conversion model D13 for "formatting to improve readability" may be generated as information that includes, as conversion rules, at least differing parts (1) "deletion of 『えー』" and (3) "replacement of 『してます』 with 『しています』", which are contained in the learning formatted text D15 selected as learning data for that use.
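One simple way to picture a conversion model made of such rules is a lookup table from a source word to its replacement. The sketch below is illustrative only: the dictionary representation and the names `build_conversion_model` and `apply_conversion_model` are assumptions, and the word-by-word application step is not specified in this section.

```python
def build_conversion_model(selected_pairs):
    """Build a rule table from (source, target) pairs; target "" means delete."""
    return {src: dst for src, dst in selected_pairs if src}


def apply_conversion_model(words, model):
    """Apply the rules word by word, dropping words whose rule is a deletion."""
    converted = []
    for word in words:
        new_word = model.get(word, word)  # unchanged if no rule matches
        if new_word:
            converted.append(new_word)
    return converted


correction_model = build_conversion_model([("血痕", "結婚")])
formatting_model = build_conversion_model([("えー", ""),
                                           ("してます", "しています")])

asr_words = ["えー", "血痕", "を", "してます"]
corrected = apply_conversion_model(asr_words, correction_model)
formatted = apply_conversion_model(corrected, formatting_model)
print(corrected)
print(formatted)
```

Keeping the two rule tables separate mirrors the point of the embodiment: the correction rules can be applied on their own, without the formatting rules.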

When counting occurrences per conversion operation, it is also conceivable to count not only differing parts with identical conversion operations, but also to treat differing parts as the same conversion method, and count them as one conversion operation, if they share properties such as the position where the differing part was extracted or the part of speech that is replaced or deleted.

The example above uses occurrence counts to select learning data, but other methods are possible. For example, a word graph contained in the learning speech recognition result data D14 may be used. FIG. 4 is an explanatory diagram showing an example of a word graph. FIG. 4 shows a word graph that represents, as a graph, multiple recognition result candidates obtained when the speech data 「えー／結婚／を／してます」 is input. In the example of FIG. 4, the recognition result candidates for this speech data are 「えー／結婚／を／してます」 and 「えー／血痕／を／してます」, and the top-ranked candidate is 「えー／血痕／を／してます」.

In this case, the conversion model learning data selection means 11 uses the word graph contained in the learning speech recognition result data D14 and the learning formatted text D15 to select learning data for generating a conversion model for each use. For example, the conversion model learning data selection means 11 compares the two recognition result candidates represented by the word graph with the learning formatted text D15, extracts the differing parts, and selects the learning data for each use from the extracted differing-part pairs. In the example of FIG. 4, the differing parts between the two recognition result candidates 「えー／血痕／を／してます」 and 「えー／結婚／を／してます」 and the corresponding learning formatted text D15 「結婚／を／しています」 may be extracted as: (1) deletion of 『えー』, (2) replacement of 『血痕』 with 『結婚』, and (3) replacement of 『してます』 with 『しています』.

This selection method is described concretely below, again using the case where minutes are used as the learning formatted text D15, as in the occurrence-count example.

Because speech recognition transcribes the input speech data word for word, the speech recognition result data outputs, as text and unchanged unless misrecognized, even the filler 『えー』 and the colloquial expression 『してます』, which are inappropriate for minutes. Consequently, the recognition result candidates in a word graph, which is one form of speech recognition result, contain the filler 『えー』 and the colloquial 『してます』, whereas the text of the minutes does not. Focusing on this difference, among the substring pairs of the differing parts, a pair containing a substring that is not in the word graph is judged to be data related to formatting, and any other data (a pair whose substrings are contained in the word graph) is judged to be data related to correction. In this way, learning data for generating a conversion model for each use is obtained.
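The word-graph criterion can be sketched as a membership test. The following is an illustration under simplifying assumptions: the word graph is reduced to the set of words appearing on any of its paths, a deletion (which has no target string to look up) is treated as a pair containing a substring not in the graph, and the name `classify_by_word_graph` is hypothetical.

```python
def classify_by_word_graph(diffs, graph_words):
    """Classify (source, target) pairs by whether both sides occur in the graph.

    Both sides present in the word graph -> correction.
    Any side missing (including an empty side) -> formatting.
    """
    correction, formatting = [], []
    for src, dst in diffs:
        if src and dst and src in graph_words and dst in graph_words:
            correction.append((src, dst))
        else:
            formatting.append((src, dst))
    return correction, formatting


# Words on the two candidate paths of the FIG. 4 word graph.
graph_words = {"えー", "結婚", "血痕", "を", "してます"}
diffs = [("えー", ""), ("血痕", "結婚"), ("してます", "しています")]
correction, formatting = classify_by_word_graph(diffs, graph_words)
print(correction)
print(formatting)
```

With the FIG. 4 data, 『結婚』 appears in the graph as a competing candidate, so its pair is classified as correction, while 『しています』 never appears in the graph, so its pair is classified as formatting.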

With the selection method using a word graph, the learning formatted text D15 from which data is selected need not be data created under uniform conditions, such as the data of an entire set of minutes; it may be data selected sentence by sentence.

In this example, whether a substring of a differing part is contained in the word graph is used as the selection criterion. Alternatively, the part of speech or semantic information of words in the speech recognition result, the acoustic likelihood, the language likelihood, the confidence, and so on may be used to compute the likelihood that a substring of a differing part exists in the word graph, and selection may be based on that likelihood.

For example, FIG. 4 shows only the surface form of each word in the speech recognition result, but each word may also be annotated with information such as its part of speech and semantic information, its acoustic likelihood (how plausible the word is acoustically), its language likelihood (how plausible the word is linguistically), and its confidence (how likely the word is to be correct).

Using this information, the likelihood that a substring of a differing part exists in the word graph can be computed. For part of speech and semantic information, for example, a likelihood is set for each part of speech (or each item of semantic information). If the likelihood for the part of speech (or semantic information) of a word in the substring of a differing part is at or below a predetermined threshold, the substring is treated as absent from the word graph even if it is actually present, and the learning data for the desired use may be selected according to the resulting presence or absence.

For acoustic likelihood, language likelihood, and confidence, the value itself can be used as the likelihood and the same presence/absence judgment made. For example, words with a confidence of 0.1 or less may be treated as absent from the word graph when selecting learning data for the desired use.
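The confidence-based judgment can be pictured as a filter applied to the word graph before the membership test. This is a minimal sketch: the function name `present_words` and the per-word confidence values are hypothetical, and only the 0.1 cutoff comes from the text above.

```python
def present_words(word_confidence, threshold=0.1):
    """Return the words treated as present in the word graph.

    Words whose confidence is at or below the threshold are treated as absent,
    even though they physically appear in the graph.
    """
    return {word for word, conf in word_confidence.items() if conf > threshold}


# Hypothetical confidences attached to the words of a word graph.
word_confidence = {"えー": 0.8, "血痕": 0.6, "結婚": 0.05, "を": 0.9, "してます": 0.7}
effective_graph = present_words(word_confidence)
print(sorted(effective_graph))
```

In this made-up case 『結婚』 falls below the cutoff, so it would be treated as absent when the presence/absence judgment is made.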

Multiple such likelihoods may also be combined. For example, if the sum of the likelihoods of the individual items of information is at or below a reference value, the substring may be treated as absent from the word graph. Furthermore, a statistically trained model that takes one or more of these items of information and outputs the likelihood that a substring of a differing part exists in the word graph may be used. For example, the existence likelihood may be computed and output using a maximum entropy model defined by equation (1).

In equation (1), x is the speech recognition result, such as the surface form of a word, its part of speech, and its semantic information. y is a label indicating whether a substring of a differing part exists in the word graph. P(y|x) is the likelihood of y given x, that is, the likelihood that a substring of a differing part exists in the word graph. Λ is the set of model parameters. Φ(y, x) is the set of feature values of the model, consisting of information such as the part of speech of a word, or combinations thereof. Zx is the normalization term. Other statistically trained models, such as neural networks, hidden Markov models, and support vector machines, may also be used.
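Equation (1) itself is not reproduced in this text. A conventional conditional maximum entropy form consistent with the symbols defined above would be the following. Reading Λ · Φ(y, x) as an inner product of a parameter vector and a feature vector, and taking the sum in the normalization term over the possible label values y', are assumptions based on the standard formulation, not on the original equation image.

```latex
P(y \mid x) = \frac{1}{Z_x} \exp\bigl(\Lambda \cdot \Phi(y, x)\bigr),
\qquad
Z_x = \sum_{y'} \exp\bigl(\Lambda \cdot \Phi(y', x)\bigr)
```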

The examples above all use a word graph, but the same discrimination method can be applied to an N-best list. Specifically, the selection that was based on whether a substring is contained in the word graph is instead based on whether it is contained in the N-best list.

The examples above describe learning data selection and conversion model generation for the "text formatting, such as improving readability" use, as applied to minutes, subtitles, and the like. Formatting is also performed in other uses, however, and learning data selection and conversion model generation are likewise possible for "speech translation", in which speech recognition results are machine-translated, and "speech understanding", in which a machine is operated by speech.

The present invention can also be applied, for example, to the "text formatting for obtaining better speech translation results" use. Speech translation is a technique that translates an utterance into another language by recognizing the speech and machine-translating the result. For example, given the Japanese utterance 「私は駅に行きたい」 as input, speech recognition and machine translation yield the English output "I want to go to the station." The following describes, as an example, the case where fixed template sentences for machine translation are used to obtain better speech translation results.

FIG. 5 is an explanatory diagram showing examples of the data involved when the method is applied to the "text formatting for obtaining better speech translation results" use. As shown in FIG. 5, suppose that the utterance 「えー私は駅に行きたいんですけど」 is input as speech corresponding to the machine translation template sentence 「私は駅に行きたい」, and that the speech recognition result is 「えー／私は／木に／行きたいんですけど」, in which 『駅に』 ("to the station") is misrecognized as 『木に』 ("to the tree"). Suppose also that the learning formatted text D15, one of whose purposes is at least to obtain better speech translation results, is input as the result of correcting and formatting the recognition result 「えー／私は／木に／行きたいんですけど」 into the template sentence 「私は駅に行きたい」.

As in the generation of conversion models where text formatting for minutes is one of the uses, the conversion model learning data selection means 11 aligns the learning formatted text D15 with the learning speech recognition result data and extracts the differing parts: "deletion of 『えー』", "replacement of 『木に』 with 『駅に』", and "replacement of 『行きたいんですけど』 with 『行きたい』".

Then, as with text formatting for minutes, the predetermined selection is performed on the basis of occurrence counts or of whether a substring is contained in the word graph (or N-best list). In this example, "replacement of 『木に』 with 『駅に』" is consequently judged to be a speech recognition error correction, and the other differing parts are judged to be formatting for obtaining better speech translation results.

The present invention can also be applied, for example, to the "text formatting for speech understanding" use. Speech understanding is a technique that recognizes an utterance, uses the recognition result to understand the intent of the utterance, and performs processing according to that intent. It is used, for example, to operate a car navigation system by voice. The following describes, as an example, the case where fixed template sentences provided for speech understanding are used.

FIG. 6 is an explanatory diagram showing examples of the data involved when the method is applied to the "text formatting for speech understanding" use. As shown in FIG. 6, suppose that the car navigation system has the speech understanding template sentence 「京都に行きたい」 (a template that sets the destination of the car navigation system to Kyoto). As in the speech translation case, suppose that the utterance 「えー京都に行きたいんですけど」 is input as speech corresponding to the template sentence 「京都に行きたい」, and that the speech recognition result is 「えー／今日とに／行きたいんですけど」, in which 『京都に』 ("to Kyoto") is misrecognized as the similar-sounding 『今日とに』. Suppose also that the learning formatted text D15, one of whose purposes is at least speech understanding, is input as the result of correcting and formatting the recognition result 「えー／今日とに／行きたいんですけど」 into the car navigation template sentence 「京都に／行きたい」.

As in the generation of conversion models where text formatting for minutes or for speech translation is one of the uses, the conversion model learning data selection means 11 aligns the learning formatted text D15 with the learning speech recognition result data and extracts the differing parts: "deletion of 『えー』", "replacement of 『今日とに』 with 『京都に』", and "replacement of 『行きたいんですけど』 with 『行きたい』".

Then, as with text formatting for minutes or for speech translation, the predetermined selection is performed on the basis of occurrence counts or of whether a substring is contained in the word graph (or N-best list). In this example, "replacement of 『今日とに』 with 『京都に』" is consequently judged to be a speech recognition error correction, and the other differing parts are judged to be formatting for speech understanding.

As described above, according to this embodiment, the conversion model learning data selection means 11 automatically selects, from the learning speech recognition result data D14 and the learning formatted text D15, learning data for each conversion purpose, such as "correction of speech recognition errors contained in speech recognition results" and "text formatting, such as improving readability", and the conversion model generation means 12 uses each set of learning data to generate a conversion model for each use. This makes it possible to generate conversion models for each such use easily and with little effort.

In general, "text formatting, such as improving readability" depends heavily on the type of minutes being produced and on the person producing them. In contrast, "correction of speech recognition errors contained in speech recognition results" does not depend on these factors. It is therefore possible to build a conversion model from only the conversion operations that are independent of these factors, or from only those that depend on them. For example, a conversion model for the "correction of speech recognition errors contained in speech recognition results" use can readily be applied even when the type of minutes or their author differs, enabling efficient automatic correction of speech recognition errors in such cases.

As an example of how the result of "text formatting, such as improving readability" differs with the purpose of the document and its author: in the minutes of a government assembly such as the National Diet, it is usual to record what each speaker said accurately, without summarizing, to prevent misinterpretation of the member's intent. In contrast, corporate minutes often do not record statements accurately but give only a summary of the key points of the meeting.

Moreover, when the author of the minutes differs, the summary of the key points often differs from author to author. Beyond summaries, the style of expression may also differ, for instance between the polite 「です・ます」 style and the plain 「である」 style, so the processing applied to the speech recognition result differs as well.

Thus the processing applied to speech recognition results varies with the type of minutes and their author. According to the present invention, however, processing that does not depend on the purpose of the document or its author ("correction of speech recognition errors" in this example) is distinguished, by a predetermined criterion, from processing that does ("text formatting" in this example), and the data for each kind of processing is automatically selected as learning data. A conversion model can therefore be created for each kind of processing.

When, as in conventional approaches, a conversion model is generated without distinguishing between data for "independent processing" and data for "dependent processing", the resulting model inevitably depends on the purpose of the document and its author. Such a model is difficult to apply when the purpose or author differs from those of the given learning data. In this embodiment, by contrast, the two kinds of processing are distinguished, the data for each is selected, and a conversion model is generated from the data extracted for each. A conversion model generated from learning data extracted for "independent processing", for example, is therefore applicable even when the purpose or author differs from those of the documents used as learning data. Consequently, even where a conventional conversion model would be difficult to apply, using the "independent processing" conversion model reduces the effort of producing documents with speech recognition technology.

For example, a conversion model for "speech recognition error correction" generated using the minutes of a government assembly as the learning formatted text can also be used to prepare minutes at a private company.

Also, for example, if learning speech recognition result data D14 is supplied separately for each different pattern and the difference between the resulting conversion models is taken, it is possible to generate a conversion model that performs text shaping dependent on the purpose of the document or its creator.

Embodiment 2.
Next, a second embodiment of the present invention will be described. As an application of the conversion model generated by the speech recognition result conversion model generation device described in the first embodiment, this embodiment describes a speech recognition result correction system that uses a conversion model intended for correcting speech recognition errors. FIG. 7 is a block diagram showing a configuration example of the speech recognition result correction system of this embodiment. The speech recognition result correction system 200 shown in FIG. 7 includes the components (11 to 15) of the speech recognition result conversion model generation device 100 shown in FIG. 1, a correction speech recognition result storage unit 21, speech recognition result correction means 22, and a corrected text storage unit 23.

The conversion model storage unit 13 in this embodiment need only store at least a conversion model for speech recognition error correction. In FIG. 7 it is labeled as conversion model storage unit 13A to make explicit that it contains a conversion model for speech recognition error correction. Conversion models for other uses may also be stored. In that case, for example, an identifier may be attached to each model so that the conversion model for speech recognition error correction can be identified.

This embodiment may also be implemented in a restricted form in which the conversion model learning data selection means 11 selects only the learning data for generating a conversion model for speech recognition error correction, and the conversion model generation means 12 uses that learning data to generate only the conversion model for speech recognition error correction.

The correction speech recognition result storage unit 21 stores correction speech recognition result data. Correction speech recognition result data is information indicating a speech recognition result, obtained by recognizing certain speech data with a certain speech recognition engine, that is the target of correction in this system.

The speech recognition result correction means 22 corrects (converts) the correction speech recognition result data using the conversion model D13 for speech recognition error correction stored in the conversion model storage unit 13A, and generates corrected text. Suppose, for example, that a conversion model generated for the purpose of correcting speech recognition errors defines a conversion rule to the effect of "replace 血痕 (kekkon, 'blood stain') with 結婚 (kekkon, 'marriage')". The speech recognition result correction means 22 may then scan the input correction speech recognition result data for places where the rule applies and, where one is found, replace 血痕 with 結婚, thereby generating the corrected text.

FIG. 8 is an explanatory diagram showing an example of correction speech recognition result data and the corrected text generated from it. As shown in FIG. 8, when the correction speech recognition result data 「血痕してます」 is input, for example, 「結婚してます」 ("I'm married") is generated as the corrected text.
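The scan-and-replace step described for FIG. 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a conversion model can be represented as a plain table of string replacements, and the names `CORRECTION_RULES` and `apply_rules` are hypothetical.

```python
from typing import Mapping

# Hypothetical rule table (an assumption for illustration): each entry maps a
# misrecognized string to the string that should replace it.
CORRECTION_RULES = {"血痕": "結婚"}


def apply_rules(text: str, rules: Mapping[str, str]) -> str:
    """Replace every occurrence of each rule's source string with its target."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


corrected = apply_rules("血痕してます", CORRECTION_RULES)
print(corrected)  # 結婚してます
```

A real conversion model would likely carry context or scores so that a correctly recognized 血痕 is not rewritten; the plain table above ignores that.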

The corrected text storage unit 23 stores the corrected text generated by the speech recognition result correction means 22, that is, the text obtained by applying the conversion model for speech recognition error correction to the correction speech recognition result data.

The conversion model for speech recognition error correction may be generated in the same way as in the first embodiment. In this embodiment, the speech recognition result correction means 22 is realized by a CPU or the like operating according to a program. The correction speech recognition result storage unit 21 and the corrected text storage unit 23 are realized by a storage device such as a memory. These means may be implemented in a device separate from the speech recognition result conversion model generation device 100, or together with it in a single device.

As described above, according to this embodiment, speech recognition errors can be corrected automatically and efficiently even when the conversions applied to the learning data given to the speech recognition result conversion model generation device 100 include conversions (for example, text-shaping conversions) other than those desired when preparing text from the correction speech recognition result data.

Embodiment 3.
Next, a third embodiment of the present invention will be described. As an application of the conversion model generated by the speech recognition result conversion model generation device described in the first embodiment, this embodiment describes a speech recognition result shaping system that uses a conversion model intended for text shaping. FIG. 9 is a block diagram showing a configuration example of the speech recognition result shaping system of this embodiment. The speech recognition result shaping system 300 shown in FIG. 9 includes the components (11 to 15) of the speech recognition result conversion model generation device 100 shown in FIG. 1, a shaping speech recognition result storage unit 31, speech recognition result shaping means 32, and a shaped text storage unit 33.

The conversion model storage unit 13 in this embodiment need only store at least a conversion model for text shaping. In FIG. 9 it is labeled as conversion model storage unit 13B to make explicit that it contains a conversion model for text shaping. Conversion models for other uses may also be stored. In that case, for example, an identifier may be attached to each model so that the conversion model for text shaping can be identified.

This embodiment may also be implemented in a restricted form in which the conversion model learning data selection means 11 selects only the learning data for generating a conversion model for text shaping, and the conversion model generation means 12 uses that learning data to generate only the conversion model for text shaping.

The shaping speech recognition result storage unit 31 stores shaping speech recognition result data. Shaping speech recognition result data is information indicating a speech recognition result, obtained by recognizing certain speech data with a certain speech recognition engine, that is the target of shaping in this system.

The speech recognition result shaping means 32 shapes (converts) the shaping speech recognition result data using the conversion model D13 for text shaping stored in the conversion model storage unit 13B, and generates shaped text. Suppose, for example, that a conversion model generated for the purpose of text shaping to improve readability defines a conversion rule to the effect of "replace してます (shitemasu, a colloquial contraction) with しています (shiteimasu)". The speech recognition result shaping means 32 may then scan the input shaping speech recognition result data for places where the rule applies and, where one is found, replace してます with しています, thereby generating the shaped text.

FIG. 10 is an explanatory diagram showing an example of shaping speech recognition result data and the shaped text generated from it. As shown in FIG. 10, when the shaping speech recognition result data 「血痕してます」 is input, for example, 「血痕しています」 is generated as the shaped text. Only the shaping rule is applied here, so the recognition error 血痕 is left as it is.

The shaped text storage unit 33 stores the shaped text generated by the speech recognition result shaping means 32, that is, the text obtained by applying the conversion model for text shaping to the shaping speech recognition result data.

The conversion model for text shaping may be generated in the same way as in the first embodiment. In this embodiment, the speech recognition result shaping means 32 is realized by a CPU or the like operating according to a program. The shaping speech recognition result storage unit 31 and the shaped text storage unit 33 are realized by a storage device such as a memory. These means may be implemented in a device separate from the speech recognition result conversion model generation device 100, or together with it in a single device.

As described above, according to this embodiment, the desired text shaping can be performed efficiently even when the conversions applied to the learning data given to the speech recognition result conversion model generation device 100 include conversions (for example, conversions for speech recognition error correction) other than those desired when preparing text from the shaping speech recognition result data.

Embodiment 4.
Next, a fourth embodiment of the present invention will be described. As an application of the conversion models generated by the speech recognition result conversion model generation device described in the first embodiment, this embodiment describes a speech recognition result conversion system that uses both a conversion model intended for correcting speech recognition errors and a conversion model intended for text shaping. FIG. 11 is a block diagram showing a configuration example of the speech recognition result conversion system of this embodiment. The speech recognition result conversion system 400 shown in FIG. 11 includes the components (11 to 15) of the speech recognition result conversion model generation device 100 shown in FIG. 1, a correction/shaping speech recognition result storage unit 41, the speech recognition result correction means 22, the speech recognition result shaping means 32, and a converted text storage unit 43.

The conversion model storage unit 13 in this embodiment need only store at least a conversion model for speech recognition error correction and a conversion model for text shaping. In FIG. 11 it is shown as two separate units, conversion model storage unit 13A and conversion model storage unit 13B, to make explicit that both models are included, but the two may be realized by a single conversion model storage unit 13. In that case, for example, an identifier may be attached to each model so that its use can be identified.

FIG. 12 is an explanatory diagram showing an example of correction/shaping speech recognition result data and the corrected and shaped text. As shown in FIG. 12, when the correction/shaping speech recognition result data 「血痕してます」 is input, for example, 「結婚してます」 is first generated as the corrected text. This is then passed to the speech recognition result shaping means 32 as shaping speech recognition result data, and 「結婚しています」 is generated as the corrected and shaped text.

The converted text storage unit 43 stores the corrected and shaped text generated by the speech recognition result correction means 22 and the speech recognition result shaping means 32, that is, the text obtained by applying both the conversion model for speech recognition error correction and the conversion model for text shaping to the correction/shaping speech recognition result data.

Although this embodiment shows an example in which both conversion models are used, the user may instead be allowed to choose which conversion model to use. For example, the system may include instruction means that instructs the speech recognition result correction means 22 and the speech recognition result shaping means 32 to perform conversion with a conversion model, and the instruction means may issue control instructions to the two means according to the conversion model the user has specified. The choice need not be limited to speech recognition error correction versus text shaping: the user may also select one model from among several shaping conversion models.
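The combination of identifier-tagged models and user selection can be sketched roughly as below. This assumes, purely for illustration, that each conversion model is a list of string-replacement rules stored under an identifier; `MODEL_STORE`, the identifier strings, and `convert` are hypothetical names, not part of the patent.

```python
# Hypothetical model store (an assumption for illustration): identifier ->
# list of (source, target) replacement rules.
MODEL_STORE = {
    "error_correction": [("血痕", "結婚")],
    "text_shaping": [("してます", "しています")],
}


def convert(text, selected_ids):
    """Apply the models named in selected_ids, in the order given."""
    for model_id in selected_ids:
        for source, target in MODEL_STORE[model_id]:
            text = text.replace(source, target)
    return text


# Correction followed by shaping, as in the FIG. 12 example.
print(convert("血痕してます", ["error_correction", "text_shaping"]))  # 結婚しています
# Shaping only, as in the FIG. 10 example.
print(convert("血痕してます", ["text_shaping"]))  # 血痕しています
```

Passing the selection as an ordered list keeps the correction-then-shaping order of Embodiment 4 explicit while still allowing a single model to be chosen.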

Next, an overview of the present invention will be given. FIG. 13 is a block diagram showing an overview of the present invention. The conversion model generation device 500 shown in FIG. 13 includes application-specific learning data selection means 501 and conversion model generation means 502.

The application-specific learning data selection means 501 (for example, the conversion model learning data selection means 11) uses learning speech recognition result data (for example, learning speech recognition result data D14), which indicates the result of speech recognition processing on certain speech data, and learning formatted text (for example, learning formatted text D15), which indicates the result of converting that speech data into text for a predetermined purpose. It extracts the differing parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text and, based on the appearance tendency of the extracted differing parts, selects learning data for generating a conversion model suited to each predetermined individual use.
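The extraction of differing parts can be illustrated with a short sketch. It uses character-level alignment from Python's standard `difflib` module as a stand-in; the patent does not say that this is the alignment method, and a practical system would more likely align word sequences. The function name `extract_differences` is hypothetical.

```python
from difflib import SequenceMatcher


def extract_differences(recognized, formatted):
    """Return (recognized_part, formatted_part) pairs where the two texts differ."""
    # autojunk=False disables difflib's popular-character heuristic so that
    # long inputs are aligned the same way as short ones.
    matcher = SequenceMatcher(None, recognized, formatted, autojunk=False)
    return [
        (recognized[i1:i2], formatted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


diffs = extract_differences("血痕してます", "結婚しています")
print(diffs)  # [('血痕', '結婚'), ('', 'い')]
```

Note that character-level alignment reports the shaping difference as an inserted い rather than as the pair してます / しています; grouping differences into word-sized units is left out of this sketch.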

The conversion model generation means 502 (for example, the conversion model generation means 12) uses the learning data selected by the application-specific learning data selection means 501 to generate a conversion model suited to each individual use.

The application-specific learning data selection means 501 may use, for example, the number of occurrences of a combination of partial character strings present in a differing part as a criterion for selecting a conversion model.

For example, when the number of occurrences of the combination of partial character strings present in an extracted differing part is larger than a predetermined threshold, the data having that differing part may be selected as learning data for a conversion model whose conversion use is text shaping. Conversely, when the number of occurrences is smaller than the predetermined threshold, the data having that differing part may be selected as learning data for a conversion model whose conversion use is the correction of speech recognition errors contained in a speech recognition result.
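The threshold rule above can be sketched as follows, assuming each differing part is represented as a (recognized string, formatted string) pair and that occurrences are counted over the whole learning set. The function name and the data layout are assumptions; the text does not say what happens when the count equals the threshold, so such pairs are left unassigned here.

```python
from collections import Counter


def split_by_frequency(diff_pairs, threshold):
    """Split differing parts into shaping and correction data by occurrence count."""
    counts = Counter(diff_pairs)
    shaping, correction = [], []
    for pair, count in counts.items():
        if count > threshold:
            shaping.append(pair)      # frequent difference: treated as text shaping
        elif count < threshold:
            correction.append(pair)   # rare difference: treated as a recognition error
        # count == threshold is not covered by the description; left unassigned
    return shaping, correction


pairs = [("してます", "しています")] * 3 + [("血痕", "結婚")]
shaping, correction = split_by_frequency(pairs, threshold=2)
print(shaping)     # [('してます', 'しています')]
print(correction)  # [('血痕', '結婚')]
```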

The application-specific learning data selection means 501 may also use, as a criterion for selecting a conversion model, the result of determining whether, in a differing part, a partial character string present in the speech recognition result text indicated by the learning speech recognition result data is also present in the learning formatted text.

For example, when a partial character string present in the learning formatted text within an extracted differing part does not appear in the learning speech recognition result data, the data having that differing part may be selected as learning data for a conversion model whose conversion use is text shaping. Conversely, when the partial character string present in the learning formatted text also appears in the learning speech recognition result data, the data having that differing part may be selected as learning data for a conversion model whose conversion use is the correction of speech recognition errors contained in a speech recognition result.
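A rough sketch of this presence check follows. It assumes the learning speech recognition result data can be treated as one searchable string and that each differing part is a (recognized string, formatted string) pair; both are simplifying assumptions, and `split_by_presence` is a hypothetical name. Pairs whose formatted side is empty (pure deletions) are skipped because the description does not cover them.

```python
def split_by_presence(diff_pairs, recognition_data):
    """Split differing parts by whether the formatted string occurs in the recognition data."""
    shaping, correction = [], []
    for recognized_part, formatted_part in diff_pairs:
        if not formatted_part:
            continue  # deletion-only difference: not covered by the description
        if formatted_part in recognition_data:
            # The string also appears in the recognition data: treat as error correction.
            correction.append((recognized_part, formatted_part))
        else:
            # The string never appears in the recognition data: treat as text shaping.
            shaping.append((recognized_part, formatted_part))
    return shaping, correction


recognition_data = "血痕してます 来月結婚の予定です"
pairs = [("血痕", "結婚"), ("してます", "しています")]
shaping, correction = split_by_presence(pairs, recognition_data)
print(shaping)     # [('してます', 'しています')]
print(correction)  # [('血痕', '結婚')]
```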

The application-specific learning data selection means 501 may also extract the differing parts based on the likelihood that a text could exist as the speech recognition result text, where that likelihood is indicated by one or more of the following: the N-best list or word graph contained in the learning speech recognition result data, or the part of speech or semantic information, acoustic likelihood, language likelihood, or confidence of the words contained in the speech recognition result.

The application-specific learning data selection means 501 may also determine, for each differing part and based on the appearance tendency of the extracted differing parts, whether the conversion that produced the differing part depends on the purpose of the text or on its creator. By doing so it may select learning data for generating a conversion model suited to at least one of the uses defined as individual uses: correction of speech recognition errors, or text shaping for a certain purpose.

FIG. 14 is a block diagram showing an overview of the speech recognition result conversion system of the present invention. The speech recognition result conversion system 600 shown in FIG. 14 includes speech recognition result correction means 601 in addition to the application-specific learning data selection means 501 and the conversion model generation means 502 shown in FIG. 13.

Correction of speech recognition errors is defined in advance as one of the individual uses for which a conversion model is generated, so the conversion model generation means 502 generates a conversion model for that use. The speech recognition result correction means 601 (for example, the speech recognition result correction means 22) uses this conversion model to convert the given speech recognition result data, thereby correcting the speech recognition errors contained in the speech recognition result.

FIG. 15 is a block diagram showing another configuration example of the speech recognition result conversion system of the present invention. The speech recognition result conversion system 600 shown in FIG. 15 includes speech recognition result shaping means 602 in addition to the application-specific learning data selection means 501 and the conversion model generation means 502 shown in FIG. 13.

Text shaping for a certain purpose is defined in advance as one of the individual uses for which a conversion model is generated, so the conversion model generation means 502 generates a conversion model for that use. The speech recognition result shaping means 602 (for example, the speech recognition result shaping means 32) uses this conversion model to convert the given speech recognition result data, thereby shaping it into text that suits that purpose.

As shown in FIG. 16, the speech recognition result conversion system 600 may include both the speech recognition result correction means 601 and the speech recognition result shaping means 602.

In that case, it is possible to obtain text data in which the speech recognition errors contained in the speech recognition result have been corrected and which has been shaped into text that suits a certain purpose.

The present invention is suitably applicable to speech recognition devices that convert speech data into text using speech recognition technology, to text shaping devices that create minutes from speech data or speech recognition results, and to speech translation devices, car navigation systems, and the like that perform speech understanding and return a desired response.

Description of symbols
100 Speech recognition result conversion model generation device
11 Conversion model learning data selection means
12 Conversion model generation means
13 Conversion model storage unit
13A Conversion model storage unit for speech recognition error correction
13B Conversion model storage unit for text shaping
14 Learning speech recognition result storage unit
15 Learning formatted text storage unit
200 Speech recognition result correction system
21 Correction speech recognition result storage unit
22 Speech recognition result correction means
23 Corrected text storage unit
300 Speech recognition result shaping system
31 Shaping speech recognition result storage unit
32 Speech recognition result shaping means
33 Shaped text storage unit
400 Speech recognition result conversion system
41 Correction/shaping speech recognition result storage unit
43 Converted text storage unit
500 Conversion model generation device
501 Application-specific learning data selection means
502 Conversion model generation means
600 Speech recognition result conversion system
601 Speech recognition result correction means
602 Speech recognition result shaping means

Claims

A conversion model generation device comprising: application-specific learning data selection means for extracting, using learning speech recognition result data indicating a result of speech recognition processing on certain speech data and learning formatted text indicating a result of converting the speech data into text for a predetermined purpose, differing parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and for selecting, based on the appearance tendency of the extracted differing parts, learning data for generating a conversion model suited to each predetermined individual use; and
conversion model generation means for generating the conversion model suited to the individual use, using the learning data selected by the application-specific learning data selection means.

The conversion model generation device according to claim 1, wherein the application-specific learning data selection means uses, as a criterion for selecting a conversion model, the number of occurrences of a combination of partial character strings present in a differing part.

The conversion model generation device according to claim 2, wherein, for an extracted differing part, when the number of occurrences of the combination of partial character strings present in that differing part is larger than a predetermined threshold, the application-specific learning data selection means selects the data having that differing part as learning data for a conversion model whose conversion use is text shaping.

The conversion model generation device according to claim 2 or 3, wherein, for an extracted differing part, when the number of occurrences of the combination of partial character strings present in that differing part is smaller than a predetermined threshold, the application-specific learning data selection means selects the data having that differing part as learning data for a conversion model whose conversion use is the correction of speech recognition errors contained in a speech recognition result.

The conversion model generation device according to any one of claims 1 to 4, wherein the application-specific learning data selection means uses, as a criterion for selecting a conversion model, the result of determining whether, in a differing part, a partial character string present in the speech recognition result text indicated by the learning speech recognition result data is present in the learning formatted text.

The conversion model generation device according to any one of claims 1 to 5, wherein the application-specific learning data selection means extracts the differing parts based on a likelihood that a text could exist as the speech recognition result text, the likelihood being indicated by one or more of: the N-best list or word graph contained in the learning speech recognition result data, or the part of speech or semantic information, acoustic likelihood, language likelihood, or confidence of the words contained in the speech recognition result.

The conversion model generation device according to any one of claims 1 to 6, wherein, when a partial character string present in the learning formatted text within an extracted differing part is not present in the learning speech recognition result data, the application-specific learning data selection means selects the data having that differing part as learning data for a conversion model whose conversion use is text shaping.

The conversion model generation apparatus according to any one of claims 1 to 7, wherein, when a partial character string existing in the learning formatted text in an extracted different part also exists in the learning speech recognition result data, the use-specific learning data selection means selects the data having that different part as learning data for a conversion model whose use is correction of the speech recognition errors included in the recognition result.
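
The presence test in the two claims above can be sketched in Python as follows. The sketch assumes the learning speech recognition result data is available as a list of N-best hypothesis strings; the function name `classify_by_presence`, the label strings, and the sample hypotheses are hypothetical.

```python
def classify_by_presence(formatted_substring, recognition_hypotheses):
    """Classify a different part by looking up its formatted-side substring.

    If the substring appears somewhere in the recognition result data, the
    recognizer could have produced it, so the difference is treated as a
    recognition error. Otherwise it is treated as text shaping.
    An empty substring (a pure deletion) is not handled specially here.
    """
    found = any(formatted_substring in hyp for hyp in recognition_hypotheses)
    return "error_correction" if found else "text_shaping"

# Hypothetical N-best list for one training utterance.
n_best = [
    "i went to there house",
    "i went to their house",
]
label_a = classify_by_presence("their", n_best)       # present in the 2nd hypothesis
label_b = classify_by_presence("Section 1:", n_best)  # absent from every hypothesis
```

A plain substring search is used only for brevity; a word graph or likelihood-based lookup, as listed in the earlier claim, would replace the `in` test.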

The conversion model generation apparatus according to any one of claims 1 to 8, wherein the use-specific learning data selection means determines, based on the appearance tendency of the extracted different parts, whether the conversion process that produced each different part is a process that depends on the purpose for which the text was created or on its creator, and thereby selects learning data for generating a conversion model corresponding to at least a speech recognition error correction use or a text shaping use for a certain purpose, each defined as an individual use.

A speech recognition result conversion system in which correction of speech recognition errors is defined in advance as one of the individual uses for which a conversion model is generated, the system comprising:
use-specific learning data selection means for extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and for selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use;
conversion model generation means for generating a conversion model corresponding to each individual use, using the learning data selected by the use-specific learning data selection means; and
speech recognition result correction means for correcting speech recognition errors included in a speech recognition result by performing conversion processing on given speech recognition result data, using the conversion model for the speech recognition error correction use generated by the conversion model generation means.

A speech recognition result conversion system in which text shaping for a certain purpose is defined in advance as one of the individual uses for which a conversion model is generated, the system comprising:
use-specific learning data selection means for extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and for selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use;
conversion model generation means for generating a conversion model corresponding to each individual use, using the learning data selected by the use-specific learning data selection means; and
speech recognition result shaping means for shaping given speech recognition result data into text that meets the certain purpose by performing conversion processing on it, using the conversion model for the text shaping use for the certain purpose generated by the conversion model generation means.

A speech recognition result conversion system in which correction of speech recognition errors and text shaping for a certain purpose are each defined in advance as individual uses for which a conversion model is generated, the system comprising:
use-specific learning data selection means for extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and for selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use;
speech recognition result correction means for correcting speech recognition errors included in a speech recognition result by performing conversion processing on given speech recognition result data, using the conversion model for the speech recognition error correction use generated by the conversion model generation means; and
speech recognition result shaping means for shaping given speech recognition result data into text that meets the certain purpose by performing conversion processing on it, using the conversion model for the text shaping use for the certain purpose generated by the conversion model generation means.

A conversion model generation method comprising:
a use-specific learning data selection step of extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and of selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use; and
a conversion model generation step of generating a conversion model corresponding to each individual use, using the learning data selected in the use-specific learning data selection step.
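
The extraction of different parts in the selection step can be sketched with Python's standard `difflib` module. This is only an illustration under stated assumptions: both texts are whitespace-tokenized, the alignment is whatever `difflib.SequenceMatcher` produces, and the function name `extract_different_parts` and the sample sentences are hypothetical. The claims do not prescribe a particular alignment algorithm.

```python
import difflib

def extract_different_parts(recognized_text, formatted_text):
    """Return (recognized substring, formatted substring) pairs that differ.

    Both texts are split on whitespace and aligned token by token.
    Every aligned region that is not an exact match is one different part.
    """
    rec = recognized_text.split()
    fmt = formatted_text.split()
    matcher = difflib.SequenceMatcher(a=rec, b=fmt, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            parts.append((" ".join(rec[i1:i2]), " ".join(fmt[j1:j2])))
    return parts

# Hypothetical training pair: recognizer output and its hand-formatted text.
parts = extract_different_parts(
    "uh i went to there house",
    "i went to their house",
)
```

Here `parts` holds the deleted filler `("uh", "")` and the substitution `("there", "their")`. Pairs like these are the input on which the appearance tendency is then evaluated.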

The conversion model generation method according to claim 13, wherein, in the use-specific learning data selection step, the number of occurrences of the combination of partial character strings existing in a different part is used as a criterion for selecting the conversion model.

The conversion model generation method according to claim 13 or 14, wherein, in the use-specific learning data selection step, the result of determining whether a partial character string that exists, in a different part, in the speech recognition result text indicated by the learning speech recognition result data also exists in the learning formatted text is used as a criterion for selecting the conversion model.

The conversion model generation method according to any one of claims 13 to 15, wherein, in the use-specific learning data selection step, the different parts are extracted based on the likelihood that a character string may exist as speech recognition result text, as indicated by one or more of the following included in the learning speech recognition result data: N-best results, a word graph, part-of-speech or semantic information of words included in the speech recognition result, acoustic likelihood, language likelihood, or reliability.

The conversion model generation method according to any one of claims 13 to 16, wherein, in the use-specific learning data selection step, it is determined, based on the appearance tendency of the extracted different parts, whether the conversion process that produced each different part is a process that depends on the purpose for which the text was created or on its creator, and learning data for generating a conversion model corresponding to at least a speech recognition error correction use or a text shaping use for a certain purpose, each defined as an individual use, is thereby selected.

A speech recognition result conversion method comprising:
a use-specific learning data selection step of extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and of selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use;
a conversion model generation step of generating a conversion model corresponding to each individual use, using the learning data selected in the use-specific learning data selection step; and
a speech recognition result conversion step of performing conversion processing on a given speech recognition result according to the individual use, using the conversion model generated in the conversion model generation step.
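
The generation and conversion steps can be pictured with a toy Python sketch. The claims do not fix the form of the conversion model, so the lookup table used here is purely a stand-in assumption, as are the names `generate_conversion_model` and `convert` and the sample pairs. The sketch handles single-token sources only.

```python
from collections import Counter, defaultdict

def generate_conversion_model(selected_pairs):
    """Build a toy model: each source token maps to its most frequent target."""
    by_source = defaultdict(Counter)
    for source, target in selected_pairs:
        by_source[source][target] += 1
    return {src: targets.most_common(1)[0][0] for src, targets in by_source.items()}

def convert(tokens, model):
    """Apply the toy model token by token, leaving unknown tokens unchanged."""
    return [model.get(token, token) for token in tokens]

# Hypothetical learning data already selected for the error correction use.
correction_pairs = [("there", "their"), ("there", "their"), ("there", "three")]
model = generate_conversion_model(correction_pairs)
result = convert("i went to there house".split(), model)
```

In this example `model` is `{"there": "their"}` and `result` is the token list for "i went to their house". One such model would be built per individual use from the learning data selected for that use.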

A conversion model generation program for causing a computer to execute:
a use-specific learning data selection process of extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and of selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use; and
a conversion model generation process of generating a conversion model corresponding to each individual use, using the learning data selected in the use-specific learning data selection process.

A speech recognition result conversion program for causing a computer to execute:
a use-specific learning data selection process of extracting, using learning speech recognition result data indicating the result of speech recognition processing on certain speech data and learning formatted text indicating the result of converting that speech data into text for a predetermined purpose, the different parts between the speech recognition result text indicated by the learning speech recognition result data and the learning formatted text, and of selecting, based on the appearance tendency of the extracted different parts, learning data for generating a conversion model corresponding to each predefined individual use;
a conversion model generation process of generating a conversion model corresponding to each individual use, using the learning data selected in the use-specific learning data selection process; and
a speech recognition result conversion process of performing conversion processing on a given speech recognition result according to the individual use, using the conversion model generated in the conversion model generation process.