JP6735392B1

JP6735392B1 - Audio text conversion device, audio text conversion method, and audio text conversion program

Info

Publication number: JP6735392B1
Application number: JP2019096723A
Authority: JP
Inventors: 喜美子川嶋; 安永　健治; 健治安永
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2020-08-05
Anticipated expiration: 2039-05-23
Also published as: JP2020190671A

Abstract

Problem: To output speech recognition results with higher accuracy. Solution: A noise suppression unit 11 suppresses noise in an original speech waveform f1. An utterance section detection unit 12 detects utterance sections tj from the noise-suppressed speech waveform f2. A speech waveform cutting unit 13 cuts the original speech waveform f1 and the noise-suppressed speech waveform f2 at each utterance section tj to obtain section speech waveforms f1_tj and f2_tj. A speech recognition unit 14 uses each of a plurality of speech recognition engines ei to recognize both section speech waveforms f1_tj and f2_tj, from before and after noise suppression, and takes whichever result has more characters as the speech recognition result Rij of engine ei for utterance section tj. A recognition result correction unit 15 compares the speech recognition results Rij for each utterance section tj and corrects the speech recognition result. Selected drawing: FIG. 1

Description

The present invention relates to a technique for improving speech recognition accuracy.

Speech recognition technology has come into wide use in recent years. For example, smart speakers, which are network-connected speakers with a built-in microphone that can be operated by speech recognition, have become widespread. Various companies provide speech recognition engines, which has made it easy to convert speech into text.

Noise suppression techniques for improving speech recognition accuracy have also been studied (see, for example, Non-Patent Document 1).

Non-Patent Document 1: "Speech processing technology for improving speech recognition accuracy in noisy environments", Nippon Telegraph and Telephone Corporation, [retrieved April 22, 2019], Internet <URL: http://www.ntt.co.jp/svlab/activity/category_2/product2_29.html>

The characteristics of recognition results differ from one speech recognition engine to another, and each engine has its strengths and weaknesses. Because the training data and recognition algorithms differ between engines, some engines are more accurate on well-formed, written-style speech, while others are more accurate on casual, conversational speech. Some engines output only the portions they estimate to have been recognized with high accuracy, while others output everything they were able to recognize.

Moreover, noise suppression improves recognition accuracy in some portions but not in others, so applying it does not necessarily raise accuracy. For example, when a noise suppression technique is applied, recognition accuracy improves in noisy portions because the noise is suppressed. In portions without noise, however, the suppression processing can degrade the sound quality and lower recognition accuracy.

The present invention has been made in view of the above, and its object is to output speech recognition results with higher accuracy.

An audio text conversion device according to the present invention comprises: a noise suppression unit that suppresses noise in an input speech waveform; a speech recognition unit that, with each of a plurality of speech recognition engines, obtains a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing the noise-suppressed speech waveform, and selects whichever of the first and second speech recognition results has more characters as the speech recognition result of that engine; and a recognition result correction unit that compares the speech recognition results of the plurality of speech recognition engines with one another and corrects the speech recognition result.

An audio text conversion method according to the present invention comprises: a step of suppressing noise in an input speech waveform; a step of obtaining, with each of a plurality of speech recognition engines, a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing the noise-suppressed speech waveform; a step of selecting whichever of the first and second speech recognition results has more characters as the speech recognition result of that engine; and a step of comparing the speech recognition results of the plurality of speech recognition engines with one another and correcting the speech recognition result.

According to the present invention, speech recognition results with higher accuracy can be output.

FIG. 1 is a functional block diagram showing the configuration of the audio text conversion device of this embodiment.
FIG. 2 is a flowchart showing the processing flow of the audio text conversion device of this embodiment.
FIG. 3 is a flowchart showing the flow of the speech waveform cutting process.
FIG. 4 is a diagram showing the relationship between the length of the silent interval at the head of a speech waveform and recognition accuracy.
FIG. 5 is a flowchart showing the flow of the speech recognition process.
FIG. 6 is a flowchart showing the flow of the recognition result correction process.
FIG. 7 is a diagram showing an example in which speech recognition results are divided into morphemes and mismatched portions are extracted.
FIG. 8 is a diagram showing an example of the correction state flag.

Embodiments of the present invention are described below with reference to the drawings.

(Configuration of the audio text conversion device)
FIG. 1 is a functional block diagram showing the configuration of the audio text conversion device 1 of this embodiment. The audio text conversion device 1 takes speech as input and outputs text that is the result of recognizing that speech. In addition to the text, the device may output a correction state indicating how the speech recognition result was corrected.

The audio text conversion device 1 shown in FIG. 1 includes a noise suppression unit 11, an utterance section detection unit 12, a speech waveform cutting unit 13, a speech recognition unit 14, and a recognition result correction unit 15. These units may be implemented by a computer having a processor, a storage device, and the like, with the processing of each unit executed by a program. The program is stored in a storage device of the audio text conversion device 1; it can also be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network.

The noise suppression unit 11 takes as input the original speech waveform f1 to be recognized, applies noise suppression, and outputs a noise-suppressed speech waveform f2. The noise suppression can use, for example, the speech processing technology of Non-Patent Document 1 or the techniques implemented in noise-canceling earphones. The original speech waveform f1 and the noise-suppressed speech waveform f2 are input to the speech waveform cutting unit 13.

The utterance section detection unit 12 takes the noise-suppressed speech waveform f2 as input and detects the utterance sections tj (j = 1, 2, ..., m) in which a person is speaking. A VAD (Voice Activity Detection) library published by Google or others can be used for this detection. The utterance section detection unit 12 may instead detect utterance sections from the original speech waveform f1.
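
The detection step can be illustrated with a very small stand-in. The following sketch is not the VAD library the text refers to; it swaps in a plain energy threshold so that the idea of turning a waveform into utterance sections tj is visible. The function name `detect_speech_sections` and the parameters `frame_len` and `threshold` are hypothetical, and the waveform is assumed to be a list of float samples.

```python
from typing import List, Tuple


def detect_speech_sections(samples: List[float], frame_len: int = 4,
                           threshold: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames whose mean
    absolute amplitude is at or above `threshold` (end is exclusive).

    Assumption: a simple energy threshold, used here only as a stand-in
    for a real VAD library.
    """
    sections = []
    start = None
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        energy = sum(abs(x) for x in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = pos                  # a speech run begins here
        elif start is not None:
            sections.append((start, pos))    # the run ended before this frame
            start = None
    if start is not None:
        sections.append((start, len(samples)))
    return sections


if __name__ == "__main__":
    wave = [0.0] * 4 + [0.5] * 8 + [0.0] * 4 + [0.3] * 4
    print(detect_speech_sections(wave))
```

A production system would more likely rely on a trained VAD, since a fixed threshold is sensitive to recording level and background noise.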

The speech waveform cutting unit 13 cuts the original speech waveform f1 and the noise-suppressed speech waveform f2 at each utterance section tj and prepends a silent waveform to each cut-out waveform. It outputs to the speech recognition unit 14 the section speech waveform f1_tj, cut from f1 for each utterance section tj with silence prepended, and the section speech waveform f2_tj, cut from f2 in the same way.

The speech recognition unit 14 uses a plurality of speech recognition engines ei (i = 1, 2, ..., n) to recognize, for each utterance section tj, the section speech waveforms f1_tj and f2_tj from before and after noise suppression. Of the two recognition results, it takes the one with more characters as the speech recognition result Rij of engine ei for utterance section tj. In other words, for each utterance section tj the speech recognition unit 14 outputs the speech recognition results Rij of the plurality of engines ei.

The speech recognition unit 14 may contain the plurality of speech recognition engines ei itself or may use external speech recognition services. Any arrangement is acceptable as long as several different engines are used. For an engine that outputs multiple results, the result with the highest confidence is adopted. Alternatively, several of the highest-confidence results may be output and compared by the recognition result correction unit 15 in the subsequent stage.
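
Handling an engine that returns several candidates can be sketched as follows. This is a minimal illustration that assumes each candidate arrives as a `(text, confidence)` pair; real engines expose this in their own formats, and the names `best_hypothesis` and `top_k` are hypothetical.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence); assumed shape


def best_hypothesis(nbest: List[Hypothesis]) -> str:
    """Return the text of the highest-confidence candidate ('' if none)."""
    if not nbest:
        return ""
    text, _ = max(nbest, key=lambda pair: pair[1])
    return text


def top_k(nbest: List[Hypothesis], k: int) -> List[str]:
    """Return the texts of the k highest-confidence candidates, best first,
    for the case where several candidates are passed on for comparison."""
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    candidates = [("aaa", 0.4), ("bbb", 0.9), ("ccc", 0.7)]
    print(best_hypothesis(candidates))
    print(top_k(candidates, 2))
```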

For each utterance section tj, the recognition result correction unit 15 compares the speech recognition results Rij of the individual engines ei to identify the portions where they disagree, and for each mismatched portion adopts the result output by the larger number of engines. When the speech input to the audio text conversion device 1 accompanies video or slides, the recognition result correction unit 15 compares the results Rij at the mismatched portions with character recognition results from the video or slides and corrects them to the most suitable content. The character recognition results may be extracted by another device that processes the video and then input to the audio text conversion device 1, or the audio text conversion device 1 may take the video as input and extract them itself.

In addition to the corrected text, the recognition result correction unit 15 outputs the correction state of the mismatched portions of the results Rij. For example, it attaches to each corrected mismatch information indicating whether it was corrected by comparing speech recognition results or by comparison with character recognition.

(Operation of the audio text conversion device)
Next, the operation of the audio text conversion device 1 of this embodiment is described.

FIG. 2 is a flowchart showing the processing flow of the audio text conversion device 1 of this embodiment.

In step S1, the noise suppression unit 11 applies noise suppression to the original speech waveform f1 and outputs the noise-suppressed speech waveform f2.

In step S2, the utterance section detection unit 12 detects the utterance sections tj from the noise-suppressed speech waveform f2.

In step S3, the speech waveform cutting unit 13 cuts the utterance sections tj out of both the original speech waveform f1 and the noise-suppressed speech waveform f2, and prepends a silent waveform to each of the resulting section speech waveforms f1_tj and f2_tj. The speech waveform cutting process is described in detail later.

If the original speech waveform f1 is short, steps S2 and S3 may be skipped and the original speech waveform f1 and the noise-suppressed speech waveform f2 passed directly to the speech recognition unit 14.

In step S4, the speech recognition unit 14 recognizes each of the section speech waveforms f1_tj and f2_tj with the plurality of speech recognition engines ei and obtains the speech recognition results Rij. The speech recognition process is described in detail later.

In step S5, the recognition result correction unit 15 compares the speech recognition results Rij of the plurality of engines ei, adopts the appropriate recognition result, and outputs text. The recognition result correction unit 15 may also correct the speech recognition result using character recognition results related to the original speech. The recognition result correction process is described in detail later.
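
The order of steps S1 to S5 can be summarized as a small driver function. This is only a sketch: each stage is passed in as a callable so the block stays self-contained, and the name `transcribe`, the stage parameter names, and the `min_samples` cutoff for "short" input are all hypothetical, since the text does not say how short is short.

```python
from typing import Callable, List, Tuple

Waveform = List[float]
Section = Tuple[int, int]


def transcribe(f1: Waveform,
               suppress: Callable,
               detect: Callable,
               cut: Callable,
               recognize: Callable,
               correct: Callable,
               min_samples: int = 0):
    """Run the five stages in order; stage implementations are injected."""
    f2 = suppress(f1)                       # S1: noise suppression
    if len(f1) < min_samples:
        # Short input: skip S2 and S3 and pass both whole waveforms on.
        pieces = [(f1, f2)]
    else:
        sections = detect(f2)               # S2: utterance section detection
        pieces = cut(f1, f2, sections)      # S3: cutting (and silence padding)
    results = recognize(pieces)             # S4: recognition by every engine
    return correct(results)                 # S5: comparison and correction


if __name__ == "__main__":
    # Trivial stand-in stages, for illustration only.
    out = transcribe(
        [1.0, 2.0, 3.0, 4.0],
        suppress=lambda f: [x * 0.5 for x in f],
        detect=lambda f: [(0, 2), (2, 4)],
        cut=lambda a, b, secs: [(a[s:e], b[s:e]) for s, e in secs],
        recognize=lambda pieces: [len(p[0]) for p in pieces],
        correct=lambda r: r,
    )
    print(out)
```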

(Speech waveform cutting process)
FIG. 3 is a flowchart showing the flow of the speech waveform cutting process. The speech waveform cutting unit 13 takes the original speech waveform f1, the noise-suppressed speech waveform f2, and the utterance sections tj as input and executes the cutting process.

In step S31, the speech waveform cutting unit 13 cuts utterance section tj out of the original speech waveform f1.

In step S32, the speech waveform cutting unit 13 cuts utterance section tj out of the noise-suppressed speech waveform f2.

In step S33, the speech waveform cutting unit 13 prepends a silent waveform to each of the waveforms cut from the original speech waveform f1 and the noise-suppressed speech waveform f2 at utterance section tj. It outputs the section speech waveform f1_tj, which is the cut from f1 with silence prepended, and the section speech waveform f2_tj, which is the cut from f2 with silence prepended.

As shown in FIG. 4, recognition accuracy improves when the silent interval before the utterance is at least a certain length. The speech waveform cutting unit 13 therefore determines in advance a silence duration at which recognition accuracy saturates, and prepends a silent interval of that duration to the cut-out section speech waveforms f1_tj and f2_tj.

In step S34, the speech waveform cutting unit 13 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S31 and handles the next utterance section tj+1. When all utterance sections have been cut out, the speech waveform cutting process ends.
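
Steps S31 to S34 can be sketched in a few lines. The sketch assumes waveforms are lists of float samples, that sections are `(start, end)` sample indices with an exclusive end, and that silence is represented by zero-valued samples. The name `cut_with_leading_silence` and the parameter `silence_len` are hypothetical; the text only says that the silence duration is fixed in advance at a value where accuracy saturates.

```python
from typing import List, Tuple

Waveform = List[float]
Section = Tuple[int, int]


def cut_with_leading_silence(f1: Waveform, f2: Waveform,
                             sections: List[Section],
                             silence_len: int) -> List[Tuple[Waveform, Waveform]]:
    """For each utterance section, cut the same span from f1 and f2 and
    prepend `silence_len` zero samples to both cuts."""
    silence = [0.0] * silence_len
    pieces = []
    for start, end in sections:             # S34: loop over all sections
        f1_tj = silence + f1[start:end]     # S31 + S33
        f2_tj = silence + f2[start:end]     # S32 + S33
        pieces.append((f1_tj, f2_tj))
    return pieces


if __name__ == "__main__":
    original = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    suppressed = [0.0, 0.2, 0.3, 0.0, 0.5, 0.6]
    for pair in cut_with_leading_silence(original, suppressed, [(1, 3), (4, 6)], 2):
        print(pair)
```

Cutting both waveforms with the same section boundaries keeps f1_tj and f2_tj time-aligned, which is what makes the later per-section comparison meaningful.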

(Speech recognition process)
FIG. 5 is a flowchart showing the flow of the speech recognition process. The speech recognition unit 14 takes the section speech waveforms f1_tj and f2_tj from before and after noise suppression as input and, using each of the plurality of speech recognition engines, obtains a speech recognition result for each utterance section.

In step S41, the speech recognition unit 14 selects one speech recognition engine ei from the plurality of speech recognition engines.

In step S42, the speech recognition unit 14 uses the engine ei selected in step S41 to recognize the section speech waveform f1_tj cut from the original speech waveform f1.

In step S43, the speech recognition unit 14 uses the engine ei selected in step S41 to recognize the section speech waveform f2_tj cut from the noise-suppressed speech waveform f2.

In step S44, the speech recognition unit 14 compares the character counts of the recognition results obtained in steps S42 and S43 and adopts the result with more characters as the speech recognition result Rij of engine ei for utterance section tj. Comparing the recognition results for the waveforms before and after noise suppression accounts for the fact that noise suppression improves recognition accuracy in some portions and not in others. Adopting the result with more recognized characters helps prevent portions of speech from being left unrecognized.

In step S45, the speech recognition unit 14 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S42 and handles the next utterance section tj+1.

In step S46, the speech recognition unit 14 determines whether all speech recognition engines have been used. If an unused engine remains, the process returns to step S41, selects the next engine ei+1, and processes the utterance sections in order from the first. The processing of steps S42 to S45 may also be executed in parallel across the speech recognition engines.
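
The nested loop of steps S41 to S46 can be sketched as below. Each engine is modeled as a callable that maps a waveform to text, which is an assumption made for illustration; real engines are typically called through their own SDKs or web APIs. The name `recognize_all` is hypothetical, and keeping the original-waveform result on a tie is also an assumption, because the text does not say what happens when both results have the same length.

```python
from typing import Callable, List, Sequence, Tuple


def recognize_all(engines: Sequence[Callable[[object], str]],
                  pieces: Sequence[Tuple[object, object]]) -> List[List[str]]:
    """Return results[i][j]: the result Rij of engine i for section j."""
    results = []
    for engine in engines:                   # S41 / S46: each engine in turn
        row = []
        for f1_tj, f2_tj in pieces:          # S45: each utterance section
            r1 = engine(f1_tj)               # S42: original waveform
            r2 = engine(f2_tj)               # S43: noise-suppressed waveform
            # S44: keep the longer text. Tie -> r1 (assumption, see lead-in).
            row.append(r2 if len(r2) > len(r1) else r1)
        results.append(row)
    return results


if __name__ == "__main__":
    # Fake engines that look up canned text by a waveform label.
    engine_a = {"f1_t1": "hello", "f2_t1": "hello world"}.get
    engine_b = {"f1_t1": "hello word", "f2_t1": "hello"}.get
    print(recognize_all([engine_a, engine_b], [("f1_t1", "f2_t1")]))
```

Because the inner loop for one engine does not depend on any other engine, the outer loop is the natural place to parallelize, as the text notes.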

(Recognition result correction process)
FIG. 6 is a flowchart showing the flow of the recognition result correction process. For each utterance section tj, the recognition result correction unit 15 compares the speech recognition results Rij of the individual engines ei and corrects the speech recognition result based on the comparison.

In step S51, for utterance section tj, the recognition result correction unit 15 compares the speech recognition results of the individual engines and extracts the mismatched portions. Specifically, it divides each speech recognition result Rij into morphemes using a tool such as MeCab or Juman, and then uses a library such as difflib to compare the results across engines morpheme by morpheme and extract the mismatched portions.

FIG. 7 shows an example in which speech recognition results are divided into morphemes and the mismatched portions are extracted. The example shows the results of six speech recognition engines e1 to e6 for utterance section tj, divided into morphemes. For utterance section tj, engines e1 to e3 output 「私は山に登り」 ("I climb the mountain"), engines e4 and e5 output 「わしは山に乗り」, and engine e6 outputs 「私は山に乗り」. When the results are divided into morphemes and compared, 「私」 (watashi, "I") versus 「わし」 (washi), and 「登り」 (nobori, "climb") versus 「乗り」 (nori, "ride"), are extracted as mismatched portions.
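
The morpheme-level comparison of step S51 can be illustrated with Python's standard `difflib`. The sketch assumes the results have already been split into morphemes (for example by MeCab or Juman) and are given as lists of strings using the segmentation described for FIG. 7; the function name `mismatches` is hypothetical.

```python
import difflib
from typing import List, Tuple


def mismatches(a: List[str], b: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the morpheme spans where sequences a and b differ, as
    (span_from_a, span_from_b) pairs in order of appearance."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


if __name__ == "__main__":
    e1 = ["私", "は", "山", "に", "登り"]    # engines e1 to e3
    e4 = ["わし", "は", "山", "に", "乗り"]  # engines e4 and e5
    e6 = ["私", "は", "山", "に", "乗り"]    # engine e6
    print(mismatches(e1, e4))
    print(mismatches(e1, e6))
```

`get_opcodes()` also reports pure insertions and deletions, so the same function covers cases where one engine drops or adds a morpheme; in those cases one side of the pair is an empty list.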

In step S52, for each mismatched portion, the recognition result correction unit 15 adopts the result output by the larger number of speech recognition engines. In the example of FIG. 7, for the portion where 「私」 and 「わし」 disagree, more engines recognized 「私」 than recognized 「わし」, so the recognition result correction unit 15 adopts 「私」. For the portion where 「登り」 and 「乗り」 disagree, the same number of engines output each candidate, so either may be adopted.
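
The vote in step S52 can be sketched with `collections.Counter`. The sketch assumes every engine's morpheme list has the same length and is already aligned position by position, which holds for the FIG. 7 example but not in general. The name `vote` is hypothetical, and breaking ties in favor of the candidate seen first is an arbitrary choice, consistent with the text's statement that either candidate may be adopted.

```python
from collections import Counter
from typing import List, Tuple


def vote(aligned: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Majority-vote each position across engines.

    Returns (merged, ties): the winning morpheme per position, and the
    positions where the top two candidates had equal counts.
    """
    merged, ties = [], []
    for pos, column in enumerate(zip(*aligned)):
        ranked = Counter(column).most_common()   # equal counts keep first-seen order
        merged.append(ranked[0][0])
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            ties.append(pos)                     # candidate for step S53
    return merged, ties


if __name__ == "__main__":
    results = (
        [["私", "は", "山", "に", "登り"]] * 3      # e1 to e3
        + [["わし", "は", "山", "に", "乗り"]] * 2  # e4, e5
        + [["私", "は", "山", "に", "乗り"]]        # e6
    )
    print(vote(results))
```

Reporting the tied positions separately makes it straightforward to hand exactly those positions to a second source of evidence, such as the character recognition comparison of step S53.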

In step S53, for each mismatched portion, the recognition result correction unit 15 compares the character recognition result with the candidate recognition results and adopts the more appropriate candidate. For example, it obtains character recognition results from the video or slides over an interval that includes 10 seconds before and after utterance section tj, compares the semantic vector of the character recognition result with that of each candidate, and adopts the candidate whose meaning is closest to the character recognition result. Semantic vectors can be derived with a vectorization method such as word2vec. In the example of FIG. 7, if the text 「山登り」 (yamanobori, "mountain climbing") is obtained from the video, the recognition result correction unit 15 adopts 「登り」 for the portion where 「登り」 and 「乗り」 disagree.
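
The semantic comparison in step S53 can be illustrated with cosine similarity. The two-dimensional vectors below are made-up toy values chosen only so the example runs; in practice they would come from a trained model such as word2vec. The names `cosine` and `pick_by_context` are hypothetical, and cosine similarity itself is an assumed choice of measure, since the text only says that semantic vectors are compared.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_by_context(candidates: List[str], context_word: str,
                    vectors: Dict[str, Sequence[float]]) -> str:
    """Return the candidate whose vector is most similar to the vector of
    the word obtained by character recognition."""
    ctx = vectors[context_word]
    return max(candidates, key=lambda w: cosine(vectors[w], ctx))


if __name__ == "__main__":
    # Toy vectors (assumption): not taken from any real embedding model.
    toy_vectors = {
        "山登り": [0.9, 0.1],
        "登り": [0.8, 0.2],
        "乗り": [0.1, 0.9],
    }
    print(pick_by_context(["登り", "乗り"], "山登り", toy_vectors))
```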

Steps S52 and S53 may be performed in the reverse order. If steps S52 and S53 both correct the same mismatched portion, the correction with the higher confidence may be adopted.

In step S54, the recognition result correction unit 15 sets a correction state flag based on what was corrected in steps S52 and S53. FIG. 8 shows an example of the correction state flag. In the example of FIG. 8, the flag is set to 1 if the speech recognition result was not corrected in step S52 or step S53, to 2 if it was corrected in step S52 based on the comparison among speech recognition results, and to 3 if it was corrected in step S53 based on the comparison with the character recognition result. The flags are not limited to these.
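
The flag values of the FIG. 8 example map naturally onto an enumeration. In this sketch the member names and the helper `correction_state` are hypothetical, and reporting flag 3 when both steps corrected the same portion is an assumption made only to keep the function total; the text leaves that case open.

```python
from enum import IntEnum


class CorrectionState(IntEnum):
    """Flag values from the FIG. 8 example."""
    UNCORRECTED = 1                 # no correction in S52 or S53
    BY_ENGINE_COMPARISON = 2        # corrected in S52
    BY_CHARACTER_RECOGNITION = 3    # corrected in S53


def correction_state(corrected_in_s52: bool,
                     corrected_in_s53: bool) -> CorrectionState:
    """Derive the flag from what steps S52 and S53 did."""
    if corrected_in_s53:
        # Assumption: S53 takes precedence when both steps made a correction.
        return CorrectionState.BY_CHARACTER_RECOGNITION
    if corrected_in_s52:
        return CorrectionState.BY_ENGINE_COMPARISON
    return CorrectionState.UNCORRECTED


if __name__ == "__main__":
    print(int(correction_state(False, False)))
    print(int(correction_state(True, False)))
    print(int(correction_state(False, True)))
```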

In step S55, for utterance section tj, the recognition result correction unit 15 outputs the text Tj of the speech recognition result together with the correction state flag fj set in step S54.

In step S56, the recognition result correction unit 15 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S51 and handles the next utterance section tj+1.

As described above, in this embodiment the noise suppression unit 11 suppresses noise in the original speech waveform f1; the utterance section detection unit 12 detects utterance sections tj from the noise-suppressed speech waveform f2; the speech waveform cutting unit 13 cuts the original speech waveform f1 and the noise-suppressed speech waveform f2 at each utterance section tj to obtain the section speech waveforms f1_tj and f2_tj; the speech recognition unit 14 recognizes both section speech waveforms f1_tj and f2_tj with each of the plurality of speech recognition engines ei and takes the result with more characters as the speech recognition result Rij of engine ei for utterance section tj; and the recognition result correction unit 15 compares the results Rij for each utterance section tj and corrects the speech recognition result. This improves speech recognition accuracy in a way that accounts for whether noise suppression helped and for the strengths and weaknesses of each engine.

According to this embodiment, the speech waveform cutting unit 13 prepends a silent waveform to the section speech waveforms f1_tj and f2_tj, which improves the accuracy of recognizing them.

According to this embodiment, the recognition result correction unit 15 corrects the speech recognition result based on character recognition results extracted from the video accompanying the original speech waveform, which yields a speech recognition result that fits the meaning of the speech.

According to this embodiment, the recognition result correction unit 15 outputs a correction state flag indicating how the speech recognition result was corrected, which makes it possible to judge the validity of the speech recognition result.

Description of symbols
1 ... audio text conversion device
11 ... noise suppression unit
12 ... utterance section detection unit
13 ... speech waveform cutting unit
14 ... speech recognition unit
15 ... recognition result correction unit

Claims

An audio text conversion device comprising:
a noise suppression unit that suppresses noise in an input speech waveform;
a speech recognition unit that, with each of a plurality of speech recognition engines, obtains a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing a noise-suppressed speech waveform in which the noise has been suppressed, and selects whichever of the first speech recognition result and the second speech recognition result has more characters as the speech recognition result of that speech recognition engine; and
a recognition result correction unit that compares the speech recognition results of the plurality of speech recognition engines with one another and corrects the speech recognition result.

The audio text conversion device according to claim 1, further comprising:
an utterance section detection unit that detects utterance sections from the speech waveform; and
a speech waveform cutting unit that cuts the speech waveform and the noise-suppressed speech waveform at each utterance section and adds a silent waveform to the head of each section speech waveform cut out for each utterance section,
wherein the speech recognition unit recognizes, for each utterance section, the section speech waveforms cut from the speech waveform and from the noise-suppressed speech waveform.

The audio text conversion device according to claim 1 or 2, wherein the recognition result correction unit corrects the speech recognition result based on a character recognition result extracted from video accompanying the speech waveform.

The audio text conversion device according to any one of claims 1 to 3, wherein the recognition result correction unit outputs information indicating how the speech recognition result was corrected.

An audio text conversion method comprising:
a step of suppressing noise in an input speech waveform;
a step of obtaining, with each of a plurality of speech recognition engines, a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing a noise-suppressed speech waveform in which the noise has been suppressed;
a step of selecting whichever of the first speech recognition result and the second speech recognition result has more characters as the speech recognition result of that speech recognition engine; and
a step of comparing the speech recognition results of the plurality of speech recognition engines with one another and correcting the speech recognition result.

An audio text conversion program for causing a computer to operate as each unit of the audio text conversion device according to any one of claims 1 to 4.