JP2017058507A

JP2017058507A - Speech recognition device, speech recognition method, and program

Info

Publication number: JP2017058507A
Application number: JP2015182917A
Authority: JP
Inventors: 太一浅見; Taichi Asami; 厚志安藤; Atsushi Ando
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-09-16
Filing date: 2015-09-16
Publication date: 2017-03-23
Anticipated expiration: 2035-09-16
Also published as: JP6495792B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that can divide a speech recognition result into sentences without using a sentence boundary discriminator. SOLUTION: An utterance is defined as a speech section between pauses of a prescribed length or longer. The speech recognition device includes a sentence boundary detection unit that uses (i) an utterance-detected speech signal, that is, a speech signal in which utterances have been detected, (ii) a speech recognition result, including written forms and parts of speech, generated by speech recognition of the utterance-detected speech signal, and (iii) the start and end times of the utterances or pauses in the utterance-detected speech signal. The unit designates some or all of the pauses in the utterance-detected speech signal that are equal to or longer than a predetermined shortest pause length as sentence boundary candidates, classifies the candidates into a plurality of clusters on the basis of their feature values, and detects as sentence boundaries some or all of the candidates in any cluster that contains a candidate whose pause length is greater than or equal to a predetermined pause length threshold. SELECTED DRAWING: Figure 4

Description

The present invention relates to a speech recognition device, a speech recognition method, and a program that recognize an input speech signal and output the speech recognition result divided into sentences.

When an input speech signal is converted into text by speech recognition, having the recognition result divided into sentences makes it easier to obtain correct results from sentence-level natural language processing (for example, document summarization) applied to that result.

Non-Patent Document 1 discloses a conventional technique for outputting a speech recognition result divided into sentences (hereinafter, a speech recognition result with sentence boundaries). The speech recognition device of Non-Patent Document 1 is described below with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of the speech recognition device 9 of Non-Patent Document 1. FIGS. 2 and 3 are flowcharts showing its operation: FIG. 2 shows the discriminator learning operation, and FIG. 3 shows the speech recognition operation.

As shown in FIG. 1, the speech recognition device 9 of Non-Patent Document 1 includes a sentence-boundary-labeled text corpus storage unit 90, a sentence boundary discriminator learning unit 91, a speech section detection unit 92, a speech recognition unit 93, and a sentence boundary identification unit 94. The sentence-boundary-labeled text corpus storage unit 90 stores a large text corpus annotated with the correct sentence boundary positions (sentence boundary labels), hereinafter called the sentence-boundary-labeled text corpus. As shown in FIG. 2, the sentence boundary discriminator learning unit 91 learns, from the sentence-boundary-labeled text corpus, a sentence boundary discriminator used to identify sentence boundaries (S91). Any existing classifier capable of binary classification, such as a Support Vector Machine (SVM) or a neural network, can serve as the sentence boundary discriminator. The sentence boundary discriminator learning unit 91 trains it with an existing algorithm appropriate to the type of classifier. Step S91 is assumed to have been executed in advance, before steps S92 to S94 described below.

The speech section detection unit 92 detects utterances in the speech signal, where an utterance is a speech section sandwiched between pauses (hereinafter also called pause sections) of at least a predetermined pause length L1 (for example, about 500 ms), and outputs an utterance-detected speech signal (S92). The speech section detection unit 92 detects the pauses and utterances in the speech signal using existing VAD (voice activity detection) technology. The speech recognition unit 93 performs speech recognition on the utterance-detected speech signal and outputs a speech recognition result for each utterance (S93).

In the simplest method, the speech recognition result obtained in step S93 is treated as is as the speech recognition result with sentence boundaries; that is, each utterance is regarded as one sentence. In practice, however, several sentences may be spoken in one breath, so that one utterance contains several sentences, and a breath (a pause of at least the predetermined pause length L1) may fall in the middle of a sentence, so that one sentence is split across several utterances. Sentence boundaries therefore have to be detected anew by the sentence boundary identification unit 94 described next.

Using the sentence boundary discriminator learned in advance in step S91, the sentence boundary identification unit 94 takes the written forms and parts of speech of the words immediately before and after each word boundary in the speech recognition result as features, identifies whether that word boundary is a sentence boundary, and outputs a speech recognition result with sentence boundaries (S94).

Non-Patent Document 1: 祖父江翔、山本けい子、田村哲嗣、速水悟、「音声認識結果の文境界推定における識別モデルの評価」、言語処理学会第15回年次大会発表論文集、一般社団法人言語処理学会、平成21年3月、pp.582-585 (Sho Sobue, Keiko Yamamoto, Satoshi Tamura, Satoru Hayamizu, "Evaluation of Discriminative Models for Sentence Boundary Estimation in Speech Recognition Results", Proceedings of the 15th Annual Meeting of the Association for Natural Language Processing, March 2009, pp. 582-585)

Because the prior art relies on a sentence boundary discriminator learned in advance, it cannot correctly identify sentence boundaries whose features (the written forms and parts of speech of the immediately preceding and following words) differ from those of the sentence boundaries in the text corpus used for training. Sentence boundary features vary widely, particularly in conversational speech between people. For example, expressions at sentence boundaries change with how familiar the speaker is with the conversation partner, whether the setting is formal, and each speaker's habitual sentence-final expressions. Covering all such variations in advance is difficult. As a result, sentence boundary identification accuracy can fall, for example with almost no sentence boundaries being detected in a particular user's speech, and the system becomes less convenient.

In addition, a sentence boundary discriminator is generally trained on data in which each word boundary of a speech recognition result text has been manually given a label indicating whether it is a sentence boundary (a sentence boundary label). Because speech recognition results contain recognition errors, assigning correct sentence boundary labels requires listening to the original speech while labeling. This large workload increases the cost of building the system.

Accordingly, an object of the present invention is to provide a speech recognition device that can divide a speech recognition result into sentences without using a sentence boundary discriminator.

The speech recognition device of the present invention includes a sentence boundary detection unit. An utterance is defined as a speech section sandwiched between pauses of at least a predetermined pause length. The sentence boundary detection unit uses an utterance-detected speech signal, which is a speech signal in which utterances have been detected; a speech recognition result, including written forms and parts of speech, generated by speech recognition of the utterance-detected speech signal; and the start and end times of the utterances or pauses in the utterance-detected speech signal. The sentence boundary detection unit takes some or all of the pauses in the utterance-detected speech signal whose length is at least a predetermined shortest pause length as sentence boundary candidates, classifies the sentence boundary candidates into a plurality of clusters based on their features, and detects as sentence boundaries some or all of the sentence boundary candidates in any cluster that contains a candidate whose pause length is at least a predetermined pause length threshold.

According to the speech recognition device of the present invention, a speech recognition result can be divided into sentences without using a sentence boundary discriminator.

FIG. 1 is a block diagram showing the configuration of the speech recognition device of Non-Patent Document 1.
FIG. 2 is a flowchart showing the discriminator learning operation of the speech recognition device of Non-Patent Document 1.
FIG. 3 is a flowchart showing the speech recognition operation of the speech recognition device of Non-Patent Document 1.
FIG. 4 is a block diagram showing the configuration of the speech recognition device of Embodiment 1.
FIG. 5 is a flowchart showing the operation of the speech recognition device of Embodiment 1.
FIG. 6 is a diagram showing an output example of the speech section detection unit of the speech recognition device of Embodiment 1.
FIG. 7 is a diagram showing an output example of the speech recognition unit of the speech recognition device of Embodiment 1.
FIG. 8 is a block diagram showing the configuration of the sentence boundary detection unit of the speech recognition device of Embodiment 1.
FIG. 9 is a flowchart showing the operation of the sentence boundary detection unit of the speech recognition device of Embodiment 1.
FIG. 10 is a diagram showing an output example of the sentence boundary feature extraction unit of the speech recognition device of Embodiment 1.
FIG. 11 is a diagram showing an output example of the sentence boundary flag assigning unit of the speech recognition device of Embodiment 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

The configuration and operation of the speech recognition device of Embodiment 1 are described below with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing the configuration of the speech recognition device 10 of this embodiment. FIG. 5 is a flowchart showing the operation of the speech recognition device 10 of this embodiment. As shown in FIG. 4, the speech recognition device 10 includes a speech section detection unit 102, a speech recognition unit 103, and a sentence boundary detection unit 104. Because the speech recognition device 10 of this embodiment does not use a sentence boundary discriminator, the sentence-boundary-labeled text corpus storage unit 90 and the sentence boundary discriminator learning unit 91 described above are unnecessary.

The speech section detection unit 102 detects utterances in the speech signal using existing VAD technology and outputs an utterance-detected speech signal (S102). Together with the utterance-detected speech signal, the speech section detection unit 102 outputs the start and end times of the utterances or pauses, with the beginning of the speech signal taken as time 0. Because the start time of an utterance equals the end time of the pause immediately before it, and the end time of an utterance equals the start time of the pause immediately after it, it suffices to have either the utterance start and end times or the pause start and end times. The start and end times of the utterances or pauses are stored in some storage area of the speech recognition device 10, for example in the table-format data structure shown in FIG. 6. FIG. 6 is a diagram showing an output example of the speech section detection unit 102.
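The equivalence between utterance times and pause times noted above can be illustrated with a minimal Python sketch. The function name `pauses_from_utterances`, the `(start, end)` tuple layout, and the `signal_end` argument are assumptions made for illustration; they are not taken from FIG. 6.

```python
from typing import List, Tuple

# (start_sec, end_sec); time 0 is the beginning of the speech signal.
Interval = Tuple[float, float]


def pauses_from_utterances(utterances: List[Interval], signal_end: float) -> List[Interval]:
    """Derive the pause intervals from utterance intervals (hypothetical helper).

    Each pause starts where the previous utterance ends and ends where the
    next utterance starts, so the two representations carry the same information.
    """
    pauses: List[Interval] = []
    cursor = 0.0
    for start, end in sorted(utterances):
        if start > cursor:
            pauses.append((cursor, start))
        cursor = end
    if signal_end > cursor:
        pauses.append((cursor, signal_end))  # trailing pause after the last utterance
    return pauses
```

For example, utterances at 0.5 to 3.0 s and 4.2 to 8.0 s in a 9.0 s signal yield the pauses (0.0, 0.5), (3.0, 4.2), and (8.0, 9.0).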

Next, the speech recognition unit 103 performs speech recognition on the utterance-detected speech signal and outputs a speech recognition result that includes the start and end times of the utterances or pauses, the written forms, and the parts of speech (S103). The speech recognition result is desirably at least a certain size (for example, about 100 utterances or more) so that the speaker's tendencies can be adequately captured in the clustering of sentence boundary candidates described later.

Note that, in addition to the start and end times output by the speech section detection unit 102, the speech recognition unit 103 also outputs the start and end times of short pauses that the speech section detection unit 102 did not detect.

FIG. 7 shows an output example of the speech recognition unit 103. In the output example of FIG. 7, the pause with a start time of 8.66 seconds and an end time of 9.03 seconds is a short pause that does not appear in the output example of FIG. 6, that is, a pause shorter than the predetermined pause length L1 that was not a detection target in step S102.

General VAD technology detects pauses of at least a predetermined pause length L1, which is usually set to about 500 ms, together with the utterances sandwiched between those pauses. Therefore, when the speech section detection unit 102 is built with general VAD technology, it does not detect short pauses shorter than the predetermined pause length L1. Such short pauses are instead detected by the speech recognition unit 103.

Next, the sentence boundary detection unit 104 takes some or all of the pauses in the utterance-detected speech signal whose length is at least a predetermined shortest pause length T1 as sentence boundary candidates, classifies the sentence boundary candidates into a plurality of clusters based on their features, and detects as sentence boundaries some or all of the sentence boundary candidates in any cluster that contains a candidate whose pause length is at least a predetermined pause length threshold T2 (S104).

A detailed configuration example and operation example of the sentence boundary detection unit 104 are described below with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the configuration of the sentence boundary detection unit 104 of the speech recognition device 10 of this embodiment. FIG. 9 is a flowchart showing the operation of the sentence boundary detection unit 104 of the speech recognition device 10 of this embodiment. As shown in FIG. 8, the sentence boundary detection unit 104 includes a sentence boundary candidate identification unit 1041, a sentence boundary feature extraction unit 1042, a sentence boundary candidate clustering unit 1043, and a sentence boundary flag assigning unit 1044.

The sentence boundary candidate identification unit 1041 treats every pause of at least the predetermined shortest pause length T1 (except the pause at the very beginning) as a sentence boundary candidate, attaches a sentence boundary candidate flag to each candidate, and outputs a speech recognition result with sentence boundary candidate flags (S1041). T1 is a parameter that specifies the minimum length of a pause placed at a sentence boundary and is usually preset to about 200 ms. The second column from the right in FIG. 10 shows an example in which sentence boundary candidate flags are attached to the speech recognition result of FIG. 7 with T1 = 200 ms. FIG. 10 is a diagram showing an output example of the sentence boundary feature extraction unit 1042 described later. The sentence boundary candidate identification unit 1041 may also assign, to each extracted sentence boundary candidate, identification information that identifies it.
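A minimal Python sketch of step S1041 follows. The `Token` record, its field names, the use of integer milliseconds, and the treatment of "the pause at the very beginning" as the first entry of the recognition result are all assumptions for illustration; FIG. 7 and FIG. 10 are not reproduced in this text, so the actual table layout may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One row of a recognition result (hypothetical layout)."""
    start_ms: int
    end_ms: int
    surface: Optional[str] = None  # written form; None marks a pause
    pos: Optional[str] = None      # part of speech; None for a pause
    is_candidate: bool = False     # sentence boundary candidate flag


def mark_boundary_candidates(tokens: List[Token], t1_ms: int = 200) -> List[Token]:
    """Flag every pause of at least T1, except a pause at the very beginning."""
    for i, tok in enumerate(tokens):
        is_pause = tok.surface is None
        is_leading = i == 0  # assumption: the leading pause is the first row
        long_enough = (tok.end_ms - tok.start_ms) >= t1_ms
        tok.is_candidate = is_pause and not is_leading and long_enough
    return tokens
```

With T1 = 200 ms, a 370 ms pause between two words is flagged, while a 100 ms pause and the leading pause are not.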

When the shortest pause length T1 is set shorter than the predetermined pause length L1 of the speech section detection unit 102 (that is, T1 < L1), all pauses detected by the speech section detection unit 102 become sentence boundary candidates. Of the short pauses detected by the speech recognition unit 103, some or all become sentence boundary candidates.

Based on predetermined values N and M (integers of 1 or more; N = M is allowed), the sentence boundary feature extraction unit 1042 extracts and outputs, as the sentence boundary feature of each sentence boundary candidate, the set of written forms and parts of speech of the N words immediately before and the M words immediately after that candidate (S1042). N and M are parameters that specify the range over which sentence boundary features appear; for example, N = M = 2. At the beginning of the speech recognition result there are no preceding words, and at the end there are no following words; in those cases the written forms and parts of speech of the nonexistent words are simply not collected. The rightmost column of FIG. 10 shows examples of the sentence boundary features extracted from each sentence boundary candidate with N = M = 2.
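Step S1042 can be sketched in Python as below. The `(surface, pos)` tuple layout, the `w:` and `p:` prefixes that keep written forms and parts of speech apart inside one set, and the choice to skip intervening pauses when counting words are assumptions for illustration; the text only states that the feature is a set of written forms and parts of speech.

```python
from typing import FrozenSet, List, Optional, Tuple

# (surface, pos); (None, None) marks a pause. Hypothetical layout.
Row = Tuple[Optional[str], Optional[str]]


def boundary_feature(rows: List[Row], idx: int, n: int = 2, m: int = 2) -> FrozenSet[str]:
    """Collect written forms and parts of speech around the candidate at rows[idx].

    Takes the n words before and the m words after the candidate (n, m >= 1).
    Words that do not exist near the start or end are simply not collected.
    """
    before = [r for r in rows[:idx] if r[0] is not None][-n:]
    after = [r for r in rows[idx + 1:] if r[0] is not None][:m]
    feature = set()
    for surface, pos in before + after:
        feature.add(f"w:{surface}")  # written form
        feature.add(f"p:{pos}")      # part of speech
    return frozenset(feature)
```

Returning a `frozenset` keeps the feature hashable, which is convenient when features are later compared pairwise.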

The sentence boundary feature extraction unit 1042 may store the identification information of each sentence boundary candidate in association with the sentence boundary feature obtained for it. For example, a "sentence boundary candidate identification information" column may be added at the right end of the table of FIG. 10, and each sentence boundary candidate whose flag is set in the sentence boundary candidate flag column (marked with a circle in the figure) may be given a serial number (for example, a sentence boundary candidate number) as its identification information. Alternatively, the identification information may be written into the sentence boundary candidate flag column so that the identification information itself also serves as the flag.

The sentence boundary candidate clustering unit 1043 classifies the sentence boundary candidates based on the similarity between their sentence boundary features and outputs the resulting sentence boundary candidate clusters (hereinafter also simply called clusters) (S1043).

The sentence boundary candidate clustering unit 1043 first computes the similarity between every pair of sentence boundary candidates. The cosine similarity between the corresponding sentence boundary features can be used as the similarity between two candidates. For example, when the sentence boundary feature of a first sentence boundary candidate is F1 and that of a second sentence boundary candidate is F2, the similarity S between the two candidates can be computed from the sets F1 and F2 as their cosine similarity, where |F| denotes the number of elements of a set F.
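The formula itself does not appear in this text. Assuming the usual cosine similarity between two sets treated as binary indicator vectors, it would read S = |F1 ∩ F2| / sqrt(|F1| · |F2|), which is consistent with the |F| notation above. A minimal Python sketch under that assumption (the function name `set_cosine` and the zero result for an empty feature are illustrative choices, not taken from the source):

```python
import math
from typing import AbstractSet


def set_cosine(f1: AbstractSet[str], f2: AbstractSet[str]) -> float:
    """Cosine similarity of two feature sets viewed as binary vectors.

    S = |F1 & F2| / sqrt(|F1| * |F2|). Returns 0.0 when either set is empty
    (an assumption; the source does not discuss empty features).
    """
    if not f1 or not f2:
        return 0.0
    return len(f1 & f2) / math.sqrt(len(f1) * len(f2))
```

Identical features give S = 1, disjoint features give S = 0, and a one-element feature fully contained in a four-element feature gives S = 0.5.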

The sentence boundary candidate clustering unit 1043 associates each computed similarity with the identification information of the two sentence boundary candidates used to compute it, and runs an existing clustering technique to obtain sentence boundary candidate clusters. Any clustering method may be used as long as it takes similarities between data items as input and does not require the number of clusters to be specified (that is, the number of clusters is determined automatically); for example, the Chinese Whispers method described in Reference Non-Patent Document 1 can be used. The second column from the right in FIG. 11 shows an example of the sentence boundary candidate clusters obtained in step S1043. FIG. 11 is a diagram showing an output example of the sentence boundary flag assigning unit 1044 described later.
(Reference Non-Patent Document 1: Chris Biemann, “Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems,” in Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73-80, 2006.)
In the example of FIG. 11, each sentence boundary candidate is given a number indicating the cluster it belongs to; the two sentence boundary candidates given cluster number 1 belong to the same cluster. The clustering method stores, for the identification information of each cluster, the identification information of the sentence boundary candidates contained in that cluster.
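As a rough illustration of clustering without a preset number of clusters, the following is a simplified Python sketch in the spirit of Chinese Whispers label propagation. It is not the referenced paper's exact procedure: the `min_similarity` edge cut-off, the fixed iteration budget, the seeded shuffle, and the smallest-label tie-break are all assumptions added to keep the example small and deterministic.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def chinese_whispers(
    n_candidates: int,
    similarity: Dict[Tuple[int, int], float],
    min_similarity: float = 0.0,  # hypothetical edge cut-off
    iterations: int = 20,         # hypothetical iteration budget
    seed: int = 0,
) -> List[int]:
    """Return one cluster label per candidate; the cluster count is not fixed in advance."""
    # Build a weighted, undirected graph from the pairwise similarities.
    neighbours: Dict[int, Dict[int, float]] = defaultdict(dict)
    for (a, b), s in similarity.items():
        if a != b and s > min_similarity:
            neighbours[a][b] = s
            neighbours[b][a] = s

    labels = list(range(n_candidates))  # every candidate starts in its own cluster
    order = list(range(n_candidates))
    rng = random.Random(seed)
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not neighbours[node]:
                continue  # isolated candidates keep their own label
            votes: Dict[int, float] = defaultdict(float)
            for other, s in neighbours[node].items():
                votes[labels[other]] += s
            # Adopt the label with the largest summed weight (ties go to the smallest label).
            best = max(sorted(votes), key=lambda lab: votes[lab])
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels
```

The returned labels play the role of the cluster numbers in FIG. 11: candidates with the same label belong to the same cluster.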

Next, the sentence boundary flag assigning unit 1044 treats as a sentence boundary cluster any sentence boundary candidate cluster that contains a sentence boundary candidate whose pause length is at least the predetermined pause length threshold T2, detects some or all of the sentence boundary candidates in the sentence boundary clusters as sentence boundaries, and outputs a speech recognition result with sentence boundaries (S1044).

Specifically, for each cluster, the sentence boundary flag assigning unit 1044 uses the identification information of the sentence boundary candidates in the cluster to determine the length of the pause underlying each candidate, and judges whether that length is at least the pause length threshold T2. If even one pause in the cluster has a length of T2 or more, the sentence boundary flag assigning unit 1044 judges the cluster to be a sentence boundary cluster. The pause length threshold T2 is a parameter that specifies a pause length that can be regarded as a sentence boundary with high probability; for example, it can be set to about 1000 ms. In the example of FIG. 11, with T2 = 1000 ms, a pause given cluster number 2 has a length of at least T2 (1000 ms), so the sentence boundary flag assigning unit 1044 takes the cluster with cluster number 2 as a sentence boundary cluster. The pause length threshold T2 is normally set longer than the pause length L1 used by the speech section detection unit 102 (92); for example, L1 = 500 ms and T2 = 1000 ms.

The sentence boundary flag assigning unit 1044 detects the sentence boundary candidates contained in a sentence boundary cluster as sentence boundaries and attaches a sentence boundary flag to each of them. In the example of FIG. 11, sentence boundary flags are attached to all sentence boundary candidates belonging to cluster number 2. The sentence boundary flag assigning unit 1044 outputs the speech recognition result with sentence boundaries as the final result.
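Step S1044 can be sketched in a few lines of Python. The parallel-list layout (`pause_ms[i]` and `cluster_of[i]` describing candidate i) and the function name are assumptions for illustration; this variant flags every candidate in a sentence boundary cluster, which is the "all" case of "some or all".

```python
from typing import List


def flag_sentence_boundaries(
    pause_ms: List[int], cluster_of: List[int], t2_ms: int = 1000
) -> List[bool]:
    """Flag every candidate whose cluster contains at least one pause of T2 or more."""
    # A cluster becomes a sentence boundary cluster if any member pause reaches T2.
    boundary_clusters = {c for p, c in zip(pause_ms, cluster_of) if p >= t2_ms}
    return [c in boundary_clusters for c in cluster_of]
```

For instance, with pauses of 370, 1200, 600, and 250 ms in clusters 2, 2, 1, and 1, the 370 ms pause is flagged as a sentence boundary because it shares cluster 2 with the 1200 ms pause, while both members of cluster 1 stay unflagged.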

As described above, the sentence boundary detection unit 104 outputs a speech recognition result with sentence boundaries without using a sentence boundary discriminator. Because the sentence boundary detection unit 104 detects sentence boundaries by analyzing the target speech recognition result itself, it can detect them correctly while taking into account the conversation partner, the setting, and each speaker's habits of expression as reflected in that speech recognition result.

<Effects of the speech recognition device 10 of Embodiment 1>
According to the speech recognition device 10 of this embodiment, a speech recognition result can be correctly divided into sentences without using a sentence boundary discriminator learned in advance. Cases such as the one mentioned above, in which sentence boundary accuracy drops sharply for a particular user's speech, become less frequent, so the system becomes more convenient for users. In addition, there is no longer any need to create speech recognition results manually annotated with correct sentence boundary labels for training a sentence boundary discriminator, so the cost to the system operator can be reduced.

<Technical points of the speech recognition apparatus 10 of the first embodiment>
The technical point of the speech recognition apparatus 10 of the present embodiment is that it detects sentence boundaries by exploiting two tendencies: "a pause equal to or longer than a preset pause length threshold T2 (for example, 1000 ms) is highly likely to be a sentence boundary" and "within a single speech recognition result (that is, when the conversation partner, the setting of the speech, and the speaker are the same), the same sentence boundary features appear repeatedly."
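How the two tendencies combine can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the embodiment: the shortest pause length value `T1_MS` is an assumed number, the function name `detect_boundaries` and the tuple layout are hypothetical, and exact matching of features stands in for the similarity-based clustering. Only the 1000 ms example value of T2 comes from the text.

```python
from collections import defaultdict

T1_MS = 200    # shortest pause length for a candidate (assumed value)
T2_MS = 1000   # pause length threshold T2 (example value from the text)


def detect_boundaries(pauses):
    """Return the indices of pauses detected as sentence boundaries.

    pauses: list of (pause_length_ms, feature) tuples, where `feature`
    is any hashable description of the words around the pause.
    """
    # Step 1: pauses of at least T1 become sentence boundary candidates.
    candidates = [i for i, (length, _) in enumerate(pauses) if length >= T1_MS]

    # Step 2: group candidates by feature. Exact matching is a stand-in
    # for similarity-based clustering, used only to keep the sketch short.
    clusters = defaultdict(list)
    for i in candidates:
        clusters[pauses[i][1]].append(i)

    # Step 3: a cluster containing any pause of at least T2 is treated as
    # a sentence boundary cluster; all of its members become boundaries.
    boundaries = []
    for members in clusters.values():
        if any(pauses[i][0] >= T2_MS for i in members):
            boundaries.extend(members)
    return sorted(boundaries)


# Illustrative data only.
pauses = [
    (1200, ("desu", "aux")),     # long pause: marks its cluster as a boundary cluster
    (300, ("desu", "aux")),      # short pause, same feature: also a boundary
    (400, ("no", "particle")),   # short pause, no long pause in its cluster
    (100, ("desu", "aux")),      # below T1: never a candidate
]
result = detect_boundaries(pauses)
```

The long pause supplies the evidence that its feature marks a sentence end, and the repetition of that feature lets the shorter pause at index 1 be detected as a boundary as well, without any discriminator trained in advance.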

<Supplementary note>
The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), a RAM and a ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above and the data required for the processing of that program. The storage location is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device. Data obtained through the processing of the program is stored in the RAM, the external storage device, or the like as appropriate.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data required for the processing of each program are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the present invention. In addition, the processing described in the above embodiment need not be executed only in time series in the order described; it may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes the processing or as otherwise required.

As already described, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of the processing content may instead be realized in hardware.

Claims

A speech recognition apparatus in which an utterance is defined as a speech section between pauses of a predetermined length or longer, the apparatus comprising a sentence boundary detection unit that,
using an utterance-detected speech signal, which is a speech signal in which the utterances have been detected, a speech recognition result that includes notations and parts of speech generated by speech recognition of the utterance-detected speech signal, and the start and end times of the utterances or pauses of the utterance-detected speech signal,
sets some or all of the pauses in the utterance-detected speech signal that are equal to or longer than a predetermined shortest pause length as sentence boundary candidates, classifies the sentence boundary candidates into a plurality of clusters based on feature amounts of the sentence boundary candidates, and detects, as sentence boundaries, some or all of the sentence boundary candidates included in a cluster that contains a sentence boundary candidate whose pause length is equal to or greater than a predetermined pause length threshold.

The speech recognition apparatus according to claim 1, wherein
the sentence boundary detection unit includes
a sentence boundary candidate clustering unit that classifies the sentence boundary candidates into a plurality of clusters based on the similarity between sentence boundary features, each of which is a set of the notations and parts of speech of a predetermined number of words immediately before and after a sentence boundary candidate.

The speech recognition apparatus according to claim 1 or 2, wherein
the sentence boundary detection unit
outputs a speech recognition result with sentence boundaries, in which the sentence boundaries are added to the speech recognition result.

The speech recognition apparatus according to any one of claims 1 to 3, further comprising:
a speech section detection unit that detects the utterances from a speech signal and outputs the utterance-detected speech signal; and
a speech recognition unit that performs speech recognition on the utterance-detected speech signal and outputs the start and end times and the speech recognition result.

A speech recognition method in which an utterance is defined as a speech section between pauses of a predetermined length or longer, the method comprising a step, executed by a speech recognition apparatus, of,
using an utterance-detected speech signal, which is a speech signal in which the utterances have been detected, a speech recognition result that includes notations and parts of speech generated by speech recognition of the utterance-detected speech signal, and the start and end times of the utterances or pauses of the utterance-detected speech signal,
setting some or all of the pauses in the utterance-detected speech signal that are equal to or longer than a predetermined shortest pause length as sentence boundary candidates, classifying the sentence boundary candidates into a plurality of clusters based on feature amounts of the sentence boundary candidates, and detecting, as sentence boundaries, some or all of the sentence boundary candidates included in a cluster that contains a sentence boundary candidate whose pause length is equal to or greater than a predetermined pause length threshold.

A program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 4.