JP2017058507A - Speech recognition device, speech recognition method, and program - Google Patents

Speech recognition device, speech recognition method, and program


Publication number
JP2017058507A
Authority
JP
Japan
Prior art keywords
speech
sentence boundary
speech recognition
sentence
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2015182917A
Other languages
Japanese (ja)
Other versions
JP6495792B2 (en)
Inventor
太一 浅見 (Taichi Asami)
厚志 安藤 (Atsushi Ando)
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2015182917A
Publication of JP2017058507A
Application granted
Publication of JP6495792B2
Legal status: Active

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that can divide a speech recognition result into sentences without using a sentence boundary discriminator.
SOLUTION: The speech recognition device includes a sentence boundary detection unit. An utterance is defined as a speech section between pauses of a prescribed length or longer. The sentence boundary detection unit uses an utterance-detected speech signal (a speech signal in which utterances have been detected), a speech recognition result containing the surface form and part of speech of each word generated by recognizing that signal, and the start and end times of each utterance or pause in that signal. It designates some or all pauses of at least a predetermined shortest pause length as sentence boundary candidates, classifies the candidates into clusters based on their features, and detects as sentence boundaries some or all of the candidates belonging to any cluster that contains a candidate whose pause length is at least a predetermined pause length threshold.
SELECTED DRAWING: Figure 4

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a program that recognize an input speech signal and output the speech recognition result divided into sentences.

When an input speech signal is converted into text by speech recognition, sentence-level natural language processing (for example, document summarization) applied to the recognition result is more likely to produce correct output if the result has already been divided into sentences.

Non-Patent Document 1 discloses a conventional technique for outputting a speech recognition result divided into sentences (hereinafter, a speech recognition result with sentence boundaries). The speech recognition apparatus of Non-Patent Document 1 is described below with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus 9 of Non-Patent Document 1. FIGS. 2 and 3 are flowcharts showing the operation of the apparatus 9: FIG. 2 shows the discriminator learning operation, and FIG. 3 shows the speech recognition operation.

As shown in FIG. 1, the speech recognition apparatus 9 of Non-Patent Document 1 includes a sentence-boundary-labeled text corpus storage unit 90, a sentence boundary discriminator learning unit 91, a speech segment detection unit 92, a speech recognition unit 93, and a sentence boundary identification unit 94. The storage unit 90 stores a large text corpus annotated with correct sentence boundary positions (sentence boundary labels), hereinafter called the sentence-boundary-labeled text corpus. As shown in FIG. 2, the sentence boundary discriminator learning unit 91 learns, from this corpus, the sentence boundary discriminator used to identify sentence boundaries (S91). Any existing classifier capable of binary classification, such as a Support Vector Machine (SVM) or a neural network, can serve as the discriminator, and unit 91 trains it with an existing algorithm appropriate to the chosen classifier type. Step S91 is assumed to be executed in advance of steps S92 to S94 described later.

The speech segment detection unit 92 detects utterances, i.e., speech sections flanked by pauses (hereinafter also called pause sections) of at least a predetermined pause length L1 (for example, about 500 ms), in the speech signal and outputs an utterance-detected speech signal (S92). Unit 92 detects the pauses and utterances in the speech signal with an existing VAD (voice activity detection) technique. The speech recognition unit 93 then recognizes the utterance-detected speech signal and outputs a speech recognition result for each utterance (S93).

In the simplest method, the speech recognition result obtained in step S93 is treated directly as the speech recognition result with sentence boundaries; that is, each utterance is regarded as one sentence. In reality, however, several sentences may be spoken in a single breath, so one utterance can contain multiple sentences, and a breath (a pause of at least the predetermined pause length L1) may fall in the middle of a sentence, splitting one sentence across multiple utterances. The sentence boundary identification unit 94 described below therefore has to detect sentence boundaries explicitly.

Using the sentence boundary discriminator learned in advance in step S91, the sentence boundary identification unit 94 takes the surface forms and parts of speech of the words immediately before and after each word boundary in the speech recognition result as features, determines whether each word boundary is a sentence boundary, and outputs the speech recognition result with sentence boundaries (S94).

Sho Sobue, Keiko Yamamoto, Satoshi Tamura, Satoru Hayamizu, "Evaluation of Discriminative Models for Sentence Boundary Estimation of Speech Recognition Results," Proceedings of the 15th Annual Meeting of the Association for Natural Language Processing, March 2009, pp. 582-585.

Because the conventional technique relies on a sentence boundary discriminator learned in advance, it cannot correctly identify sentence boundaries whose characteristics (the surface forms and parts of speech of the surrounding words) differ from those in the text corpus used for training. Conversational speech between people in particular shows wide variation in sentence boundary characteristics: expressions at sentence boundaries change with the degree of familiarity with the conversation partner, with whether the setting of the speech is formal, with each speaker's habitual sentence-final expressions, and so on. It is difficult to cover such variation in advance, so the accuracy of sentence boundary identification can degrade (for example, almost no sentence boundaries may be detected in a particular user's speech), reducing the convenience of the system.

Moreover, a sentence boundary discriminator is typically trained on data in which every word boundary of a speech recognition result text has been manually labeled as a sentence boundary or not (a sentence boundary label). Because recognition results contain recognition errors, assigning correct labels requires listening to the original audio while annotating, and the large amount of work drives up the cost of building the system.

Accordingly, an object of the present invention is to provide a speech recognition apparatus that can divide a speech recognition result into sentences without using a sentence boundary discriminator.

The speech recognition apparatus of the present invention includes a sentence boundary detection unit. Defining an utterance as a speech section flanked by pauses of at least a predetermined pause length, the sentence boundary detection unit uses an utterance-detected speech signal (a speech signal in which utterances have been detected), a speech recognition result containing surface forms and parts of speech generated by recognizing that signal, and the start and end times of the utterances or pauses in that signal. It designates some or all pauses of at least a predetermined shortest pause length as sentence boundary candidates, classifies the candidates into clusters based on their features, and detects as sentence boundaries some or all of the candidates belonging to any cluster that contains a candidate whose pause length is at least a predetermined pause length threshold.

According to the speech recognition apparatus of the present invention, a speech recognition result can be divided into sentences without using a sentence boundary discriminator.

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus of Non-Patent Document 1.
FIG. 2 is a flowchart showing the discriminator learning operation of the speech recognition apparatus of Non-Patent Document 1.
FIG. 3 is a flowchart showing the speech recognition operation of the speech recognition apparatus of Non-Patent Document 1.
FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus of Embodiment 1.
FIG. 5 is a flowchart showing the operation of the speech recognition apparatus of Embodiment 1.
FIG. 6 is a diagram showing an output example of the speech segment detection unit of the speech recognition apparatus of Embodiment 1.
FIG. 7 is a diagram showing an output example of the speech recognition unit of the speech recognition apparatus of Embodiment 1.
FIG. 8 is a block diagram showing the configuration of the sentence boundary detection unit of the speech recognition apparatus of Embodiment 1.
FIG. 9 is a flowchart showing the operation of the sentence boundary detection unit of the speech recognition apparatus of Embodiment 1.
FIG. 10 is a diagram showing an output example of the sentence boundary feature extraction unit of the speech recognition apparatus of Embodiment 1.
FIG. 11 is a diagram showing an output example of the sentence boundary flag assignment unit of the speech recognition apparatus of Embodiment 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

The configuration and operation of the speech recognition apparatus of Embodiment 1 are described below with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus 10 of this embodiment, and FIG. 5 is a flowchart showing its operation. As shown in FIG. 4, the speech recognition apparatus 10 includes a speech segment detection unit 102, a speech recognition unit 103, and a sentence boundary detection unit 104. Because the apparatus 10 of this embodiment does not use a sentence boundary discriminator, the sentence-boundary-labeled text corpus storage unit 90 and the sentence boundary discriminator learning unit 91 described above are unnecessary.

The speech segment detection unit 102 detects utterances in the speech signal with an existing VAD technique and outputs an utterance-detected speech signal (S102). Along with that signal, it outputs the start and end times of each utterance or pause (time 0 being the beginning of the speech signal). Because the start time of an utterance equals the end time of the pause immediately before it, and its end time equals the start time of the pause immediately after it, it suffices to have at least either the utterance start/end times or the pause start/end times. These times are stored in some storage area of the speech recognition apparatus 10, for example in the table-format data structure shown in FIG. 6. FIG. 6 shows an output example of the speech segment detection unit 102.
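For illustration only (the patent relies on an existing VAD technique and does not specify one), the segmentation behavior described above can be sketched as a minimal energy-based VAD: frames above an energy threshold are treated as voiced, and voiced runs separated by a gap shorter than L1 are merged, so that only pauses of length L1 or more delimit utterances. The frame size, threshold, and energy criterion here are assumptions.

```python
def detect_utterances(signal, sr, l1=0.5, frame=0.01, thresh=1e-4):
    """Return a list of (start_sec, end_sec) utterances separated by pauses >= l1.

    signal: list of samples; sr: sampling rate in Hz; l1: minimum pause
    length in seconds (about 500 ms in the text); frame/thresh: assumed
    frame size and energy threshold for this sketch.
    """
    hop = int(sr * frame)
    n_frames = len(signal) // hop
    voiced = []
    for i in range(n_frames):
        chunk = signal[i * hop:(i + 1) * hop]
        energy = sum(x * x for x in chunk) / len(chunk)
        voiced.append(energy > thresh)

    # Collect raw voiced runs as (start_frame, end_frame).
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_frames))

    # Merge runs whose gap (pause) is shorter than l1: only pauses of
    # at least l1 seconds separate utterances.
    min_gap = int(round(l1 / frame))
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s * frame, e * frame) for s, e in merged]
```

With a 600 ms silence between two bursts the sketch reports two utterances; with a 200 ms silence (shorter than L1) it merges them into one, matching the behavior the text attributes to the VAD step.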

Next, the speech recognition unit 103 recognizes the utterance-detected speech signal and outputs a speech recognition result containing the start and end times of each utterance or pause together with surface forms and parts of speech (S103). The recognition result should cover at least a certain amount of speech (for example, about 100 utterances or more) so that the speaker's tendencies can be captured adequately in the sentence boundary candidate clustering described later.

Note that the speech recognition unit 103 outputs, in addition to the start and end times output by the speech segment detection unit 102, the start and end times of short pauses that the speech segment detection unit 102 did not detect.

FIG. 7 shows an output example of the speech recognition unit 103. In this example, the pause with start time 8.66 s and end time 9.03 s is a short pause not included in the output example of FIG. 6 (a pause shorter than the predetermined pause length L1, which is not a detection target in step S102).

A general VAD technique detects pauses of at least a predetermined pause length L1, normally set to about 500 ms (and the utterances between them). Therefore, when the speech segment detection unit 102 is built with a general VAD technique, short pauses of less than L1 are not detected by unit 102; such short pauses are instead detected by the speech recognition unit 103.

Next, the sentence boundary detection unit 104 designates some or all pauses of at least a predetermined shortest pause length T1 in the utterance-detected speech signal as sentence boundary candidates, classifies the candidates into clusters based on their features, and detects as sentence boundaries some or all of the candidates belonging to any cluster that contains a candidate whose pause length is at least a predetermined pause length threshold T2 (S104).

A detailed configuration and operation example of the sentence boundary detection unit 104 is described below with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the configuration of the sentence boundary detection unit 104 of the speech recognition apparatus 10 of this embodiment, and FIG. 9 is a flowchart showing its operation. As shown in FIG. 8, the sentence boundary detection unit 104 includes a sentence boundary candidate identification unit 1041, a sentence boundary feature extraction unit 1042, a sentence boundary candidate clustering unit 1043, and a sentence boundary flag assignment unit 1044.

The sentence boundary candidate identification unit 1041 designates every pause of at least a predetermined shortest pause length T1 (excluding the initial pause) as a sentence boundary candidate, assigns a sentence boundary candidate flag to each, and outputs the speech recognition result with candidate flags (S1041). T1 is a parameter defining the minimum length of a pause placed at a sentence boundary and is normally preset to about T1 = 200 ms. The second column from the right of FIG. 10 shows the candidate flags assigned to the recognition result of FIG. 7 with T1 = 200 ms. FIG. 10 is a diagram showing an output example of the sentence boundary feature extraction unit 1042 described later. Unit 1041 may also assign each extracted candidate identification information that uniquely identifies it.
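A minimal sketch of step S1041, under an assumed data layout (the patent does not prescribe one): the recognizer output is taken as a time-ordered list of (kind, start_sec, end_sec) entries, where kind is "word" or "pause", and every pause of length at least T1, except a pause at the very beginning, is flagged as a sentence boundary candidate.

```python
T1 = 0.2  # shortest pause length for a candidate (about 200 ms, per the text)

def find_boundary_candidates(entries, t1=T1):
    """Return the indices of pause entries flagged as sentence boundary candidates.

    entries: time-ordered list of (kind, start_sec, end_sec) tuples,
    an assumed layout for illustration.
    """
    candidates = []
    for i, (kind, start, end) in enumerate(entries):
        if kind != "pause" or i == 0:  # skip words and the initial pause
            continue
        if end - start >= t1:
            candidates.append(i)
    return candidates
```

In this sketch a 100 ms pause between words is too short to be a candidate, while a 300 ms pause qualifies; the leading pause is excluded regardless of its length, as the text specifies.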

When the shortest pause length T1 is set shorter than the predetermined pause length L1 of the speech segment detection unit 102 (that is, T1 < L1), every pause detected by unit 102 becomes a sentence boundary candidate, while some or all of the short pauses detected by the speech recognition unit 103 become candidates.

The sentence boundary feature extraction unit 1042 extracts, based on predetermined values N and M (integers of 1 or more; N = M is allowed), the set of surface forms and parts of speech of the N words immediately before and the M words immediately after each candidate as that candidate's sentence boundary feature, and outputs it (S1042). N and M are parameters specifying the range over which sentence boundary characteristics appear; for example, N = M = 2. Note that there is no preceding word at the very beginning of the recognition result and no following word at the very end; in those cases, the surface form and part of speech of the missing word are simply not obtained. The rightmost column of FIG. 10 shows examples of sentence boundary features extracted from each candidate with N = M = 2.
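An illustrative sketch of step S1042 with an assumed data layout: each word is a (surface, pos) pair, and the feature of a candidate is the set of surface forms and parts of speech of the N words before and M words after the candidate pause, with missing context at the edges simply skipped.

```python
def boundary_features(words, boundary_index, n=2, m=2):
    """words: time-ordered list of (surface, pos) pairs; the candidate pause
    sits between words[boundary_index - 1] and words[boundary_index].
    Returns the sentence boundary feature as a set of tagged items.
    """
    context = (words[max(0, boundary_index - n):boundary_index]  # N words before
               + words[boundary_index:boundary_index + m])       # M words after
    features = set()
    for surface, pos in context:
        features.add(("surface", surface))
        features.add(("pos", pos))
    return features
```

The words and part-of-speech tags in the test below are hypothetical; the point is only that the feature is a set, so repeated surface forms or tags collapse, which is what makes the set-based similarity of the next step well defined.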

The sentence boundary feature extraction unit 1042 may store each candidate's identification information in association with the obtained sentence boundary feature. For example, a "sentence boundary candidate identification information" column could be added at the right end of the table of FIG. 10, and each candidate whose candidate flag is set (marked with a circle in the figure) could be given a serial number (for example, a candidate number) as identification information. Alternatively, the identification information could be placed in the candidate flag column so that the identification information itself also serves as the flag.

The sentence boundary candidate clustering unit 1043 classifies the candidates based on the similarity between their sentence boundary features and outputs the resulting sentence boundary candidate clusters (hereinafter also simply called clusters) (S1043).

The clustering unit 1043 first computes the similarity between every pair of sentence boundary candidates. The cosine similarity between the corresponding sentence boundary features can be used as this similarity. For example, with F1 the sentence boundary feature of a first candidate and F2 that of a second candidate, the similarity S between the two candidates can be computed by the following formula, where |F| denotes the number of elements of the set F.

S = |F1 ∩ F2| / ( √|F1| · √|F2| )
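The set-based cosine similarity just described can be transcribed directly, representing each sentence boundary feature as a Python set (the function name is illustrative):

```python
import math

def boundary_similarity(f1, f2):
    """Cosine similarity S = |F1 ∩ F2| / (sqrt(|F1|) * sqrt(|F2|))
    between two sentence boundary feature sets."""
    if not f1 or not f2:  # guard for an empty feature set (an edge-case assumption)
        return 0.0
    return len(f1 & f2) / (math.sqrt(len(f1)) * math.sqrt(len(f2)))
```

For two four-element features sharing two items, S = 2 / (2 · 2) = 0.5; identical features give S = 1.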

The clustering unit 1043 associates each computed similarity with the identification information of the two candidates used to compute it, runs an existing clustering technique, and obtains the sentence boundary candidate clusters. Any clustering method that takes pairwise similarities as input and does not require the number of clusters to be set (the number of clusters is determined automatically) may be used; for example, the Chinese Whispers method described in Reference Non-Patent Document 1 can be used. The second column from the right of FIG. 11 shows an example of the sentence boundary candidate clusters obtained in step S1043. FIG. 11 is a diagram showing an output example of the sentence boundary flag assignment unit 1044 described later.
(Reference Non-Patent Document 1: Chris Biemann, "Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems," in Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73-80, 2006.)
In the example of FIG. 11, each candidate is given a number indicating the cluster to which it belongs; the two candidates assigned cluster number 1 belong to the same cluster. Through the clustering, the identification information of the candidates contained in each cluster is stored in association with the identification information that identifies that cluster.
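As an illustration of the kind of clustering this step calls for, the following is a simplified sketch of the Chinese Whispers algorithm (Biemann, 2006), operating on pairwise candidate similarities. The edge-pruning threshold `min_sim`, the iteration count, and the tie-breaking are assumptions of this sketch, not details from the patent or the reference; the key property, that the number of clusters emerges on its own, is preserved.

```python
import random

def chinese_whispers(sim, min_sim=0.1, iterations=20, seed=0):
    """sim: dict mapping (i, j) candidate-id pairs to a similarity score.
    Returns {candidate_id: cluster_label}; the number of distinct labels
    (clusters) is determined automatically."""
    rng = random.Random(seed)
    nodes = sorted({n for pair in sim for n in pair})
    # Build an undirected weighted adjacency list, dropping weak edges.
    adj = {n: {} for n in nodes}
    for (i, j), w in sim.items():
        if i != j and w >= min_sim:
            adj[i][j] = w
            adj[j][i] = w
    labels = {n: n for n in nodes}  # start with one cluster per node
    for _ in range(iterations):
        order = nodes[:]
        rng.shuffle(order)  # visit nodes in random order each pass
        for n in order:
            if not adj[n]:
                continue
            # Adopt the label with the highest total edge weight among neighbors.
            weight = {}
            for nb, w in adj[n].items():
                weight[labels[nb]] = weight.get(labels[nb], 0.0) + w
            labels[n] = max(weight, key=weight.get)
    return labels
```

On two tightly connected groups of candidates joined only by a weak (pruned) edge, the sketch settles on one label per group, so strongly similar boundary candidates end up in the same cluster without the cluster count ever being specified.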

Next, the sentence boundary flag assignment unit 1044 designates as sentence boundary clusters those candidate clusters that contain a candidate whose pause length is at least a predetermined pause length threshold T2, detects some or all of the candidates in the sentence boundary clusters as sentence boundaries, and outputs the speech recognition result with sentence boundaries (S1044).

Specifically, for each cluster, the flag assignment unit 1044 determines, from the identification information of each candidate in the cluster, the length of the pause that became that candidate, and checks whether that length is at least the pause length threshold T2. If even one pause in the cluster has length T2 or more, unit 1044 judges that cluster to be a sentence boundary cluster. T2 is a parameter defining a pause length that can be regarded as a sentence boundary with high probability; for example, T2 can be set to about 1000 ms. In the example of FIG. 11, with T2 = 1000 ms, the pause in the cluster assigned cluster number 2 is at least T2 (1000 ms) long, so unit 1044 designates cluster 2 as a sentence boundary cluster. Normally, T2 is set to a value longer than the pause length L1 detected by the speech segment detection unit 102 (92); for example, L1 = 500 ms and T2 = 1000 ms.

The flag assignment unit 1044 detects the candidates contained in the sentence boundary clusters as sentence boundaries and assigns them sentence boundary flags. In the example of FIG. 11, every candidate belonging to cluster number 2 is given a sentence boundary flag. Unit 1044 then outputs the speech recognition result with sentence boundaries as the final result.
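An illustrative sketch of step S1044 with assumed data structures: given each candidate's cluster assignment and pause length, any cluster containing at least one pause of length T2 or more is a sentence boundary cluster, and every candidate in such a cluster is flagged as a sentence boundary.

```python
T2 = 1.0  # pause length threshold (about 1000 ms, per the text)

def flag_boundaries(cluster_of, pause_len, t2=T2):
    """cluster_of: {candidate_id: cluster_id}; pause_len: {candidate_id: seconds}.
    Returns the set of candidate ids detected as sentence boundaries."""
    # Clusters containing at least one pause of length >= t2.
    boundary_clusters = {cluster_of[c] for c, length in pause_len.items()
                         if length >= t2}
    # Every candidate in a boundary cluster becomes a sentence boundary.
    return {c for c, cl in cluster_of.items() if cl in boundary_clusters}
```

This mirrors the FIG. 11 example: one long pause (1.2 s, at least T2) puts its whole cluster into the boundary set, so a short 0.4 s pause in the same cluster is also detected as a sentence boundary, while candidates in clusters with only short pauses are not.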

As described above, the sentence boundary detection unit 104 outputs a speech recognition result with sentence boundaries without using a sentence boundary discriminator. Because unit 104 detects sentence boundaries by analyzing the target speech recognition result itself, it can detect boundaries correctly while accounting for the conversation partner, the setting of the speech, and each speaker's habits of expression in that recognition result.

<Effects produced by the speech recognition apparatus 10 according to the first embodiment>
According to the speech recognition apparatus 10 of this embodiment, a speech recognition result can be correctly divided into sentences without using a pre-trained sentence boundary classifier. Cases such as the large drop in sentence boundary accuracy for a specific user's speech, as described above, are reduced, so the convenience of the system for users improves. In addition, there is no longer any need to create speech recognition results manually labeled with correct sentence boundaries for training a sentence boundary classifier, which reduces the cost borne by the system operator.

<Technical points of the speech recognition apparatus 10 of the first embodiment>
The technical point of the speech recognition apparatus 10 of this embodiment is that sentence boundaries are detected by exploiting two tendencies: "a pause longer than a preset pause length threshold T2 (for example, 1000 ms) is highly likely to be a sentence boundary," and "within a single speech recognition result (that is, when the conversation partner, the setting of the utterance, and the speaker are the same), the same sentence boundary features appear repeatedly."
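As a rough illustration of the second tendency (repetition of the same sentence boundary features within one recognition result), candidates could be grouped by the overlap of the word/part-of-speech sets around each pause. The Jaccard similarity, the greedy grouping, and the threshold below are illustrative assumptions, not the embodiment's actual clustering method:

```python
# Illustrative sketch only: greedy clustering of sentence boundary candidates
# by Jaccard similarity of their surrounding word/POS feature sets.

def jaccard(a, b):
    """Set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical feature sets: tokens ("surface/POS") around each candidate pause.
features = [
    {"desu/AUX", "ne/PART", "eeto/FILLER"},
    {"desu/AUX", "ne/PART", "hai/INTERJ"},
    {"node/CONJ", "maa/FILLER"},
]

THRESHOLD = 0.3  # hypothetical similarity threshold
clusters = []    # each cluster keeps the indices of its member candidates
for i, f in enumerate(features):
    for cl in clusters:
        # Compare against the first member of each existing cluster.
        if jaccard(features[cl[0]], f) >= THRESHOLD:
            cl.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # prints [[0, 1], [2]]: candidates 0 and 1 share features
```

Recurring sentence-final patterns (e.g. the same auxiliary + particle before a pause) thus collapse into one cluster, which is what lets a single long pause in that cluster license all of its members as sentence boundaries.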

<Supplementary note>
The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and so on), RAM and ROM as memory, an external storage device that is a hard disk, and a bus connecting these so that data can be exchanged among the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A physical entity equipped with such hardware resources includes a general-purpose computer.

The external storage device of the hardware entity stores the programs needed to realize the functions described above and the data needed for processing those programs (the storage is not limited to an external storage device; for example, the programs may be stored in ROM, a read-only storage device). Data obtained by processing these programs is stored as appropriate in the RAM, the external storage device, and so on.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data needed for processing each program are read into memory as necessary, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components expressed above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above, and modifications can be made as appropriate without departing from the spirit of the invention. The processes described in the embodiments above may be executed not only in time series in the order described, but also in parallel or individually, according to the processing capability of the apparatus executing them or as needed.

As already noted, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes processing according to the read program. As other ways of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information provided for processing by an electronic computer that conforms to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may be realized in hardware.

Claims (6)

A speech recognition apparatus comprising:
a sentence boundary detection unit that, where a vocalized section sandwiched between pauses each of at least a predetermined pause length is referred to as an utterance,
uses an utterance-detected speech signal that is a speech signal in which the utterances have been detected, a speech recognition result containing written forms and parts of speech generated by speech recognition of the utterance-detected speech signal, and the start and end times of the utterances or the pauses of the utterance-detected speech signal,
to take some or all of the pauses in the utterance-detected speech signal whose length is at least a predetermined minimum pause length as sentence boundary candidates, classify the sentence boundary candidates into a plurality of clusters based on feature values of the sentence boundary candidates, and detect, as sentence boundaries, some or all of the sentence boundary candidates included in a cluster that contains a sentence boundary candidate whose pause length is at least a predetermined pause length threshold.
The speech recognition apparatus according to claim 1, wherein
the sentence boundary detection unit includes
a sentence boundary candidate clustering unit that classifies the sentence boundary candidates into a plurality of clusters based on similarity between sentence boundary features, each of which is the set of written forms and parts of speech of a predetermined number of words immediately before and after a sentence boundary candidate.
The speech recognition apparatus according to claim 1 or 2, wherein
the sentence boundary detection unit
outputs a speech recognition result with sentence boundaries, in which the detected sentence boundaries are added to the speech recognition result.
The speech recognition apparatus according to any one of claims 1 to 3, comprising:
a speech section detection unit that detects the utterances from a speech signal and outputs the utterance-detected speech signal; and
a speech recognition unit that performs speech recognition on the utterance-detected speech signal and outputs the start and end times and the speech recognition result.
A speech recognition method executed by a speech recognition apparatus, where a vocalized section sandwiched between pauses each of at least a predetermined pause length is referred to as an utterance, the method comprising a step of:
using an utterance-detected speech signal that is a speech signal in which the utterances have been detected, a speech recognition result containing written forms and parts of speech generated by speech recognition of the utterance-detected speech signal, and the start and end times of the utterances or the pauses of the utterance-detected speech signal,
taking some or all of the pauses in the utterance-detected speech signal whose length is at least a predetermined minimum pause length as sentence boundary candidates, classifying the sentence boundary candidates into a plurality of clusters based on feature values of the sentence boundary candidates, and detecting, as sentence boundaries, some or all of the sentence boundary candidates included in a cluster that contains a sentence boundary candidate whose pause length is at least a predetermined pause length threshold.
A program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 4.
JP2015182917A 2015-09-16 2015-09-16 Speech recognition apparatus, speech recognition method, and program Active JP6495792B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2015182917A JP6495792B2 (en) 2015-09-16 2015-09-16 Speech recognition apparatus, speech recognition method, and program


Publications (2)

Publication Number Publication Date
JP2017058507A true JP2017058507A (en) 2017-03-23
JP6495792B2 JP6495792B2 (en) 2019-04-03

Family

ID=58391467

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2015182917A Active JP6495792B2 (en) 2015-09-16 2015-09-16 Speech recognition apparatus, speech recognition method, and program

Country Status (1)

Country Link
JP (1) JP6495792B2 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002341891A (en) * 2001-05-14 2002-11-29 Nec Corp Speech recognition device and speech recognition method
JP2010230695A (en) * 2007-10-22 2010-10-14 Toshiba Corp Speech boundary estimation apparatus and method
JP2010257425A (en) * 2009-04-28 2010-11-11 Nippon Hoso Kyokai <Nhk> Topic boundary detection device and computer program


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
祖父江翔 他: "音声認識結果の文境界推定における識別モデルの評価", 言語処理学会第15回年次大会発表論文集, JPN6018030120, 2 March 2009 (2009-03-02), pages 582 - 585 *
鈴木伸尚 他: "文単位で分割されたテキストで学習した言語モデルによる単語信頼度を用いた文境界検出", FIT2011 第10回情報科学技術フォーラム講演論文集 第2分冊, JPN6018030121, 22 August 2011 (2011-08-22), pages 35 - 38 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018174250A1 (en) 2017-03-24 2018-09-27 三菱ケミカル株式会社 Prepreg and fiber-reinforced composite material
WO2019007245A1 (en) * 2017-07-04 2019-01-10 阿里巴巴集团控股有限公司 Processing method, control method and recognition method, and apparatus and electronic device therefor
JP2020024277A (en) * 2018-08-07 2020-02-13 国立研究開発法人情報通信研究機構 Data segmentation device
CN110689877A (en) * 2019-09-17 2020-01-14 华为技术有限公司 Voice end point detection method and device
CN110942764A (en) * 2019-11-15 2020-03-31 北京达佳互联信息技术有限公司 Stream type voice recognition method
CN110942764B (en) * 2019-11-15 2022-04-22 北京达佳互联信息技术有限公司 Stream type voice recognition method
CN112786023A (en) * 2020-12-23 2021-05-11 竹间智能科技(上海)有限公司 Mark model construction method and voice broadcasting system

Also Published As

Publication number Publication date
JP6495792B2 (en) 2019-04-03

Similar Documents

Publication Publication Date Title
JP6495792B2 (en) Speech recognition apparatus, speech recognition method, and program
CN109065031B (en) Voice labeling method, device and equipment
CN109887497B (en) Modeling method, device and equipment for speech recognition
US11282524B2 (en) Text-to-speech modeling
US10475484B2 (en) Method and device for processing speech based on artificial intelligence
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
US10573297B2 (en) System and method for determining the compliance of agent scripts
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
JP2017058483A (en) Voice processing apparatus, voice processing method, and voice processing program
JP2016536652A (en) Real-time speech evaluation system and method for mobile devices
WO2019065263A1 (en) Pronunciation error detection device, method for detecting pronunciation error, and program
CN104464734A (en) Simultaneous speech processing apparatus, method and program
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN112825249A (en) Voice processing method and device
JP2022120024A (en) Audio signal processing method, model training method, and their device, electronic apparatus, storage medium, and computer program
JP6812381B2 (en) Voice recognition accuracy deterioration factor estimation device, voice recognition accuracy deterioration factor estimation method, program
JP2018081169A (en) Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
JP7409381B2 (en) Utterance section detection device, utterance section detection method, program
JP5997813B2 (en) Speaker classification apparatus, speaker classification method, and speaker classification program
CN112259084A (en) Speech recognition method, apparatus and storage medium
JP6486789B2 (en) Speech recognition apparatus, speech recognition method, and program
JP7279800B2 (en) LEARNING APPARATUS, ESTIMATION APPARATUS, THEIR METHOD, AND PROGRAM
JP2018132678A (en) Turn-taking timing identification apparatus, turn-taking timing identification method, program and recording medium
JP2016162163A (en) Information processor and information processing program

Legal Events

Date Code Title Description
A621: Written request for application examination (JAPANESE INTERMEDIATE CODE: A621), effective date: 20170829
A977: Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007), effective date: 20180711
A131: Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131), effective date: 20180807
A521: Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523), effective date: 20180919
TRDD: Decision of grant or rejection written
A01: Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01), effective date: 20190305
A61: First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61), effective date: 20190307
R150: Certificate of patent or registration of utility model (JAPANESE INTERMEDIATE CODE: R150), ref document number: 6495792, country of ref document: JP