JP2015169834A

JP2015169834A - Speech analysis method, speech analysis program, and speech analysis apparatus

Info

Publication number: JP2015169834A
Application number: JP2014045555A
Authority: JP
Inventors: Masakiyo Tanaka (田中 正清); Toshiyuki Fukuoka (福岡 俊之); Kentaro Murase (村瀬 健太郎)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-03-07
Filing date: 2014-03-07
Publication date: 2015-09-28
Anticipated expiration: 2034-03-07
Also published as: JP6281330B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech analysis method, a speech analysis program, and a speech analysis apparatus that accurately associate speech data with text. SOLUTION: The speech analysis method causes a computer to execute the steps of: dividing speech data into a plurality of sections; extracting a plurality of words from text; performing speech recognition on each of the plurality of sections on the basis of the extracted words; selecting, from the extracted words, one or more words whose ease or certainty of speech recognition is at or above a certain level, as reference words; selecting, from the plurality of sections, the sections in which the speech of a reference word is recognized; selecting, from the selected sections, one or more sections in which the speech of a word other than the reference word is also recognized, as a first section corresponding to the text; and detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word other than the reference word is recognized, and selecting the sections within that range as a second section corresponding to the text.

Description

The present disclosure relates to a speech analysis method, a speech analysis program, and a speech analysis apparatus.

With advances in audio and video recording and playback technology, minutes systems have become widespread that reproduce audio or video data recorded at a meeting with a high sense of presence and also allow the minutes to be consulted. By using a minutes system, users (for example, people who were absent from the meeting) can learn the detailed content of the meeting.

Regarding minutes systems, Patent Document 1, for example, describes associating audio data or video data with document data on the basis of the similarity between words.

Patent Document 1: International Publication No. 2005/027092

Associating the audio data of a meeting with the text of the minutes lets the user consult, at any time, the part of the minutes that corresponds to what is being said, which makes the minutes system more convenient. If the association is inaccurate, however, it actually becomes harder for the user to grasp the content of the meeting. This problem is not limited to minutes systems; it also arises in other systems that record and play back talks, lectures, and the like.

The present disclosure has been made in view of the above problem, and its object is to provide a speech analysis method, a speech analysis program, and a speech analysis apparatus that accurately associate speech data with text.

In the speech analysis method described in this specification, a computer executes the steps of: dividing speech data into a plurality of sections; extracting a plurality of words from text; performing speech recognition on each of the plurality of sections on the basis of the extracted words; selecting, from the extracted words, one or more words having at least a certain level of ease or certainty of speech recognition, as reference words; selecting, from the plurality of sections, the sections in which the speech of a reference word is recognized; selecting, from the selected sections, one or more sections in which the speech of a word other than the reference word among the plurality of words is recognized, as a first section corresponding to the text; and detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word other than the reference word among the plurality of words is recognized, and selecting the sections within the range as a second section corresponding to the text.

The speech analysis program described in this specification causes a computer to execute processing of: dividing speech data into a plurality of sections; extracting a plurality of words from text; performing speech recognition on each of the plurality of sections on the basis of the extracted words; selecting, from the extracted words, one or more words having at least a certain level of ease or certainty of speech recognition, as reference words; selecting, from the plurality of sections, the sections in which the speech of a reference word is recognized; selecting, from the selected sections, one or more sections in which the speech of a word other than the reference word among the plurality of words is recognized, as a first section corresponding to the text; and detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word other than the reference word among the plurality of words is recognized, and selecting the sections within the range as a second section corresponding to the text.

The speech analysis apparatus described in this specification includes: a dividing unit that divides speech data into a plurality of sections; an extraction unit that extracts a plurality of words from text; a speech recognition processing unit that performs speech recognition on each of the plurality of sections on the basis of the extracted words; a first selection unit that selects, from the extracted words, one or more words having at least a certain level of ease or certainty of speech recognition, as reference words; a second selection unit that selects, from the plurality of sections, the sections in which the speech of a reference word is recognized; a third selection unit that selects, from the selected sections, one or more sections in which the speech of a word other than the reference word among the plurality of words is recognized, as a first section corresponding to the text; and a fourth selection unit that detects, starting from a section adjacent to the first section, a range of sections in which the speech of a word other than the reference word among the plurality of words is recognized, and selects the sections within the range as a second section corresponding to the text.

Speech data can be accurately associated with text.

FIG. 1 is a configuration diagram showing a speech analysis apparatus according to an embodiment.
FIG. 2 is a configuration diagram showing an example of the functional configuration of the speech analysis apparatus.
FIG. 3 is a diagram showing an example of document data and of the document data after division.
FIG. 4 is a table showing an example of the extracted word database.
FIG. 5 is a diagram showing examples of dividing speech data.
FIG. 6 is a diagram showing speech data and the speech data after division.
FIG. 7 is a diagram showing a comparative example of associating texts with sections of speech data.
FIG. 8 is a diagram showing another comparative example of associating texts with sections of speech data.
FIG. 9 is a diagram showing an example of selecting main section candidates.
FIG. 10 is a diagram showing an example of selecting a main section.
FIG. 11 is a table showing an example of the analysis result database.
FIG. 12 is a diagram showing an example of selecting a sub-section.
FIG. 13 is a flowchart of the speech analysis program according to the embodiment.

FIG. 1 is a configuration diagram showing a speech analysis apparatus 1 according to an embodiment. The speech analysis apparatus 1 analyzes speech data, such as a recording of a meeting, by speech recognition processing and associates it with document data such as minutes.

The speech analysis apparatus 1 is a computer such as a server. It includes a CPU 10, a ROM (Read Only Memory) 11, a RAM (Random Access Memory) 12, an HDD 13, a communication processing unit 14, a portable storage medium drive 15, an input processing unit 16, an image processing unit 17, and the like.

The CPU 10 is an arithmetic processing means and executes the speech analysis method in accordance with the speech analysis program. The CPU 10 is connected to the units 11 to 17 via a bus 18. The speech analysis apparatus 1 is not limited to one that operates by software; hardware such as an application-specific integrated circuit may be used in place of the CPU 10.

The RAM 12 is used as the working memory of the CPU 10. The ROM 11 and the HDD 13 are used as storage means for storing, among other things, the speech analysis program that operates the CPU 10. The communication processing unit 14 is, for example, a network card, and is a communication means that communicates with other devices via a network such as a LAN.

The portable storage medium drive 15 is a device that writes information to and reads information from a portable storage medium 150. Examples of the portable storage medium 150 include a USB (Universal Serial Bus) memory, a CD-R (Compact Disc Recordable), and a memory card. The speech analysis program may be stored on the portable storage medium 150.

The speech analysis apparatus 1 further includes an input device 160 for entering information and a display 170 for displaying images. The input device 160 is an input means such as a keyboard and a mouse, and the entered information is output to the CPU 10 via the input processing unit 16. The display 170 is an image display means such as a liquid crystal display, and the image data to be displayed is output from the CPU 10 to the display 170 via the image processing unit 17. A device that combines these functions, such as a touch panel, may be used in place of the input device 160 and the display 170.

The CPU 10 executes programs stored in the ROM 11, the HDD 13, or the like, or programs that the portable storage medium drive 15 has read from the portable storage medium 150. These programs include not only an OS (Operating System) but also the speech analysis program described above. The programs may also be downloaded from another device via the communication processing unit 14.

When the CPU 10 executes the speech analysis program, a plurality of functions are formed in it. The functions of the speech analysis apparatus 1 are described below.

FIG. 2 is a configuration diagram showing an example of the functional configuration of the speech analysis apparatus 1. FIG. 2 shows an example of the functions formed in the CPU 10 and of the information stored in the HDD 13.

The CPU 10 includes a document data dividing unit 101, a word extraction unit (extraction unit) 102, a reference word selection unit (first selection unit) 103, a speech recognition processing unit 104, and a speech data dividing unit (dividing unit) 105. The CPU 10 further includes a main section candidate selection unit (second selection unit) 106, a main section selection unit (third selection unit) 107, and a sub-section selection unit (fourth selection unit) 108. The HDD 13 stores a document database (document DB) 131, a dictionary database (dictionary DB) 132, an extracted word database (extracted word DB) 133, a speech database (speech DB) 134, and an analysis result database (analysis result DB) 135.

The databases 131 to 135 may instead be stored in a storage means (an HDD or the like) of another device. In that case, the CPU 10 accesses the databases 131 to 135 from the communication processing unit 14 via a network.

The document DB 131 contains document data entered from the input device 160 or document data read from the portable storage medium 150 via the portable storage medium drive 15. In this example the document data is the minutes of a meeting, but it is not limited to this.

The document data dividing unit 101 divides the document data in the document DB 131 into a plurality of texts (text data). The document data dividing unit 101 divides the document data by, for example, generating a plurality of separate texts from the document data, or by adding marks to the document data that delimit the texts within the document.

FIG. 3(a) and FIG. 3(b) show an example of document data and of the document data after division, respectively. The document data dividing unit 101 divides the document data by treating, for example, line breaks, indentation, or blank lines in the minutes as boundaries between texts. As a result, the minutes "Regarding the summer training camp, ... Next time is Friday of next week." are divided into a plurality of texts (A) to (C). The document data dividing unit 101 notifies the word extraction unit 102 that the document data division processing is complete.
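The division rule above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `split_document` and the sample minutes are hypothetical, and the sketch assumes that every line break is a boundary, so that blank lines and indentation are handled simply by skipping empty lines and stripping leading whitespace.

```python
def split_document(document: str) -> list[str]:
    """Split minutes into texts, treating each line break as a boundary.

    Blank lines produce no text, and indentation is stripped, so all three
    boundary types named in the description end up separating texts.
    """
    return [line.strip() for line in document.splitlines() if line.strip()]


# Hypothetical minutes; the real content of FIG. 3 is not reproduced here.
minutes = (
    "Summer camp: opinions wanted by Friday.\n"
    "\n"
    "  Autumn exam period: contact us before entering Building No. 3.\n"
    "Next meeting: Friday of next week.\n"
)
texts = dict(zip("ABC", split_document(minutes)))
```

Keying the resulting texts by an identifier such as (A) to (C) mirrors how later steps refer to them.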

On receiving the notification that the document data division processing is complete, the word extraction unit 102 extracts words from each of the texts (A) to (C) by referring to the dictionary DB 132, in which a plurality of words are registered. The word extraction unit 102 registers the extracted words in the extracted word DB 133 for each of the texts (A) to (C).

FIG. 4 shows an example of the extracted word DB 133. In FIG. 4, "text" indicates the identification information of a text (A to C in this example), and "word" indicates an extracted word. The "reference flag" and "speech data section" columns are described later.

In this example, "summer" (夏季), "training camp" (合宿), "opinion" (意見), "recruitment" (募集), "Friday" (金曜日), and "Kagawa" (香川) are extracted from text (A). From text (B), "autumn" (秋), "exam period" (試験期間), "Building No. 3" (３号館), "entering and leaving" (出入り), "case" (場合), "in advance" (事前), "Sagawa" (佐川), and "contact" (連絡) are extracted, and from text (C), "next time" (次回), "next week" (来週), and "Friday" (金曜日) are extracted. As described later, the extracted words are used for the speech recognition processing of the speech data and for the association with the texts (A) to (C). The word extraction unit 102 notifies the reference word selection unit 103 that the word extraction processing is complete.
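As a rough illustration of dictionary-based extraction, the sketch below returns the dictionary words that occur in a text, in order of first appearance. The function name `extract_words` and the plain substring match are assumptions made for illustration; the description does not say how the dictionary DB 132 is matched against the text, and a practical system would more likely use morphological analysis.

```python
def extract_words(text: str, dictionary: set[str]) -> list[str]:
    """Return the dictionary words found in the text, ordered by first position."""
    hits = [(text.find(word), word) for word in dictionary if word in text]
    return [word for _, word in sorted(hits)]


# The closing sentence of the minutes quoted in the description,
# matched against a small hypothetical dictionary.
dictionary = {"次回", "来週", "金曜日", "合宿"}
words_c = extract_words("次回は来週金曜日。", dictionary)
```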

On receiving the notification that the word extraction processing is complete, the reference word selection unit 103 selects, for each of the texts (A) to (C), one or more of the extracted words that have at least a certain level of ease or certainty of speech recognition, as reference words. As described later, the reference words are used to select the main sections of the speech data corresponding to the texts (A) to (C).

More specifically, the reference word selection unit 103 selects, from the extracted words, one or more words whose number of morae is the largest or is at least a certain number, as reference words. Here, the number of morae corresponds to the number of syllables, that is, the number of vowels and "n" sounds contained in the word.

In this example, the reference word selection unit 103 selects words with five or more morae as reference words. Accordingly, "Friday" (金曜日, kin'yōbi; 5 morae) in texts (A) and (C) and "Building No. 3" (３号館, sangōkan; 6 morae) in text (B) are selected as reference words. The reference word selection unit 103 registers the selected reference words in the extracted word DB 133.

The "reference flag" in the extracted word DB 133 illustrated in FIG. 4 is "1" when the word is a reference word and "0" when it is not. In the rows for text (A), therefore, the reference flag of "Friday" is "1" and the reference flags of the other words are "0". In the rows for text (B), the reference flag of "Building No. 3" is "1" and those of the other words are "0". In the rows for text (C), the reference flag of "Friday" is "1" and those of the other words are "0".

Using the number of morae in this way makes it easy to find, among the words extracted from the texts (A) to (C), words that are easy to recognize. The way of selecting the reference words is not limited to this.
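A mora-based selection can be sketched as below. The sketch assumes that a katakana reading is available for each word (the readings shown are standard readings supplied for illustration, not taken from the publication) and counts every kana except the small combining kana as one mora. That convention also counts the geminate "ッ" as a mora, which differs slightly from the "vowels plus n" wording above but gives the same result for the words of text (A). The names `count_morae` and `select_reference_words` are hypothetical.

```python
# Small kana merge with the preceding kana and do not add a mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(reading: str) -> int:
    """Count morae in a kana reading (one per kana, small combining kana excluded)."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def select_reference_words(readings: dict[str, str], min_morae: int = 5) -> list[str]:
    """Return the words whose reading has at least min_morae morae."""
    return [word for word, kana in readings.items() if count_morae(kana) >= min_morae]


# Words of text (A) with assumed katakana readings.
text_a = {
    "夏季": "カキ",
    "合宿": "ガッシュク",
    "意見": "イケン",
    "募集": "ボシュウ",
    "金曜日": "キンヨウビ",
    "香川": "カガワ",
}
reference_a = select_reference_words(text_a)
```

With the threshold of five morae used in the example, only "金曜日" is selected for text (A).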

The reference word selection unit 103 may instead select, as reference words, one or more words whose score, which indicates word likelihood in speech recognition, is the highest or is at least a certain value. In a method based on a probabilistic model, the score is calculated using, for example, the reciprocal of the distance between the word model and the parameters of the speech, computed by dynamic programming or the like. In this case, therefore, the reference word selection unit 103 obtains the word scores from the speech recognition processing unit 104. This makes it easy to find, among the words extracted from the texts (A) to (C), words whose recognition results are highly certain. The reference word selection unit 103 may also select reference words on the basis of both the number of morae and the score. The reference word selection unit 103 notifies the main section candidate selection unit 106 that the reference word selection processing is complete.

The speech DB 134 contains speech data received from another device via the communication processing unit 14 or speech data read from the portable storage medium 150 via the portable storage medium drive 15. In this example the speech data is a recording of a meeting made with an IC (Integrated Circuit) recorder or the like, but it is not limited to this and may be, for example, the audio contained in video data. The content of the document data (the minutes) in the document DB 131 is written in accordance with the content of the meeting captured in the speech data.

The speech data dividing unit 105 divides the speech data in the speech DB 134 into a plurality of sections. The speech data dividing unit 105 divides the speech data by, for example, separating it into individual data for each section, or by adding marks to the speech data that delimit the sections.

FIG. 5(a) and FIG. 5(b) show division example (1) and division example (2) of the speech data, respectively. The waveforms of the speech data shown in FIG. 5(a) and FIG. 5(b) represent the change in sound intensity over time.

In division example (1), the speech data dividing unit 105 obtains sections (1) to (3) of the speech data by dividing the speech data at every fixed time T. With the method of division example (1), the sections (1) to (3) of the speech data contain equal amounts of data.
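Division example (1) amounts to cutting the recording at multiples of the period T. The following sketch works on durations in seconds rather than on real audio; the function name `split_fixed` and the handling of a shorter final section are assumptions for illustration.

```python
def split_fixed(total_seconds: float, period: float) -> list[tuple[float, float]]:
    """Return (start, end) times of sections of length `period`.

    The last section is shorter when the total is not a multiple of the period.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    sections = []
    start = 0.0
    while start < total_seconds:
        end = min(start + period, total_seconds)
        sections.append((start, end))
        start = end
    return sections


# A hypothetical 90-second recording cut every T = 30 seconds.
sections = split_fixed(90.0, 30.0)
```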

In division example (2), the speech data dividing unit 105 detects, in the speech data, silent intervals M that last for at least a predetermined time, and obtains sections (1) to (3) of the speech data by splitting the speech data at the silent intervals M. With the method of division example (2), each of the sections (1) to (3) of the speech data contains roughly one sentence.
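Division example (2) can be sketched on a list of amplitude samples. The amplitude threshold, the minimum silence length expressed in samples, and the function name `split_on_silence` are all assumptions introduced for illustration; the description only states that silent intervals of at least a predetermined duration delimit the sections.

```python
def split_on_silence(samples: list[float], threshold: float, min_silence: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of sections separated by long silences.

    A sample is silent when its absolute value is at or below `threshold`.
    A section is closed once `min_silence` consecutive silent samples are seen;
    `end` is exclusive and points just past the last non-silent sample.
    """
    sections = []
    start = None      # start index of the section being built, if any
    last_sound = -1   # index of the most recent non-silent sample
    silent_run = 0    # length of the current run of silent samples
    for i, value in enumerate(samples):
        if abs(value) > threshold:
            if start is None:
                start = i
            last_sound = i
            silent_run = 0
        else:
            silent_run += 1
            if start is not None and silent_run >= min_silence:
                sections.append((start, last_sound + 1))
                start = None
    if start is not None:
        sections.append((start, last_sound + 1))
    return sections


# Toy waveform: short pauses stay inside a section, pauses of 3+ samples split it.
wave = [0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0, 0, 0, 3]
found = split_on_silence(wave, threshold=1, min_silence=3)
```

Because short pauses do not close a section, each section tends to cover one spoken sentence, which is the property the description attributes to this method.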

Because the speech data is thus not divided in correspondence with the division of the document data, the speech analysis apparatus 1 does not necessarily associate texts and sections of the speech data one to one. In other words, a plurality of sections of the speech data may be associated with a single text, and different sections of the speech data may be associated with the same text.

FIG. 6(a) and FIG. 6(b) show speech data and the speech data after division, respectively. In this example, the speech data dividing unit 105 divides the speech data by the method of division example (2), but it is not limited to this and may use, for example, the method of division example (1). In FIG. 6(a) and FIG. 6(b), the speech in the speech data is shown as written text.

The speech data dividing unit 105 divides the speech in the speech data, "We will hold the summer training camp again this year. ... The next meeting will be held on Friday of next week.", into sections (1) to (4). Because the division is performed by detecting silent intervals of a certain duration, the speech is divided sentence by sentence.

As a result, section (1) of the speech data contains the speech "We will hold the summer training camp again this year.", and section (2) contains the speech "Destinations, things you want to do, and so on ... please let us know." Section (3) contains the speech "During the autumn exam period, ... please be sure to contact us.", and section (4) contains the speech "The next meeting will be held on Friday of next week."

The speech recognition processing unit 104 performs speech recognition on each of the sections obtained by dividing the speech data, on the basis of the words extracted from the texts (A) to (C). The speech recognition processing uses, for example, a probabilistic-model method based on a hidden Markov model. The speech recognition processing unit 104 refers to the extracted word DB 133, determines, for each of the texts (A) to (C), whether the speech of each extracted word is recognized in the sections (1) to (4) of the speech data, and registers the results in the extracted word DB 133.

In the extracted word DB 133 illustrated in FIG. 4, the columns "1" to "4" under "speech data section" indicate whether each word was recognized in sections (1) to (4) of the speech data ("1": recognized, "0": not recognized). For example, the speech of "summer" and "training camp" in text (A) is recognized in section (1) of the speech data but not in the other sections (2) to (4). The "1" column under "speech data section" for "summer" and "training camp" in text (A) therefore shows "1", and the "2" to "4" columns show "0". The speech recognition processing unit 104 notifies the main section candidate selection unit 106 that the speech recognition processing is complete.
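The per-section columns described above can be pictured as a small mapping from each word to a 0/1 flag per section. In this sketch the recognizer output is assumed to be already available as a set of recognized words per section; the function name `build_section_flags` and that input format are hypothetical, and only the "summer" and "training camp" rows stated in the description are reproduced.

```python
def build_section_flags(words: list[str], recognized: dict[int, set[str]]) -> dict[str, dict[int, int]]:
    """For each word, map every section number to 1 (recognized) or 0 (not recognized)."""
    return {
        word: {section: int(word in hits) for section, hits in sorted(recognized.items())}
        for word in words
    }


# Assumed recognizer output restricted to the two words discussed in the description.
recognized = {1: {"夏季", "合宿"}, 2: set(), 3: set(), 4: set()}
flags = build_section_flags(["夏季", "合宿"], recognized)
```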

The speech analysis apparatus 1 associates the sections (1) to (4) of the speech data with the texts (A) to (C) by referring to the extracted word DB 133, in which the recognition results for the sections (1) to (4) are registered. If the association were made on the basis of the reference words alone, the following problems would arise.

FIG. 7 shows a comparative example of associating texts (A) and (B) with sections (1) to (3) of the speech data. In this example, section (2) of the speech data contains the speech of the reference word "Friday" (see the dotted frame) and is therefore associated with text (A), which contains "Friday" (see the circle). Section (3) of the speech data contains the speech of the reference word "Building No. 3" (see the dotted frame) and is therefore associated with text (B), which contains "Building No. 3" (see the circle).

Section (1) of the speech data, however, contains the speech of neither reference word, "Friday" nor "Building No. 3", so it cannot be associated with text (A) even though its content matches that of text (A). In this example, therefore, the range of speech data sections to be associated with text (A) is not identified correctly.

FIG. 8 shows another comparative example of associating texts (A) to (C) with sections (1) to (4) of the speech data. In this example, section (3) of the speech data contains the speech of the reference word "Building No. 3" (see the dotted frame) and is therefore associated with text (B), which contains "Building No. 3" (see the circle).

However, the reference word "Friday" (see the dotted frame) appears in more than one text, (A) and (C), and in more than one section of the speech data, (2) and (4), so it is difficult to determine which of texts (A) and (C) each of sections (2) and (4) should be associated with (see the "?"). When a reference word is shared in this way by a plurality of texts and a plurality of sections of the speech data, it is difficult to associate the texts with the sections.

Therefore, when only reference words are used, texts (A) to (C) cannot be accurately associated with sections (1) to (4) of the voice data. For this reason, as described below, the speech analysis apparatus 1 selects, from the sections of voice data in which a reference word in the text has been recognized, a section in which other words have also been recognized, as the main section (first section) corresponding to the text. Furthermore, starting from a section adjacent to the main section, the speech analysis apparatus 1 detects the range of sections in which the speech of words other than the reference word has been recognized, and selects the sections within that range as sub-sections (second sections) corresponding to the text. In this way, the speech analysis apparatus 1 accurately associates the main section and the sub-sections of the voice data with the text.

On receiving the completion notifications for the speech recognition processing and the reference word selection processing, the main section candidate selection unit 106 selects, from the plurality of sections (1) to (4) of the voice data, the sections in which the speech of the reference word has been recognized, as candidates for the main section. More specifically, the main section candidate selection unit 106 refers to the extracted word DB 133 and detects, for sections (1) to (4) of the voice data, whether words whose “reference flag” is “1” have been recognized. The main section candidates are selected for each of texts (A) to (C).

More specifically, among the “1” to “4” columns of the “voice data section” field for each word whose “reference flag” is “1” (a reference word), the main section candidate selection unit 106 detects the columns that contain “1”. For example, for the reference word “Friday” in text (A), it detects the “2” and “4” columns of the “voice data section” field.

FIG. 9 shows an example of main section candidate selection, performed after the voice data has been divided as shown in FIG. 6.

In this example, the main section candidate selection unit 106 selects the main section candidates corresponding to text (A). “Friday” (see the dotted frames), the reference word of text (A), is recognized in sections (2) and (4) of the voice data. The main section candidate selection unit 106 therefore selects sections (2) and (4) of the voice data as the main section candidates corresponding to text (A).

For text (B), section (3) is the only section of the voice data containing the reference word “Building No. 3”, so section (3) is selected as the main section candidate corresponding to text (B). Text (C) contains “Friday”, so, as with text (A), sections (2) and (4) of the voice data are selected as the main section candidates corresponding to text (C). The main section candidate selection unit 106 then notifies the main section selection unit 107 that the main section candidate selection processing is complete.
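The candidate selection described above can be sketched as follows. This is a minimal illustration only: the dictionary layout, the variable names, and the function `main_section_candidates` are assumptions introduced here and do not reproduce the actual schema of the extracted word DB 133.

```python
# Illustrative data taken from the example in the text. The layout is an
# assumption and is not the actual schema of extracted word DB 133.
# Each reference word maps to the voice data sections in which it was recognized.
RECOGNIZED_IN = {
    "Friday": {2, 4},
    "Building No. 3": {3},
}

# The single reference word of each text (the word whose reference flag is 1).
REFERENCE_WORD = {"A": "Friday", "B": "Building No. 3", "C": "Friday"}


def main_section_candidates(text_id):
    """Return the sections in which the text's reference word was recognized."""
    return sorted(RECOGNIZED_IN.get(REFERENCE_WORD[text_id], set()))


for text_id in "ABC":
    print(text_id, main_section_candidates(text_id))
```

Texts (A) and (C) share the same candidates here, which is why a further step using words other than the reference word is needed.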

On receiving the completion notification for the main section candidate selection processing, the main section selection unit 107 selects, from the sections selected as main section candidates, one or more sections in which the speech of words other than the reference word has been recognized, as the main section corresponding to each of texts (A) to (C). More specifically, the main section selection unit 107 refers to the extracted word DB 133 and detects, for the voice data sections selected as main section candidates, whether words whose “reference flag” is “0” have been recognized. The main section is selected for each of texts (A) to (C).

FIG. 10 shows an example of main section selection, performed after the main section candidates have been selected as shown in FIG. 9.

In this example, the main section selection unit 107 selects the main section corresponding to text (A) from sections (2) and (4) of the voice data, which were selected as main section candidates. “Opinion” and “Kagawa” (see the dotted frames), which are words of text (A) other than the reference word “Friday”, are recognized in section (2) of the voice data but not in section (4). The main section selection unit 107 therefore selects section (2) as the main section corresponding to text (A) (see the circle). Section (4) is not selected, because it contains no speech of words shared with text (A) other than the reference word (see the cross).

In this way, the main section selection unit 107 selects the main section of the voice data corresponding to each of texts (A) to (C) based not only on the reference word but also on words other than the reference word. Consequently, even when a reference word is duplicated across multiple texts (A) to (C) and multiple sections (1) to (4) of the voice data, the speech analysis apparatus 1 can accurately associate texts with voice data sections, unlike the comparative example shown in FIG. 8.

Because “Building No. 3”, the reference word of text (B), is contained only in section (3) of the voice data, section (3) is selected as the main section corresponding to text (B). For text (C), section (4) of the voice data contains the speech of “next time” and “next week”, which it shares with text (C), and is therefore selected as the main section corresponding to text (C).
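The main section selection for texts (A) to (C) can be sketched as follows. The data are reconstructed from the example (only three of the words of text (B) are named in the description), and the names `RECOGNIZED`, `TEXT_WORDS`, and `select_main_sections` are hypothetical.

```python
# Words recognized in each voice data section, as far as the example names them.
RECOGNIZED = {
    1: {"summer", "training camp"},
    2: {"Friday", "opinion", "Kagawa"},
    3: {"Building No. 3", "autumn", "exam period", "entry and exit"},
    4: {"Friday", "next time", "next week"},
}

# Words extracted from each text, and the reference word of each text.
TEXT_WORDS = {
    "A": {"Friday", "opinion", "Kagawa", "summer", "training camp"},
    "B": {"Building No. 3", "autumn", "exam period", "entry and exit"},
    "C": {"Friday", "next time", "next week"},
}
REFERENCE = {"A": "Friday", "B": "Building No. 3", "C": "Friday"}


def select_main_sections(text_id, candidates):
    """Keep the candidates in which a non-reference word of the text was recognized."""
    other_words = TEXT_WORDS[text_id] - {REFERENCE[text_id]}
    return [s for s in candidates if RECOGNIZED[s] & other_words]


print(select_main_sections("A", [2, 4]))  # candidates of text (A)
print(select_main_sections("C", [2, 4]))  # candidates of text (C)
```

Although texts (A) and (C) have identical candidates, the non-reference words separate them: section (2) remains for text (A) and section (4) for text (C).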

As described above, the main section selection unit 107 selects the main section corresponding to each of texts (A) to (C) from the voice data sections selected as main section candidates, based on whether words shared with the text, other than the reference word, have been recognized. In doing so, the main section selection unit 107 may select as the main section, from among the candidates, the section in which the largest number of words in the text other than the reference word were recognized, or one or more sections in which at least a certain number of such words were recognized.

For example, suppose that a section in which three or more words other than the reference word were recognized is selected as the main section. Section (2) of the voice data is then not selected as a main section, because only two words of text (A) other than “Friday” were recognized in it, namely “opinion” and “Kagawa”. Likewise, section (4) is not selected, because only two words of text (C) other than “Friday” were recognized in it, namely “next time” and “next week”. Section (3), however, is selected as a main section, because seven words of text (B) other than “Building No. 3” were recognized in it, including “autumn”, “exam period”, and “entry and exit”.

Requiring that at least a certain number of words other than the reference word be recognized in a main section makes the selection of the main section stricter. The main section can therefore be selected with high accuracy when many voice data sections have been selected as main section candidates. The same effect is obtained when the section in which the largest number of words other than the reference word were recognized is taken as the main section.
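The two optional criteria, a minimum count and a maximum count, can be sketched as follows. The match counts 2, 2, and 7 come from the example above; the count of 0 for section (4) of text (A) follows from the description of FIG. 10, while the count for section (2) of text (C) is an assumption. The function names are hypothetical.

```python
# Number of recognized non-reference words per candidate section, per text.
MATCH_COUNTS = {
    "A": {2: 2, 4: 0},
    "B": {3: 7},
    "C": {2: 0, 4: 2},  # the value for section 2 is assumed, not stated in the text
}


def select_by_threshold(text_id, min_count):
    """Select every candidate with at least min_count recognized words."""
    counts = MATCH_COUNTS[text_id]
    return [s for s in sorted(counts) if counts[s] >= min_count]


def select_by_max(text_id):
    """Select the candidate(s) with the largest, non-zero number of recognized words."""
    counts = MATCH_COUNTS[text_id]
    best = max(counts.values())
    return [s for s in sorted(counts) if counts[s] == best and best > 0]


for text_id in "ABC":
    print(text_id, select_by_threshold(text_id, 3), select_by_max(text_id))
```

With a threshold of three, only text (B) obtains a main section, which illustrates the trade-off: a stricter threshold reduces false matches but can leave a text without any main section.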

The main section selection unit 107 may also assign, to each word contained in sections (1) to (4) of the voice data, a numerical value used as the criterion for selecting the main section. In this case, for example, assigning a large value to words with many morae and to words contained only in the text concerned, and a small value to other words, allows the main section corresponding to each text to be selected more accurately.
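One way such per-word values might be assigned is sketched below. The description does not give concrete values, so the weights (1 as a base, plus 1 for a long word, plus 1 for a word unique to the text), the threshold `long_word_mora`, and the uniqueness flags are all assumptions made for illustration.

```python
def word_weight(mora_count, unique_to_text, long_word_mora=4):
    """Return a larger value for long words and for words unique to the text."""
    weight = 1
    if mora_count >= long_word_mora:
        weight += 1
    if unique_to_text:
        weight += 1
    return weight


def section_score(recognized_words, word_info):
    """Sum the weights of the text's non-reference words recognized in a section."""
    return sum(word_weight(*word_info[w]) for w in recognized_words if w in word_info)


# Non-reference words of text (A): (mora count, unique to text (A)).
# The uniqueness flags are assumed for this sketch.
TEXT_A_WORDS = {
    "opinion": (3, True),        # i-ke-n
    "Kagawa": (3, True),         # ka-ga-wa
    "summer": (2, True),         # ka-ki
    "training camp": (4, True),  # ga-s-shu-ku
}

print(section_score({"Friday", "opinion", "Kagawa"}, TEXT_A_WORDS))
print(section_score({"summer", "training camp"}, TEXT_A_WORDS))
```

A section would then be selected when its score, rather than its raw word count, reaches the chosen criterion.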

The main section selection unit 107 registers the selected main section in the analysis result DB 135 in order to associate it with the corresponding text (A) to (C). More specifically, the main section selection unit 107 outputs to the analysis result DB 135 the identification information of the text (A to C in this example) and the identification information of the voice data section selected as the main section (1 to 4 in this example).

FIG. 11 shows an example of the analysis result DB 135. In FIG. 11, “text” indicates the identification information of a text (A to C in this example). The “1” to “4” columns of the “voice data section” field indicate whether sections (1) to (4) of the voice data correspond to each of texts (A) to (C) (“1”: corresponds, “0”: does not correspond).

In the example of FIG. 11, among the “1” to “4” columns of the “voice data section” field for text (A), only the “1” and “2” columns contain “1”, which indicates that the voice data sections corresponding to text (A) are sections (1) and (2). Of these, section (2) is the one selected as the main section in the example above, and section (1) is the one selected as a sub-section, as described later.

Also in the example of FIG. 11, in accordance with the example above, section (3) of the voice data is registered in the “3” column of the “voice data section” field for text (B), and section (4) is registered in the “4” column of the “voice data section” field for text (C). After registering the main section, the main section selection unit 107 notifies the sub-section selection unit 108 that the main section selection processing is complete.
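The table of FIG. 11 can be modeled as one row of 0/1 flags per text. The sketch below is an assumed in-memory representation, not the actual storage format of the analysis result DB 135; `register` and `sections_for` are hypothetical helper names.

```python
NUM_SECTIONS = 4

# One row per text; column i holds 1 when section (i + 1) corresponds to the text.
analysis_result = {text_id: [0] * NUM_SECTIONS for text_id in "ABC"}


def register(text_id, section):
    """Record that a voice data section corresponds to a text."""
    analysis_result[text_id][section - 1] = 1


def sections_for(text_id):
    """Return the section numbers associated with a text."""
    return [i + 1 for i, flag in enumerate(analysis_result[text_id]) if flag]


register("A", 2)  # main section of text (A)
register("A", 1)  # sub-section of text (A), selected later
register("B", 3)  # main section of text (B)
register("C", 4)  # main section of text (C)

for text_id, row in analysis_result.items():
    print(text_id, row)
```

A minutes system could then call `sections_for` to find which parts of the recording to play back for a given passage of the minutes.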

On receiving the completion notification for the main section selection processing, the sub-section selection unit 108 detects, starting from the sections adjacent to the main section, the range of sections in which the speech of words in the text (A) to (C) other than the reference word has been recognized. The sub-section selection unit 108 then selects the sections within the detected range as sub-sections (second sections) corresponding to the text. In other words, among sections (1) to (4) of the voice data, the sub-section selection unit 108 detects those that qualify as sub-sections from the sections that run consecutively before and after the main section in time series.

More specifically, the sub-section selection unit 108 refers to the extracted word DB 133 and detects, sequentially from the sections adjacent to the main section, whether words whose “reference flag” is “0” have been recognized. Sub-sections are selected for each of texts (A) to (C).

FIG. 12 shows an example of sub-section selection, performed after the main section has been selected as shown in FIG. 10.

In this example, the sub-section selection unit 108 selects the sub-sections corresponding to text (A). Starting from sections (1) and (3), which are adjacent to main section (2) of text (A), the sub-section selection unit 108 detects the range of sections in which words of text (A) other than the reference word have been recognized.

“Summer” and “training camp” (see the dotted frames), which are words of text (A) other than the reference word, are recognized in section (1), which is adjacent to main section (2) of text (A). The sub-section selection unit 108 therefore selects section (1) of the voice data as a sub-section corresponding to text (A) (see the circle).

Because no further section is adjacent to section (1), which was selected as a sub-section, the sub-section selection unit 108 next checks the other section adjacent to the main section, section (3), for recognition of words other than the reference word. Section (3), however, contains no speech of words shared with text (A) other than the reference word, and so is not selected as a sub-section (see the cross). The sub-section selection unit 108 therefore ends detection of the sub-section range.

In this way, the sub-section selection unit 108 detects the range of sub-sections of the voice data corresponding to each of texts (A) to (C) based on words other than the reference word. The speech analysis apparatus 1 can therefore associate with a text even a voice data section in which the speech of the reference word was not recognized. Unlike the comparative example shown in FIG. 7, the speech analysis apparatus 1 can thus accurately identify the range of voice data sections associated with each of texts (A) to (C).

For text (B), sections (2) and (4), which are adjacent to main section (3), contain no speech of words shared with text (B) and are therefore not selected as sub-sections. For text (C), section (3), which is adjacent to main section (4), contains no speech of words shared with text (C) and is therefore not selected as a sub-section.
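The range detection for sub-sections can be sketched as an outward scan from the main section in both directions, stopping in each direction at the first section that contains none of the text's non-reference words. The data are reconstructed from the example; the function name and data layout are assumptions.

```python
# Words recognized in each voice data section, as far as the example names them.
RECOGNIZED = {
    1: {"summer", "training camp"},
    2: {"Friday", "opinion", "Kagawa"},
    3: {"Building No. 3", "autumn", "exam period", "entry and exit"},
    4: {"Friday", "next time", "next week"},
}


def select_sub_sections(main_section, other_words, recognized):
    """Scan backward and forward from the main section and collect sub-sections."""
    first, last = min(recognized), max(recognized)
    subs = []
    for step in (-1, 1):  # -1: earlier sections, +1: later sections
        section = main_section + step
        while first <= section <= last and recognized[section] & other_words:
            subs.append(section)
            section += step  # move on to the next adjacent section
    return sorted(subs)


text_a_others = {"opinion", "Kagawa", "summer", "training camp"}
text_b_others = {"autumn", "exam period", "entry and exit"}
text_c_others = {"next time", "next week"}

print(select_sub_sections(2, text_a_others, RECOGNIZED))
print(select_sub_sections(3, text_b_others, RECOGNIZED))
print(select_sub_sections(4, text_c_others, RECOGNIZED))
```

Stopping at the first non-matching section keeps the sub-sections contiguous with the main section, which matches the idea of a range rather than a scattered set of sections.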

As described above, starting from the sections adjacent to the main section, the sub-section selection unit 108 detects the range of sections in which the speech of words in the text (A) to (C) other than the reference word has been recognized, and selects the sections within that range as sub-sections corresponding to the text. In doing so, the sub-section selection unit 108 may detect a range of one or more sections in each of which at least a certain number of words in the text other than the reference word were recognized.

For example, when detecting a range of sections in which three or more words other than the reference word were recognized, section (1) of the voice data is not selected as a sub-section, because only two words of text (A) other than “Friday” were recognized in it, namely “summer” and “training camp”.

Requiring that at least a certain number of words other than the reference word be recognized in a sub-section makes detection of the sub-section range stricter. The sub-section selection unit 108 may also assign, to each word contained in sections (1) to (4) of the voice data, a numerical value used as the criterion for detecting the sub-section range. In this case, for example, assigning a large value to words with many morae and to words contained only in the text concerned, and a small value to other words, allows the sub-sections corresponding to each text to be selected more accurately.

When detecting the range of sub-sections, the sub-section selection unit 108 may also count, for each text, the number of recognized words other than the reference word (the number of word matches) and select the detected section as a sub-section corresponding to the text with the largest count.
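This tie-breaking between texts might look like the following sketch. The word sets are reconstructed from the example, the function name is hypothetical, and the handling of ties (the first text in dictionary order wins) is an assumption, since the description does not address ties.

```python
# Non-reference words of each text, as far as the example names them.
TEXT_OTHER_WORDS = {
    "A": {"opinion", "Kagawa", "summer", "training camp"},
    "B": {"autumn", "exam period", "entry and exit"},
    "C": {"next time", "next week"},
}


def best_text_for_section(section_words, text_other_words):
    """Return the text with the most word matches, or None when nothing matches."""
    counts = {
        text_id: len(section_words & words)
        for text_id, words in text_other_words.items()
    }
    best = max(counts, key=counts.get)  # on a tie, the first text wins (assumption)
    return best if counts[best] > 0 else None


print(best_text_for_section({"summer", "training camp"}, TEXT_OTHER_WORDS))
print(best_text_for_section({"unrelated"}, TEXT_OTHER_WORDS))
```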

The sub-section selection unit 108 registers the selected sub-sections in the analysis result DB 135 in order to associate them with the corresponding text (A) to (C). More specifically, the sub-section selection unit 108 outputs to the analysis result DB 135 the identification information of the text (A to C in this example) and the identification information of the voice data sections selected as sub-sections (1 to 4 in this example).

As a result, the correspondence between each of texts (A) to (C) and sections (1) to (4) of the voice data is registered in the analysis result DB 135. In this embodiment, one or more highly relevant sections of the voice data (the main section and the sub-sections) are associated with each text by using the reference word and the other words shared by sections (1) to (4) of the voice data and texts (A) to (C), so an accurate correspondence is registered in the analysis result DB 135.

The analysis result DB 135 obtained in this way is used, for example, when a user of a minutes system refers to the minutes. The user can then refer to the exact passage of the minutes that corresponds to the content of the meeting audio being played back, which makes the minutes system more convenient.

Next, a speech analysis program that executes the speech analysis method described above will be described. FIG. 13 is a flowchart of the speech analysis program according to the embodiment.

First, the voice data dividing unit 105 divides the voice data in the voice DB 134 into a plurality of sections (1) to (4) (step St1). The voice data is divided using, for example, the method shown in FIG. 5(a) or FIG. 5(b).

Next, the document data dividing unit 101 divides the document data in the document DB 131 into a plurality of texts (A) to (C) (step St2). As described above, the document data can be divided by, for example, treating line breaks, indentation, or blank lines in the document as boundaries between texts.
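A minimal sketch of step St2 is shown below. It treats blank lines and indented lines as boundaries between texts, which is one of several possible combinations of the boundary types mentioned above; the function name and the sample minutes are hypothetical.

```python
# Characters treated as indentation, including the full-width space
# often used for paragraph indentation in Japanese documents.
INDENT_CHARS = (" ", "\t", "\u3000")


def split_document(document):
    """Split document data into texts at blank lines and indented lines."""
    texts, current = [], []

    def flush():
        # Close the text being collected, if any.
        if current:
            texts.append(" ".join(current))
            current.clear()

    for line in document.splitlines():
        if not line.strip():
            flush()  # a blank line closes the current text
            continue
        if line[0] in INDENT_CHARS:
            flush()  # an indented line opens a new text
        current.append(line.strip())
    flush()
    return texts


minutes = (
    "Opinions on the summer training camp are due Friday.\n"
    "\n"
    "\u3000Building No. 3 is closed in autumn.\n"
    "Entry and exit are limited during the exam period.\n"
    "\u3000The next meeting is next week.\n"
)
for text in split_document(minutes):
    print(text)
```

Treating every line break as a boundary would be a simpler alternative, at the cost of splitting wrapped paragraphs into several texts.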

Next, the word extraction unit 102 extracts words from each of texts (A) to (C) based on the dictionary DB 132 (step St3). The word extraction unit 102 then registers the extracted words in the extracted word DB 133 (see FIG. 4) (step St4). Steps St2 to St4 may be executed before step St1.

Next, the speech recognition processing unit 104 performs speech recognition processing on each of sections (1) to (4) of the voice data based on the words extracted from each of texts (A) to (C) (step St5). The speech recognition processing unit 104 then registers in the extracted word DB 133, for each of texts (A) to (C), whether each word has been recognized (step St6).

Next, the reference word selection unit 103 selects one of texts (A) to (C) (step St7). The following processing is described on the assumption that text (A) has been selected, but any of texts (A) to (C) may be selected in step St7.

Next, the reference word selection unit 103 selects, from the words extracted from the selected text (A), one or more words (“Friday”) having at least a certain level of ease or certainty of speech recognition as reference words, and registers them in the extracted word DB 133 (step St8).

In step St8, as described above, the reference word selection unit 103 may select as reference words, from the plurality of words extracted from the selected text (A), one or more words whose mora count is the largest or is at least a certain number. Alternatively, the reference word selection unit 103 may select as reference words, from the plurality of words extracted from the selected text (A), one or more words whose score indicating word likelihood in speech recognition is the highest or is at least a certain value.
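The mora-based criterion can be sketched as follows. The sketch assumes that a katakana reading is available for each word and uses a simplified counting rule in which every kana counts as one mora except the small kana that combine with the preceding character; the function names, the rule, and the readings table are assumptions made for illustration.

```python
import unicodedata

# Small kana that merge with the preceding kana into a single mora (e.g. シュ).
SMALL_KANA = set("ァィゥェォャュョヮ")


def count_mora(katakana_reading):
    """Count morae in a katakana reading using a simplified rule."""
    reading = unicodedata.normalize("NFC", katakana_reading)
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def select_reference_words(readings, min_mora=None):
    """Select the words with the most morae, or all words with at least min_mora."""
    mora = {word: count_mora(reading) for word, reading in readings.items()}
    if min_mora is None:
        top = max(mora.values())
        return [word for word in readings if mora[word] == top]
    return [word for word in readings if mora[word] >= min_mora]


# Words of text (A) with their katakana readings.
READINGS = {
    "金曜日": "キンヨウビ",  # Friday
    "意見": "イケン",        # opinion
    "香川": "カガワ",        # Kagawa
    "夏季": "カキ",          # summer
    "合宿": "ガッシュク",    # training camp
}

print(select_reference_words(READINGS))
print(select_reference_words(READINGS, min_mora=4))
```

Under this rule “金曜日” has the most morae of the five words, which is consistent with its role as the reference word of text (A) in the example.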

Next, the main section candidate selection unit 106 selects, from sections (1) to (4) of the voice data, the sections in which the speech of the reference word has been recognized, as main section candidates (step St9). As described with reference to FIG. 9, the main section candidate selection unit 106 selects sections (2) and (4) of the voice data, in which “Friday”, the reference word of text (A), was recognized, as the main section candidates corresponding to text (A).

Next, the main section selection unit 107 selects, from the main section candidates, one or more sections in which the speech of words of text (A) other than the reference word has been recognized, as the main section (first section) corresponding to text (A) (step St10). As described with reference to FIG. 10, from candidate sections (2) and (4), the main section selection unit 107 selects section (2) of the voice data, in which “opinion” and “Kagawa” of text (A) were recognized, as the main section corresponding to text (A).

In step St10, as described above, the main section selection unit 107 may select as the main section, from sections (2) and (4) selected as main section candidates, the section in which the largest number of words of text (A) other than the reference word were recognized. Alternatively, the main section selection unit 107 may select as the main section, from sections (2) and (4), one or more sections in which at least a certain number of words of text (A) other than the reference word were recognized.

Next, the main section selection unit 107 registers the correspondence between text (A) and the main section in the analysis result DB 135 (see FIG. 11) (step St11). Section (2) of the voice data is thereby associated with text (A).

Next, the sub-section selection unit 108 selects sections (1) and (3), which are adjacent to the main section corresponding to text (A) (step St12). The sub-section selection unit 108 then refers to the extracted word DB 133 and determines whether words of text (A) other than the reference word have been recognized in the selected adjacent sections (1) and (3) (step St13).

If a word other than the reference word has been recognized (Yes in step St13), the sub-section selection unit 108 selects the selected adjacent section (1) as a sub-section (step St14). As described with reference to FIG. 12, from sections (1) and (3) adjacent to the main section corresponding to text (A), the sub-section selection unit 108 selects section (1) of the voice data, in which “summer” and “training camp” of text (A) were recognized, as a sub-section corresponding to text (A). In steps St13 and St14, as described above, the sub-section selection unit 108 may select as a sub-section, from sections (1) and (3) adjacent to the main section, a section in which at least a certain number of words of text (A) other than the reference word were recognized.

Next, the sub-section selection unit 108 registers the correspondence between text (A) and the sub-section in the analysis result DB 135 (step St15). Section (1) of the voice data is thereby associated with text (A).

Next, the sub-section selection unit 108 selects the section adjacent to the sub-section (step St16) and executes step St13 again. By repeating steps St13 to St16, the sub-section selection unit 108 detects, starting from the sections adjacent to the main section, the range of sections in which the speech of words of the text other than the reference word has been recognized, and selects the sections within that range as sub-sections corresponding to the text.

If no word other than the reference word has been recognized (No in step St13), the reference word selection unit 103 determines whether all of texts (A) to (C) have been selected (step St17). In other words, the reference word selection unit 103 determines whether steps St8 to St16 above have been executed for all of texts (A) to (C).

When all of the texts (A) to (C) have been selected (Yes in step St17), the reference word selection unit 103 ends the process. When not all of the texts (A) to (C) have been selected (No in step St17), the reference word selection unit 103 selects another text, (B) or (C) (step St18). After the other text is selected, the processes of steps St8 to St13 are performed again for the selected text. The processing of the speech analysis program is executed in this way.

FIGS. 9, 10, and 12 illustrate the case where there is one reference word, but the embodiment is not limited to this, and there may be a plurality of reference words. When a plurality of reference words are selected, the main section candidate selection unit 106 selects, as main section candidates, the sections of the voice data that contain all of the reference words.
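As a small illustration of the multiple-reference-word case, the following Python sketch keeps only the sections in which every reference word was recognized. The function name `select_main_candidates` and the set-based representation of recognition results are assumptions made for this example.

```python
def select_main_candidates(recognized, reference_words):
    """Return indices of sections whose recognized words include
    every reference word (subset test on sets)."""
    reference_words = set(reference_words)
    return [i for i, words in enumerate(recognized)
            if reference_words <= words]


# Hypothetical recognition results for three sections.
sections = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
print(select_main_candidates(sections, {"a", "b"}))  # prints [0, 2]
```

With a single reference word the same test reduces to checking that the word was recognized in the section.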

As described above, the speech analysis method according to the embodiment is a method in which the computer (CPU) 10 executes the following steps.
Step (1): Divide the voice data into a plurality of sections (1) to (4).
Step (2): Extract a plurality of words from the texts (A) to (C).
Step (3): Perform speech recognition on each of the plurality of sections (1) to (4) based on the extracted plurality of words.
Step (4): From the extracted plurality of words, select one or more words having at least a certain level of ease or certainty of speech recognition as a reference word.
Step (5): From the plurality of sections, select the sections in which the speech of the reference word is recognized.
Step (6): From the selected sections, select one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section (main section) corresponding to the texts (A) to (C).
Step (7): Starting from a section adjacent to the first section, detect the range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and select the sections within that range as second sections (sub-sections) corresponding to the texts (A) to (C).
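Steps (5) to (7) above can be sketched end to end in Python, assuming that steps (1) to (4) have already produced the per-section recognition results and the reference words. This is a hedged sketch: the function name `align_text`, the set-based data model, and the tie-breaking rule (the earliest section wins) are assumptions, and choosing the candidate with the most other-word hits is only one of the variants the document describes.

```python
def align_text(recognized, words, reference_words):
    """Return (main_section_index, sub_section_indices) for one text.

    recognized      -- list of sets of words recognized per section
    words           -- words extracted from the text (step (2))
    reference_words -- reference words chosen in step (4)
    """
    refs = set(reference_words)
    others = set(words) - refs

    # Step (5): sections in which every reference word was recognized.
    candidates = [i for i, r in enumerate(recognized) if refs <= r]
    if not candidates:
        return None, []

    # Step (6): among the candidates, take the one with the most
    # other-word hits; ties go to the earliest section (assumption).
    hits, main = max(((len(recognized[i] & others), i) for i in candidates),
                     key=lambda t: (t[0], -t[1]))
    if hits == 0:
        return None, []

    # Step (7): extend outward while other words keep being recognized.
    subs = []
    for step in (-1, 1):
        i = main + step
        while 0 <= i < len(recognized) and recognized[i] & others:
            subs.append(i)
            i += step
    return main, sorted(subs)


# Hypothetical example with four sections and one reference word.
sections = [{"summer", "camp"}, {"tennis", "summer"}, {"budget"}, {"tennis"}]
print(align_text(sections, ["tennis", "summer", "camp"], ["tennis"]))
# prints (1, [0])
```

In this example, sections 1 and 3 both contain the reference word, but only section 1 also contains another word of the text, so it becomes the main section and the adjacent section 0 becomes a sub-section.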

With this configuration, a reference word that is easy to recognize, or whose recognition result is highly reliable, is extracted from the text. Among the sections (1) to (4) of the voice data in which the speech of the reference word is recognized, the first section, in which the speech of other words extracted from the text is also recognized, is associated with the texts (A) to (C). In addition, from the sections of the voice data contiguous with the first section, the range of second sections in which the speech of other words extracted from the text is recognized is detected and associated with the text.

Therefore, one or more highly relevant sections of the voice data can be associated with the text by using the reference word and the other words that are common to the voice data and the text. According to the speech analysis method of the embodiment, the voice data can thus be accurately associated with the text.

The speech analysis program according to the embodiment is a program that causes the computer (CPU) 10 to execute the following processes.
Process (1): Divide the voice data into a plurality of sections (1) to (4).
Process (2): Extract a plurality of words from the texts (A) to (C).
Process (3): Perform speech recognition on each of the plurality of sections (1) to (4) based on the extracted plurality of words.
Process (4): From the extracted plurality of words, select one or more words having at least a certain level of ease or certainty of speech recognition as a reference word.
Process (5): From the plurality of sections, select the sections in which the speech of the reference word is recognized.
Process (6): From the selected sections, select one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section (main section) corresponding to the texts (A) to (C).
Process (7): Starting from a section adjacent to the first section, detect the range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and select the sections within that range as second sections (sub-sections) corresponding to the texts (A) to (C).

Since the speech analysis program according to the embodiment includes the same configuration as the speech analysis method described above, it provides the same operation and effects as those described above.

The speech analysis apparatus 1 according to the embodiment includes a dividing unit (voice data dividing unit) 105, an extraction unit (word extraction unit) 102, and a speech recognition processing unit 104. The speech analysis apparatus 1 further includes a first selection unit (reference word selection unit) 103, a second selection unit (main section candidate selection unit) 106, a third selection unit (main section selection unit) 107, and a fourth selection unit (sub-section selection unit) 108.

The dividing unit 105 divides the voice data into a plurality of sections (1) to (4). The extraction unit 102 extracts a plurality of words from the texts (A) to (C). The speech recognition processing unit 104 performs speech recognition on each of the plurality of sections (1) to (4) based on the extracted plurality of words. The first selection unit 103 selects, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word.

The second selection unit 106 selects, from the plurality of sections, the sections in which the speech of the reference word is recognized. The third selection unit 107 selects, from the selected sections, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section (main section) corresponding to the texts (A) to (C). The fourth selection unit 108 detects, starting from a section adjacent to the first section, the range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selects the sections within that range as second sections (sub-sections) corresponding to the texts (A) to (C).

Since the speech analysis apparatus 1 according to the embodiment includes the same configuration as the speech analysis method described above, it provides the same operation and effects as those described above.

The processing functions described above can be realized by a computer. In that case, a program describing the processing contents of the functions that the processing apparatus should have is provided. The processing functions are realized on the computer by executing the program on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium (excluding carrier waves).

When the program is distributed, it is sold, for example, in the form of a portable recording medium such as a DVD (Digital Versatile Disc) or a CD-ROM (Compact Disc Read Only Memory) on which the program is recorded. The program can also be stored in a storage device of a server computer and transferred from the server computer to another computer via a network.

The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer, the computer can execute processing according to the received program.

The embodiment described above is a preferred example of carrying out the present invention. However, the present invention is not limited to it, and various modifications can be made without departing from the gist of the present invention.

The following supplementary notes are further disclosed with respect to the above description.
(Supplementary note 1) A speech analysis method characterized in that a computer executes the steps of:
dividing voice data into a plurality of sections;
extracting a plurality of words from text;
performing speech recognition on each of the plurality of sections based on the extracted plurality of words;
selecting, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
selecting, from the plurality of sections, a section in which the speech of the reference word is recognized;
selecting, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selecting a section within the range as a second section corresponding to the text.
(Supplementary note 2) The speech analysis method according to supplementary note 1, wherein, in the step of selecting the reference word, one or more words whose mora count is the largest or is at least a certain number are selected from the extracted plurality of words as the reference word.
(Supplementary note 3) The speech analysis method according to supplementary note 1, wherein, in the step of selecting the reference word, the one or more words whose score indicating word likelihood in speech recognition is the highest or is at least a certain value are selected as the reference word.
(Supplementary note 4) The speech analysis method according to any one of supplementary notes 1 to 3, wherein, in the step of selecting the first section, the section in which the speech of the most words, among the plurality of words, other than the reference word is recognized, or one or more sections in which the speech of at least a certain number of words other than the reference word is recognized, is selected from the selected section as the first section.
(Supplementary note 5) The speech analysis method according to any one of supplementary notes 1 to 4, wherein, in the step of selecting the second section, a range of one or more sections in which the speech of at least a certain number of words, among the plurality of words, other than the reference word is recognized is detected, starting from a section adjacent to the first section.
(Supplementary note 6) A speech analysis program characterized by causing a computer to execute processes of:
dividing voice data into a plurality of sections;
extracting a plurality of words from text;
performing speech recognition on each of the plurality of sections based on the extracted plurality of words;
selecting, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
selecting, from the plurality of sections, a section in which the speech of the reference word is recognized;
selecting, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selecting a section within the range as a second section corresponding to the text.
(Supplementary note 7) The speech analysis program according to supplementary note 6, wherein, in the process of selecting the reference word, one or more words whose mora count is the largest or is at least a certain number are selected from the extracted plurality of words as the reference word.
(Supplementary note 8) The speech analysis program according to supplementary note 6, wherein, in the step of selecting the reference word, the one or more words whose score indicating word likelihood in speech recognition is the highest or is at least a certain value are selected as the reference word.
(Supplementary note 9) The speech analysis program according to any one of supplementary notes 6 to 8, wherein, in the step of selecting the first section, the section in which the speech of the most words, among the plurality of words, other than the reference word is recognized, or one or more sections in which the speech of at least a certain number of words other than the reference word is recognized, is selected from the selected section as the first section.
(Supplementary note 10) The speech analysis program according to any one of supplementary notes 6 to 9, wherein, in the step of selecting the second section, a range of one or more sections in which the speech of at least a certain number of words, among the plurality of words, other than the reference word is recognized is detected, starting from a section adjacent to the first section.
(Supplementary note 11) A speech analysis apparatus characterized by comprising:
a dividing unit that divides voice data into a plurality of sections;
an extraction unit that extracts a plurality of words from text;
a speech recognition processing unit that performs speech recognition on each of the plurality of sections based on the extracted plurality of words;
a first selection unit that selects, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
a second selection unit that selects, from the plurality of sections, a section in which the speech of the reference word is recognized;
a third selection unit that selects, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
a fourth selection unit that detects, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selects a section within the range as a second section corresponding to the text.
(Supplementary note 12) The speech analysis apparatus according to supplementary note 11, wherein the first selection unit selects, from the extracted plurality of words, one or more words whose mora count is the largest or is at least a certain number as the reference word.
(Supplementary note 13) The speech analysis apparatus according to supplementary note 11, wherein the first selection unit selects, as the reference word, the one or more words whose score indicating word likelihood in speech recognition is the highest or is at least a certain value.
(Supplementary note 14) The speech analysis apparatus according to any one of supplementary notes 11 to 13, wherein the third selection unit selects, from the selected section, the section in which the speech of the most words, among the plurality of words, other than the reference word is recognized, or one or more sections in which the speech of at least a certain number of words other than the reference word is recognized, as the first section.
(Supplementary note 15) The speech analysis apparatus according to any one of supplementary notes 11 to 14, wherein the fourth selection unit detects, starting from a section adjacent to the first section, a range of one or more sections in which the speech of at least a certain number of words, among the plurality of words, other than the reference word is recognized.
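Supplementary notes 2, 7, and 12 select reference words by mora count. The following Python sketch shows one way this could look, assuming each word comes with a kana reading. The function names, the `min_mora` parameter, and the sample readings are hypothetical, and the counting rule (every kana is one mora except the small kana that merge with the preceding character) is a common approximation rather than something specified in this document.

```python
import unicodedata

# Small kana that combine with the preceding kana and add no mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_mora(reading):
    """Approximate mora count of a hiragana/katakana reading.

    The sokuon (ッ), the long-vowel mark (ー), and ン each count as one mora.
    """
    reading = unicodedata.normalize("NFC", reading)
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def select_reference_words(readings, min_mora=None):
    """Pick words with the largest mora count, or with at least
    min_mora morae when a threshold is given.

    readings -- dict mapping each word to its kana reading
    """
    counts = {word: count_mora(r) for word, r in readings.items()}
    if not counts:
        return []
    threshold = max(counts.values()) if min_mora is None else min_mora
    return [word for word, c in counts.items() if c >= threshold]


# Hypothetical readings for three words.
readings = {"夏季": "カキ", "合宿": "ガッシュク", "テニスサークル": "テニスサークル"}
print(count_mora("ガッシュク"))             # prints 4
print(select_reference_words(readings))      # prints ['テニスサークル']
print(select_reference_words(readings, 4))   # prints ['合宿', 'テニスサークル']
```

Longer words tend to be less easily confused with other words, which is presumably why mora count serves as a proxy for ease of recognition. Selecting by recognition score, as in supplementary notes 3, 8, and 13, would follow the same pattern with scores in place of mora counts.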

1 Speech analysis apparatus
10 CPU
101 Document data dividing unit
102 Word extraction unit (extraction unit)
103 Reference word selection unit (first selection unit)
104 Speech recognition processing unit
105 Voice data dividing unit (dividing unit)
106 Main section candidate selection unit (second selection unit)
107 Main section selection unit (third selection unit)
108 Sub-section selection unit (fourth selection unit)

Claims

A speech analysis method characterized in that a computer executes the steps of:
dividing voice data into a plurality of sections;
extracting a plurality of words from text;
performing speech recognition on each of the plurality of sections based on the extracted plurality of words;
selecting, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
selecting, from the plurality of sections, a section in which the speech of the reference word is recognized;
selecting, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selecting a section within the range as a second section corresponding to the text.

The speech analysis method according to claim 1, wherein, in the step of selecting the reference word, one or more words whose mora count is the largest or is at least a predetermined number are selected from the extracted plurality of words as the reference word.

The speech analysis method according to claim 1, wherein, in the step of selecting the reference word, the one or more words whose score indicating word likelihood in speech recognition is the highest or is at least a certain value are selected as the reference word.

The speech analysis method according to any one of claims 1 to 3, wherein, in the step of selecting the first section, the section in which the speech of the most words, among the plurality of words, other than the reference word is recognized, or one or more sections in which the speech of words other than the reference word is recognized, is selected from the selected section as the first section.

The speech analysis method according to claim 1, wherein, in the step of selecting the second section, a range of one or more sections in which the speech of at least a predetermined number of words other than the reference word is recognized is detected, starting from a section adjacent to the first section.

A speech analysis program characterized by causing a computer to execute processes of:
dividing voice data into a plurality of sections;
extracting a plurality of words from text;
performing speech recognition on each of the plurality of sections based on the extracted plurality of words;
selecting, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
selecting, from the plurality of sections, a section in which the speech of the reference word is recognized;
selecting, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
detecting, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selecting a section within the range as a second section corresponding to the text.

A speech analysis apparatus characterized by comprising:
a dividing unit that divides voice data into a plurality of sections;
an extraction unit that extracts a plurality of words from text;
a speech recognition processing unit that performs speech recognition on each of the plurality of sections based on the extracted plurality of words;
a first selection unit that selects, from the extracted plurality of words, one or more words having at least a certain level of ease or certainty of speech recognition as a reference word;
a second selection unit that selects, from the plurality of sections, a section in which the speech of the reference word is recognized;
a third selection unit that selects, from the selected section, one or more sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, as a first section corresponding to the text; and
a fourth selection unit that detects, starting from a section adjacent to the first section, a range of sections in which the speech of a word, among the plurality of words, other than the reference word is recognized, and selects a section within the range as a second section corresponding to the text.