JP7341260B2

JP7341260B2 - Method for semi-automatic extraction of refined speech data and generation of transcription data for speech recognition

Info

Publication number: JP7341260B2
Application number: JP2022006014A
Authority: JP
Inventors: バン、ジュンソン
Original assignee: Electronics and Telecommunications Research Institute ETRI
Current assignee: Electronics and Telecommunications Research Institute ETRI
Priority date: 2021-01-18
Filing date: 2022-01-18
Publication date: 2023-09-08
Anticipated expiration: 2042-01-18
Also published as: KR20220104432A; JP2022111106A

Description

The present invention relates to a method for semi-automatically extracting refined speech data and generating transcription data for speech recognition.

Training a speech recognizer model requires a large number of speech data files and the corresponding transcribed text, but manual transcription takes a great deal of time and money.

In the prior art, as in Republic of Korea Registered Patent Publication No. 10-2083938, there have been attempts to automate parts of the speech data transcription process. However, generating a refined corpus this way still takes considerable effort and time, and the model is retrained with transcription errors still included.

The present invention was proposed to solve the above problems. Its object is to provide a method that secures, semi-automatically, at low cost and in a short time, the refined speech data and corresponding transcription data needed for model training, in order to improve the recognition rate of a speech recognizer.

A method for semi-automatic extraction of refined speech data and generation of transcription data for speech recognition according to an embodiment of the present invention includes the steps of slicing speech data and constructing a refined corpus, training a speech recognizer model using the refined corpus, and semi-automatically extracting refined speech data and generating transcription data.

According to the present invention, the semi-automatic refined speech data extraction algorithm reduces the working time and cost of generating a refined corpus.

In addition, because the speech recognition rate improves, subtitles can be generated almost automatically, which reduces the cost of producing subtitles for videos.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 shows the model training process of a speech recognizer using speech data and transcribed text according to the prior art.
FIG. 2 shows the process of speech data slicing and refined corpus construction according to an embodiment of the present invention.
FIG. 3 shows the semi-automatic refined speech data/transcribed text extractor according to an embodiment of the present invention.
FIG. 4 shows the structure for training the STT model of a speech recognizer with data obtained by the semi-automatic refined speech data/transcribed text extractor according to an embodiment of the present invention.
FIG. 5 summarizes the recognition rate and the amount of speech data that can be extracted automatically when the reference value is set to 50%, 60%, and 70% for 2200 hours of original raw speech data.
FIG. 6 shows an example of extracting and using speech data from a video with subtitles according to an embodiment of the present invention.
FIG. 7 shows the audio slicing process using subtitles, the seed model generation process, and the development process of the semi-automatic refined speech data extraction algorithm using the seed model according to an embodiment of the present invention.

The above and other objects, advantages, and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and can be implemented in various different forms. The following embodiments are provided merely to make the purpose, configuration, and effects of the invention readily understood by those of ordinary skill in the technical field to which the invention pertains, and the scope of the invention is defined by the claims.

Meanwhile, the terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the text specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Below, to aid the understanding of those skilled in the art, the background against which the present invention was proposed is described first, followed by embodiments of the present invention.

Developing a speech recognizer (STT: Speech-To-Text) requires a very large amount of speech data and its transcribed text.

Because the speech recognizer (STT) model is trained and built from speech data and transcribed text, the performance of the recognizer depends heavily on the accuracy of the transcribed text.

FIG. 1 shows the model training process of a speech recognizer using speech data and transcribed text according to the prior art.

The recognition rate of a real-time speech recognizer is about 85% (and lower when the terms used in the service domain are uncommon), whereas stable service requires recognition performance of 95% or higher.

Training a machine learning or deep learning based speech recognizer requires a large number of speech data files and the corresponding transcribed text (transcription), but producing the transcribed text by stenographers, that is, manual transcription, takes a great deal of time and money.

Improving speech recognition performance requires refined ground-truth pairs (speech data, transcribed text).

Speech data is transcribed by people (stenographers), and this process introduces factors that degrade model training, particularly deep learning.

This is because the transcribed text produced by a stenographer contains information that is unsuitable for model training.

There are parts that are not in the original audio file but were added by the stenographer (insert), parts that are in the original audio file but were removed by the stenographer (delete), and parts that were changed (update). Using such text as training data means learning from ground-truth pairs (speech data, transcribed text) that contain errors, which lowers the speech recognition rate.

To create refined ground-truth pairs, reviewers listen to the original audio file during shorthand review and find and correct the incorrect parts of the stenographer's text, which takes a great deal of time and money.

In addition, stenographers differ in how they transcribe (input patterns such as number notation, non-verbal expressions, and overlapping speakers), so the transcribed text may not be in a form suitable for training a speech recognizer model, and checking all of it is difficult in practice.

According to research, obtaining 90% speech recognition performance requires training on a refined corpus of at least 2000 hours.

That is, at least 2000 hours of ground-truth pairs (speech data, transcription data) are needed. Producing 1 hour of refined corpus typically takes about 3 hours of work on average and costs 100,000 won or more, and inspecting the transcribed text adds further time and cost.
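
To make the scale of the manual approach concrete, the figures quoted above can be multiplied out. This is only a lower-bound sketch: it uses the minimum corpus size and per-hour cost stated in the text and ignores the additional inspection cost, which the text does not quantify. The variable names are illustrative.

```python
# Lower-bound estimate for building a refined corpus entirely by hand.
CORPUS_HOURS = 2000               # minimum refined corpus size cited in the text
WORK_HOURS_PER_CORPUS_HOUR = 3    # average effort per corpus hour cited in the text
WON_PER_CORPUS_HOUR = 100_000     # minimum cost per corpus hour cited in the text

total_work_hours = CORPUS_HOURS * WORK_HOURS_PER_CORPUS_HOUR
total_cost_won = CORPUS_HOURS * WON_PER_CORPUS_HOUR

print(total_work_hours)  # 6000 hours of work
print(total_cost_won)    # 200000000 won, before inspection costs
```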

In the prior art, Republic of Korea Registered Patent Publication No. 10-2083938 uses an approach that automatically transcribes the speech data to be transcribed, based on a base speech recognition model stored in memory.

That is, it proposes generating a secondary speech recognition model by performing primary training of the speech recognition model based on a comparison between the primary automatic transcription data and the speech data.

However, this approach still requires a great deal of effort and time to create a refined corpus, because someone must listen to the original audio file directly and find and correct the parts where the stenographer's text is wrong. It also carries the risk that the model is retrained with transcription errors still included.

The present invention was proposed to solve the problems described above. It introduces automation into the parts of the transcription process that can be automated, so that refined speech data and its transcribed text, and thus a refined corpus, can be extracted at low cost and in a short time.

According to an embodiment of the present invention, refined speech data is selected and extracted semi-automatically (only for the parts that can be automated) from the speech data needed to develop a speech recognition system, and its transcription data is generated.

Assume that speech data exists together with text transcribed in advance, or together with text transcribed by a stenographer. The text obtained by decoding the speech data with a speech recognizer (STT) is compared with the pre-transcribed or manually transcribed text using a string matching algorithm. Based on this comparison, refined speech data is classified, the corresponding transcribed text data is determined, and the result is used to train the speech recognizer (STT) model.

According to an embodiment of the present invention, using a refined corpus makes it possible to improve the recognition rate of the speech recognizer (STT). To this end, the refined speech data and transcription data needed for model training are secured semi-automatically, at low cost and in a short time.

FIG. 2 shows the process of speech data slicing and refined corpus construction according to an embodiment of the present invention.

Because the recognition rate of the speech recognizer (STT) can drop as the running time of a speech data file grows, a preprocessing step divides the original speech data file into segments of t seconds or less.

Here, the time t is preferably set to around 30 seconds.
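
The slicing constraint can be sketched as a function that computes segment boundaries of at most t seconds. The text does not say how cut points are chosen (for example, at pauses), so fixed-length cuts are an assumption made here for illustration, and `slice_boundaries` is a hypothetical name.

```python
def slice_boundaries(total_seconds: float, t: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering the file, each at most t seconds long.

    Assumption: fixed-length cuts. A real pipeline would probably prefer
    cutting at silences so that words are not split.
    """
    if total_seconds < 0 or t <= 0:
        raise ValueError("total_seconds must be >= 0 and t must be > 0")
    total = float(total_seconds)
    bounds = []
    start = 0.0
    while start < total:
        end = min(start + t, total)  # the last segment may be shorter than t
        bounds.append((start, end))
        start = end
    return bounds


print(slice_boundaries(75))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```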

In the step of training the speech recognizer model using a manually refined corpus according to an embodiment of the present invention, manually refined corpus data is created from speech data and transcribed text.

Given that a refined corpus of at least 2000 hours must be trained on to obtain 90% speech recognition performance, the speech recognizer (STT) model is trained using about 250 hours of corpus data.

The step of semi-automatically extracting refined speech data and generating transcription data according to an embodiment of the present invention is performed by the semi-automatic refined speech data/transcribed text extractor 310 shown in FIG. 3.

With the STT model of the speech recognizer 311 trained through the process described above, speech data from which refined speech data is to be extracted (audio file U_i) is input to the speech recognizer 311 to obtain decoded transcribed text (decoded string W_i).

The text transcribed from that speech data (audio file U_i) by a stenographer (transcribed string V_i) and the decoded transcribed text (decoded string W_i) are input to the refined speech/string extractor 312, which extracts refined speech data and strings (transcribed text) using a string matching algorithm.

The string matching algorithm according to an embodiment of the present invention is described below.

Let the transcribed string V_i be A and the decoded string W_i be B. Only when the similarity S(A, B) of the two strings is greater than or equal to the reference value (threshold) E does the string matching algorithm send an identifier indicating refined data to the speech data classifier 314 and the text data classifier 315. The text data classifier 315 also receives R, which is whichever of A and B is suitable, or a corrected string.

Step 1
When the transcribed string A and the decoded string B are input to the string matching discriminator, it checks whether the similarity value of the two strings is at or above the reference value, and string matching proceeds only for strings whose similarity value is at or above the reference value.
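
The Step 1 gate can be sketched as follows. The text does not define how S(A, B) is computed, so this sketch substitutes the ratio from Python's standard `difflib.SequenceMatcher`, scaled to a percentage. That choice, the function names, and the default threshold of 60 are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Stand-in for S(A, B): difflib's similarity ratio as a percentage.

    Assumption: the patent text does not specify the similarity measure.
    """
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def passes_gate(a: str, b: str, threshold: float = 60.0) -> bool:
    """Step 1: only pairs at or above the reference value go on to matching."""
    return similarity_percent(a, b) >= threshold


print(similarity_percent("abcd", "abce"))  # 75.0
print(passes_gate("abcd", "abce"))         # True
print(passes_gate("abc", "xyz"))           # False
```

Pairs that fail the gate are simply left out of the automatically extracted set; they would still need manual review if they are to be used.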

Step 2-1
The transcribed string A and the decoded string B are separated into word (phrase) units, and a string B^* is generated by removing the spaces from B.

For example, the strings are generated as shown in Table 1 below.

The reason for generating B^* is to make it easier to search B for the words in A.
It is also possible to generate a string A^* by removing the spaces from A and apply the same method.

According to an embodiment of the present invention, a variation is possible in which the unit is changed, for example from words to syllables.
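
A minimal sketch of Step 2-1, assuming words are separated by whitespace. The function name `prepare` is hypothetical, and the English sample strings are stand-ins because the contents of Table 1 are not reproduced here.

```python
def prepare(a: str, b: str) -> tuple[list[str], str]:
    """Split A into words and build B^* by removing all whitespace from B."""
    words_a = a.split()          # word (phrase) units of the transcribed string
    b_star = "".join(b.split())  # decoded string with spaces removed
    return words_a, b_star


words_a, b_star = prepare("start the meeting", "start  the meting")
print(words_a)  # ['start', 'the', 'meeting']
print(b_star)   # startthemeting
```

Removing the spaces from B means a word from A can still be found even when the recognizer placed word boundaries differently from the stenographer.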

Step 2-2
The string B^* is searched for the same word as the first word of the transcribed string A.

Step 2-2-1
If a matching word is found, the common word portion of A and B^* is replaced with a specific character (for example, the tilde (~)).

A character other than the tilde (~) may be used, but a character that can appear in the strings must not be used as the specific character.

If there is no matching word, then for Korean, where a word consists of a stem and a particle, the first word of A is shortened one character at a time from its end while B^* is searched for a matching (identical) word.

Because a match that has been reduced to a single character is meaningless, this process continues only down to two characters.

According to an embodiment of the present invention, the same method can be extended to languages that share the linguistic characteristics of Korean (such as word order).

The above embodiment is explained using Tables 2 and 3 as examples.

In this example, the word "shijaku" matches in strings A and B, so it is replaced with "~~".
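
Step 2-2-1 can be sketched as a search that trims the word from its end until a match of at least two characters is found, then masks the matched part on both sides with one tilde per character. The function name `mask_word` and the English sample strings are assumptions; the original examples are Korean and live in Tables 2 and 3.

```python
MASK = "~"  # must be a character that cannot occur in either string


def mask_word(word: str, b_star: str, min_len: int = 2) -> tuple[str, str, int]:
    """Return (masked word, masked B^*, matched length); length 0 means no match."""
    # Try the whole word first, then drop one character at a time from the end,
    # stopping at min_len because a one-character match is treated as meaningless.
    for n in range(len(word), min_len - 1, -1):
        prefix = word[:n]
        pos = b_star.find(prefix)
        if pos != -1:
            masked_word = MASK * n + word[n:]
            masked_b = b_star[:pos] + MASK * n + b_star[pos + n:]
            return masked_word, masked_b, n
    return word, b_star, 0


print(mask_word("meetings", "startthemeeting"))
# ('~~~~~~~s', 'startthe~~~~~~~', 7)
```

Trimming from the end reflects the stem-plus-particle structure described above: the stem at the front of the word is kept while the trailing particle is allowed to differ.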

Step 2-2-2
If the match found in the preceding step is one character or less, a similarity value (word similarity value) is calculated between the first word of A and the phrases of B^*. If the similarity value is at or above the reference value, the process proceeds to step 2-3; if it is below the reference value, step 2-4 is performed.

The "shi" portion matches in strings A and B, but this is a one-character match.

When the reference value is 60, the similarity value for the A and B portions is 6/7 (85% or more), which meets it, so the process proceeds to step 2-3.

Step 2-3
The first word of A is extended one character at a time from the front, and matching is performed to check whether the same characters occur in the word (called b) whose similarity value calculated in step 2-2-1 is higher than the reference value.

To find the group with the largest number of identical characters, character matching against b is performed consecutively.

Because a single character is meaningless, a match is accepted only when two or more consecutive characters match.

The above embodiment is explained using Tables 6 and 7 as examples.

If matching succeeds, the common word portion of A and B is replaced with tildes (~).
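
The goal of Step 2-3, finding the largest run of consecutive identical characters between the word from A and the candidate word b, can be approximated with a longest-common-substring search. This sketch uses `difflib.SequenceMatcher.find_longest_match` from the Python standard library rather than the character-by-character growth the text describes, so it is a substitute technique and not the procedure itself. The function name and sample words are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(word: str, candidate: str, min_len: int = 2) -> str:
    """Return the longest run of consecutive characters shared by both words.

    Runs shorter than min_len are rejected, since a single matching
    character is treated as meaningless.
    """
    matcher = SequenceMatcher(None, word, candidate, autojunk=False)
    m = matcher.find_longest_match(0, len(word), 0, len(candidate))
    return word[m.a:m.a + m.size] if m.size >= min_len else ""


print(longest_shared_run("meeting", "meting"))  # eting
print(longest_shared_run("abc", "cxy"))         # (empty string: only 1 char matches)
```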

Step 2-4
Steps 2-1 to 2-3 described above are repeated for the words that follow the first word of A.

Step 3
The final result is produced through the process described below, based on the number of word groups remaining in the transcribed string A created by the stenographer and in the decoded string B generated by the speech recognizer.

Step 3-1
If the transcribed string A and the decoded string B have the same number of remaining word groups, the transcribed string is selected.

Because the transcribed string is created manually by a stenographer, its transcription accuracy with respect to the speech data file is generally 95% or higher; the remaining 5% consists of parts arbitrarily inserted, deleted, or changed by the stenographer, typos, and the like.

A decoded string, by contrast, depends on the performance of the speech recognizer and reaches about 90% in a low-noise environment.

Depending on the purpose, the other sentence or a slightly modified sentence may be selected instead.

The above embodiment is explained using Tables 8 and 9 as examples.

Since the number of remaining word groups is the same, the text of the transcribed string ("Chigumbut che 356 fe kyongchalchonmunfahensa kefesigur kohenhagessumnida") is selected.

Since the number of remaining word groups is the same, the text of the transcribed string ("Cheonjue Tara 1 Chorman Choechanghayo Chushigi Paramnida") is selected.

Step 3-2
If word groups remain in only one of the transcribed string A and the decoded string B, the sentence with the remaining groups is selected.

This is because, in the transcribed string, the stenographer transcribes only once the parts that the speaker repeated or said unclearly.

In the decoded result, on the other hand, parts of the audio file may fail to be decoded.

The above embodiment is explained using Tables 10 and 11 as examples.

Since text remains only in the decoding result, the decoded text ("Tayanhan kiob mi kinyomsaobul jumbi hago issumnida") is selected.

Since text remains only in the transcribed string, the transcription text ("Amchorok y andero uigyolhe chushigil paramyo") is selected.

Step 3-3
If a word group remains only in the decoded string B, and it is at the end of the sentence, the transcribed string A is selected unconditionally.

This is because, when there is some noise at the end of the original audio file, the noise is reflected as text in the decoding result.

The above embodiment is explained using Table 12 as an example.

Since text remains only at the end of the decoding result, the text of the transcribed string ("Uri 23 Dae Kyongcharwiwonfenun Newe Hageku Yolobunwi") is selected.

Step 3-4
If the transcribed string A and the decoded string B have different numbers of remaining word groups, the one with more groups is selected.

This is because the one with more groups was created reflecting the information in the speech data most fully.

The above embodiment is explained using Table 13 as an example.

After matching, the transcription has more remaining groups than the decoding result, so the transcription text ("Jeonmwiwonhoewi Kimhanpyo Wiwon Naoshoso 7 Gongae Taehayo Simsabogo Mi Chaeansolmyeon Hae Chushigi Paramnida") is selected.

The above embodiment is explained using Table 14 as an example.

After matching, the decoding result has more remaining groups than the transcription, so the transcription text ("Creso Ohiryo Chayuhwa Jeongdol 89 % Iha Jeongdoro Nachu Myeongseo Lado Chogi Tagyeol Hessmyeon Wanghwa Haneun Gosi Yogi Koitsun Dol Ipchan Imnida") is selected.

Step 3-5
If word groups remain in both the transcribed string A and the decoded string B, and only at the end of each, the one with more characters is selected.

This is because, when word groups remain in both the transcribed string A and the decoded string B, the leftover text is not caused by noise, so the string with more characters is the one that best reflects the information in the speech data.

However, if the number of characters is the same, the transcription created by the stenographer is selected.

The above embodiment is explained using Table 15 as an example.

After matching, the transcription has more remaining characters than the decoding result, so the transcription text ("Jugahago songhaesajeongsawi wim chohaneul ganghwahayossumnida") is selected.

The above embodiment is explained using Table 16 as an example.

After matching, the transcription has the same number of remaining characters as the decoding result, so the transcription text ("Chinanghae Uri Geunminun Taehamminguk Yeoksawi") is selected.
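
The selection rules of Steps 3-1 to 3-5 can be gathered into one decision function. The text does not state which rule takes precedence when several could apply, so the order used here (3-3, then 3-5, then 3-2, 3-1, 3-4) is an assumption, as are the `Residue` type, its fields, and the function name `choose`.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    groups: list[str]          # word groups left unmasked after Step 2
    only_at_end: bool = False  # True if every leftover group sits at the sentence end


def choose(a: Residue, b: Residue) -> str:
    """Return "A" (transcribed string) or "B" (decoded string)."""
    na, nb = len(a.groups), len(b.groups)
    # Step 3-3: leftovers only in B and only at the end are treated as noise.
    if na == 0 and nb > 0 and b.only_at_end:
        return "A"
    # Step 3-5: both have leftovers only at the end; more leftover characters
    # wins, and a tie goes to the stenographer's transcription.
    if na > 0 and nb > 0 and a.only_at_end and b.only_at_end:
        chars_a = sum(len(g) for g in a.groups)
        chars_b = sum(len(g) for g in b.groups)
        return "B" if chars_b > chars_a else "A"
    # Step 3-2: leftovers on one side only; take that side.
    if na == 0 and nb > 0:
        return "B"
    if nb == 0 and na > 0:
        return "A"
    # Step 3-1: same number of leftover groups; take the transcription.
    if na == nb:
        return "A"
    # Step 3-4: otherwise the side with more leftover groups wins.
    return "A" if na > nb else "B"


print(choose(Residue([]), Residue(["noise"], only_at_end=True)))  # A
print(choose(Residue([]), Residue(["repeated"])))                 # B
```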

以下、図４を参照して、半自動精製－音声データ／転写テキスト抽出機により得られたデータで音声認識機のＳＴＴモデルを学習させる構造について説明する。 Hereinafter, with reference to FIG. 4, a structure for training the STT model of the speech recognizer using data obtained by the semi-automatic refinement-speech data/transcription text extraction machine will be described.

結果的に、文字列マッチングの結果物として得られた「精製－音声データとその転写されたテキスト」は、ＳＴＴのモデル学習に活用されて、音声認識機の性能を向上させる。 As a result, the "refined speech data and its transcribed text" obtained as a result of character string matching are utilized for STT model learning to improve the performance of the speech recognizer.

文字列マッチングアルゴリズムにより、２つの文章の類似値が基準値を６０％とした場合、全体原本音声ファイルの約７０％を自動化により抽出できる（この際の音声認識率は９０％以上である）。 Using a character string matching algorithm, if the similarity value between two sentences is set to a standard value of 60%, approximately 70% of the entire original audio file can be automatically extracted (the speech recognition rate in this case is 90% or more).

図５は、２２００時間の原本ｒａｗ音声データに対して基準値を５０％、６０％、７０％とした時の認識率と自動化により抽出できる音声データの量をまとめたものである。 FIG. 5 summarizes the recognition rate and the amount of audio data that can be extracted by automation when the reference values are set to 50%, 60%, and 70% for 2200 hours of original raw audio data.

基準値を６０％とした時、自動化抽出サイズに対する認識率が最も良くなることが、実験を通して分かった。 Through experiments, it was found that the recognition rate for the automated extraction size was the best when the reference value was set to 60%.

他のサービスドメインでは別途に実験を進めて、当該ドメインでの一般的な基準値を推定して使用できるはずである。 It should be possible to carry out separate experiments in other service domains and estimate and use general reference values for those domains.

本発明の実施例によれば、字幕のある動画から音声データを抽出して活用可能であり、字幕のある動画にも適用して活用可能である。 According to the embodiments of the present invention, it is possible to extract and utilize audio data from videos with subtitles, and it is also possible to apply and utilize video data with subtitles.

図６を参照すれば、字幕のある動画の場合、音声転写を別途に再び行わず、一部の動画区間を抜粹し、これを精製して活用して音声認識機の性能を高めることが可能である。 Referring to Figure 6, in the case of a video with subtitles, it is possible to extract some video sections and refine and utilize them to improve the performance of the speech recognizer, without separately performing audio transcription again. It is possible.

本発明の実施例によれば、動画（音声データを含む）字幕生成器に活用することが可能である。 According to embodiments of the present invention, it is possible to utilize it as a video (including audio data) subtitle generator.

動画とテキストが用意されている場合であっても、学習のためにテキストを１０秒以下の単位に細く分割する（ｄｕｒａｔｉｏｎ）作業およびこれを動画と同期化させる作業が必要になり、通常最初からテキストがないと考えてｔｒａｎｓｃｒｉｐｔｉｏｎ作業を行う。 Even if a video and text are prepared, it is necessary to divide the text into sections of 10 seconds or less for learning purposes and synchronize this with the video, which is usually difficult from the beginning. Perform transcription work assuming that there is no text.

しかし、前述した本発明の実施例によれば、比較的簡単にｔｒａｎｓｃｒｉｐｔｉｏｎ作業を行うことが可能である。 However, according to the embodiment of the present invention described above, it is possible to perform transcription work relatively easily.
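The splitting of text into units of 10 seconds or less can be sketched as follows. This is a minimal illustration rather than the procedure of the embodiment: it assumes word-level timestamps are already available, and the `Word` tuple format and the function names are hypothetical.

```python
from typing import List, Tuple

# (text, start_sec, end_sec); an assumed input format, not taken from the source
Word = Tuple[str, float, float]


def _close(group: List[Word]) -> Tuple[float, float, str]:
    """Turn a group of words into (start_sec, end_sec, text)."""
    return (group[0][1], group[-1][2], " ".join(w[0] for w in group))


def split_by_duration(words: List[Word], max_sec: float = 10.0) -> List[Tuple[float, float, str]]:
    """Greedily group consecutive words into segments no longer than max_sec."""
    segments = []
    current: List[Word] = []
    for word in words:
        # Close the current segment if adding this word would exceed the limit.
        if current and word[2] - current[0][1] > max_sec:
            segments.append(_close(current))
            current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments
```

Each returned tuple carries the start and end time of a segment, which is what a later synchronization step would need. A single word longer than `max_sec` still forms its own segment in this sketch.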

速記者によって別途に作られたテキストは、学習させるには適しない情報が入っている。 Texts prepared separately by stenographers contain information that is not suitable for learning.

原本音声ファイルにはない添加された部分があり（ｉｎｓｅｒｔ）、同じく音声ファイルにはあるものの速記者によって除去された部分（ｄｅｌｅｔｅ）、変更された部分（ｕｐｄａｔｅ）があるので、学習データとして用いる場合、質の悪い学習データを用いることとなり、深刻な認識率の低下をもたらす。 The text contains parts that were added and are not in the original audio file (insert), parts that are in the audio file but were removed by the stenographer (delete), and parts that were changed (update). Using it as training data therefore means training on poor-quality data, which seriously lowers the recognition rate.
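The three kinds of discrepancy (insert, delete, update) can be made concrete with a word-level diff. The sketch below is an illustration only: it uses Python's `difflib.SequenceMatcher` as a stand-in for whatever comparison is actually used, and the function name `edit_operations` and the idea of having a list of actually spoken words are assumptions.

```python
from difflib import SequenceMatcher


def edit_operations(spoken_words, transcript_words):
    """List the non-equal edits that turn the spoken words into the transcript."""
    matcher = SequenceMatcher(None, spoken_words, transcript_words, autojunk=False)
    # difflib calls a changed span "replace"; the text above calls it "update".
    label = {"insert": "insert", "delete": "delete", "replace": "update"}
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((label[tag], spoken_words[i1:i2], transcript_words[j1:j2]))
    return ops


if __name__ == "__main__":
    spoken = "um we will uh meet at three".split()
    transcript = "we will meet at 3 pm".split()
    for op in edit_operations(spoken, transcript):
        print(op)
```

In practice the spoken side is not known in advance, which is why the method compares the stenographer's text against the output of a decoder instead.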

直接原本音声ファイルを聴きながら、速記者によって作成されたテキストと間違った部分を探して修正する、精製されたコーパスを作るためには多くの努力と時間が必要である。 Creating a refined corpus by listening to the original audio files directly and finding and correcting the parts where the stenographer's text is wrong takes a lot of effort and time.

図７を参照すれば、本発明の実施例による半自動精製－音声データ／転写テキストの抽出により精製されたコーパスデータを速い時期内に大量に構築することが可能である。 Referring to FIG. 7, the semi-automatic extraction of refined speech data and transcribed text according to an embodiment of the present invention makes it possible to build a large amount of refined corpus data in a short time.

第１ステップで、動画に合ったテキスト抽出および分割作業により字幕ファイルを生成する。 In the first step, a subtitle file is generated by text extraction and segmentation operations that match the video.

動画を用いた時間データ抽出および文章別発言時間による文章時間を用いて動画をスライスした後、ｗａｖ音声ファイルを抽出する。 Time data is extracted from the video, the video is sliced using the sentence times given by the utterance time of each sentence, and wav audio files are then extracted.

以後、短く分割されたｗａｖ音声ファイルとマッチングされる部分に対する字幕部分を自動抽出して生成する。 Thereafter, subtitles are automatically extracted and generated for portions that match the shortened WAV audio files.
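Slicing by sentence time can be sketched with Python's standard `wave` module. This is a minimal illustration under stated assumptions: the audio track is assumed to have already been extracted from the video into an uncompressed PCM WAV file (that extraction needs an external tool and is not shown), and the function name `slice_wav` is hypothetical.

```python
import wave


def slice_wav(src_path, dst_path, start_sec, end_sec):
    """Copy the [start_sec, end_sec) span of a PCM WAV file into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices and clamp them to the file length.
        start = min(int(start_sec * rate), total)
        end = min(int(end_sec * rate), total)
        src.setpos(start)
        frames = src.readframes(max(0, end - start))
        with wave.open(dst_path, "wb") as dst:
            # Keep the source format so the slice plays back unchanged.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
```

Calling `slice_wav` once per sentence, with that sentence's start and end time, yields one short wav file per subtitle entry.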

第２ステップでは、第１ステップで作ったｗａｖ音声ファイルとテキストを活用して、手動で精製されたコーパスデータを作る作業を行い、精製されたコーパスを活用してｓｅｅｄモデルを生成する。 In the second step, the wav audio file and text created in the first step are used to manually create refined corpus data, and the refined corpus is used to generate a seed model.

第３ステップで、ｓｅｅｄデータモデルから原本ｒａｗ音声ファイルをデコーディングする。 In the third step, the original raw audio files are decoded using the seed model.

転写テキストとデコーディングされたテキスト結果を通して半自動精製音声データ抽出アルゴリズムで一致率（類似率）が６０％以上の原本ｒａｗ音声ファイルのみを抽出する。 Only original raw audio files with a matching rate (similarity rate) of 60% or more are extracted using a semi-automatic refined audio data extraction algorithm based on the transcribed text and the decoded text results.
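The 60% filter in the third step can be sketched as follows. This is an illustration under assumptions, not the extraction algorithm of the embodiment: `difflib.SequenceMatcher.ratio()` stands in for the match rate, strings are compared with whitespace removed, and the `Utterance` record and function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, NamedTuple


class Utterance(NamedTuple):
    wav_path: str    # sliced raw audio file
    transcript: str  # stenographer's transcribed text
    decoded: str     # text decoded by the seed model


def match_rate(transcript: str, decoded: str) -> float:
    """Similarity in [0, 1], compared with whitespace removed (assumed metric)."""
    a = "".join(transcript.split())
    b = "".join(decoded.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def extract_matching(utterances: Iterable[Utterance], threshold: float = 0.6) -> List[Utterance]:
    """Keep only utterances whose match rate reaches the threshold."""
    return [u for u in utterances if match_rate(u.transcript, u.decoded) >= threshold]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    data = [
        Utterance("a.wav", "the meeting is now open", "the meeting is now open"),
        Utterance("b.wav", "we will vote on item three", "we will vote on item tree"),
        Utterance("c.wav", "thank you all", "completely different words here"),
    ]
    for u in extract_matching(data):
        print(u.wav_path)
```

Keeping `threshold` as a parameter makes it straightforward to rerun the filter with a different reference value for another service domain.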

従来技術によれば、音声データの転写過程は人（速記者）によって行われるが、この過程で速記者によって生成された転写テキストにはモデル学習に適しない情報が存在し、速記者の転写方式（入力パターン、例：数字表記、非言語的表現、話者音声重複など）が異なっていて、速記者による転写されたテキストが音声認識機のモデル学習に適した形態ではないことがある。 According to the prior art, audio data is transcribed by a person (a stenographer). The transcribed text produced this way contains information unsuitable for model training, and because stenographers differ in their transcription conventions (input patterns, e.g., notation of numbers, non-verbal expressions, overlapping speakers), the text may not be in a form suitable for training a speech recognizer's model.

音声データと転写テキストによって音声認識機（ＳＴＴ）のモデルが学習されて構築されるため、音声認識性能を向上させるために精製された解答紙（音声データ、転写テキスト）が必要である。 Because the model of a speech recognizer (STT) is trained and built from audio data and transcribed text, refined reference answers (pairs of audio data and transcribed text) are needed to improve speech recognition performance.

研究によれば、９０％の音声認識性能を得るには、最小２０００時間以上の精製されたコーパスを学習させてこそ可能になるが、現在精製されたコーパスを作るために手動で作業を進める場合、時間あたり１０万ウォン以上の費用が必要である。 According to research, 90% speech recognition performance is achievable only by training on at least 2,000 hours of refined corpus, and building a refined corpus manually currently costs 100,000 won or more per hour.

すると、１０００時間の精製されたコーパスを収集するために１億ウォン以上の費用が必要である。 Therefore, it would cost more than 100 million won to collect a 1,000-hour refined corpus.

本発明の実施例による半自動精製音声データ抽出アルゴリズムを用いれば、同一の認識率を有する１０００時間の精製されたコーパスを約４分の１の費用だけで速い時期内に作り上げることができる。 Using the semi-automatic refined speech data extraction algorithm according to embodiments of the present invention, a 1000-hour refined corpus with the same recognition rate can be created in a faster time frame at only about one-fourth the cost.
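The cost figures above can be checked with a few lines of arithmetic. The sketch uses only the numbers stated in the text; the per-hour cost is a lower bound ("100,000 won or more"), and "about one-fourth" is treated as exactly 0.25 for illustration.

```python
HOURS = 1_000                  # corpus size stated in the text
MANUAL_WON_PER_HOUR = 100_000  # lower bound on manual cost stated in the text
SEMI_AUTO_RATIO = 0.25         # "about one-fourth", taken as exact here

manual_cost = HOURS * MANUAL_WON_PER_HOUR
semi_auto_cost = manual_cost * SEMI_AUTO_RATIO

print(f"manual: at least {manual_cost:,} won")
print(f"semi-automatic: roughly {semi_auto_cost:,.0f} won")
```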

したがって、本発明の実施例による半自動精製音声データ抽出アルゴリズムを用いて精製されたコーパスを生成するための作業時間と費用を大きく低減可能な効果がある。 Therefore, the working time and cost for generating a refined corpus using the semi-automatic refined speech data extraction algorithm according to the embodiment of the present invention can be greatly reduced.

動画（音声データを含む）が提供される場合、すなわち、講演、会議などの場合、字幕まで提供される場合は希である。 When video (including audio data) is provided, that is, in the case of lectures, conferences, etc., it is rare that even subtitles are provided.

その理由は、リアルタイム音声認識機の認識率が８５％程度にとどまるからである。 The reason is that the recognition rate of real-time speech recognizers remains at about 85%.

リアルタイムに字幕を提供する場合は認識率９６％以上でなければならず、速記者の力を借りなければならないため、費用が多くかかる問題点がある。 Providing subtitles in real time requires a recognition rate of 96% or higher and requires the help of a stenographer, which poses the problem of high costs.

この状況で、一部の重要な会議などは会議録を残すために、後で（非リアルタイムに）別途にテキストを作る作業をする。 In this situation, in order to record the minutes of some important meetings, a separate text is created later (non-real time).

地方議会などで「再度見る」サービスを支援する際は字幕の入った動画を支援する場合が多いが、これは動画とすでに作っておいたテキストとを混合させる技術を活用する。 When local councils and similar bodies offer a "watch again" service, they often provide videos with subtitles, which relies on technology that combines the video with text that has already been prepared.

この時、動画の場面に合わせてテキストを同期化させる技術とともに音声認識技術が重要であるが、１つの動画に該当する字幕を手動で生成するためには多くの時間と高い費用がかかる。 At this time, voice recognition technology is important as well as technology to synchronize text with the video scene, but it takes a lot of time and high cost to manually generate subtitles for a single video.

本発明の実施例による精製された音声データとテキストで学習された音声認識機（ＳＴＴ）を用いる場合、音声認識率が改善されて自動でほとんどの字幕の生成が可能で、開発費用が節減可能な効果がある。 When a speech recognizer (STT) trained on the refined audio data and text according to an embodiment of the present invention is used, the speech recognition rate improves, most subtitles can be generated automatically, and development costs can be reduced.

一方、本発明の実施例による音声認識のための半自動精製－音声データ抽出および転写データ生成方法は、コンピュータシステムで実現されるか、または記録媒体に記録される。コンピュータシステムは、少なくとも１つ以上のプロセッサと、メモリと、ユーザ入力装置と、データ通信バスと、ユーザ出力装置と、ストレージとを含むことができる。前述したそれぞれの構成要素は、データ通信バスを介してデータ通信をする。 Meanwhile, the semi-automatic purification-audio data extraction and transcription data generation method for speech recognition according to an embodiment of the present invention may be implemented in a computer system or recorded on a recording medium. A computer system can include at least one processor, memory, user input devices, a data communication bus, user output devices, and storage. Each of the aforementioned components communicates data via a data communication bus.

コンピュータシステムは、ネットワークにカップリングされたネットワークインターフェースをさらに含むことができる。プロセッサは、中央処理装置（ｃｅｎｔｒａｌｐｒｏｃｅｓｓｉｎｇｕｎｉｔ（ＣＰＵ））であるか、あるいはメモリおよび／またはストレージに格納された命令語を処理する半導体装置であってもよい。 The computer system can further include a network interface coupled to the network. The processor may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in memory and/or storage.

メモリおよびストレージは、多様な形態の揮発性あるいは不揮発性記憶媒体を含むことができる。例えば、メモリは、ＲＯＭおよびＲＡＭを含むことができる。 Memory and storage may include various forms of volatile or non-volatile storage media. For example, memory can include ROM and RAM.

したがって、本発明の実施例による音声認識のための半自動精製－音声データ抽出および転写データ生成方法は、コンピュータで実行可能な方法で実現できる。本発明の実施例による音声認識のための半自動精製－音声データ抽出および転写データ生成方法がコンピュータ装置で行われる時、コンピュータで読取可能な命令語が本発明による音声認識のための半自動精製－音声データ抽出および転写データ生成方法を行うことができる。 Therefore, the semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to an embodiment of the present invention can be implemented as a computer-executable method. When the method is performed on a computer device, computer-readable instructions can carry out the semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to the present invention.

一方、上述した本発明による音声認識のための半自動精製－音声データ抽出および転写データ生成方法は、コンピュータで読込める記録媒体にコンピュータが読込めるコードとして実現されることが可能である。コンピュータが読取可能な記録媒体としては、コンピュータシステムによって解読できるデータが記憶されたすべての種類の記録媒体を含む。例えば、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、磁気テープ、磁気ディスク、フラッシュメモリ、光データ記憶装置などがあり得る。また、コンピュータで読取可能な記録媒体は、コンピュータ通信網で連結されたコンピュータシステムに分散して、分散方式で読込めるコードとして記憶され実行される。 Meanwhile, the above-described semi-automatic refining, speech data extraction and transcription data generation method for speech recognition according to the present invention can be implemented as a computer-readable code on a computer-readable recording medium. Computer-readable storage media include all types of storage media that store data that can be read by a computer system. For example, there may be a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic tape, a magnetic disk, a flash memory, an optical data storage device, etc. The computer-readable recording medium may also be stored and executed as readable code in a distributed manner over computer systems linked by a computer communication network.

Claims

A semi-automatic refined speech data extraction and transcription data generation method for speech recognition, performed by a computer, the method comprising:
(a) slicing audio data and constructing a refined corpus;
(b) performing model training for a speech recognizer using the refined corpus; and
(c) semi-automatically extracting refined speech data and generating transcription data,
wherein step (c) comprises:
performing string matching on a transcription string and a decoding string to find word groups with the maximum number of identical characters; and
selecting the transcription string or the decoding string in consideration of the number of word groups remaining in the transcription string and the decoding string.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein step (a) performs pre-processing that divides an original audio data file into segments of a preset duration or less.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein step (c) checks a similarity value between the transcription string and the decoding string and performs string matching only on strings whose similarity value is equal to or higher than a reference value.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 3, wherein step (c) separates the transcription string and the decoding string into units of words or phrases and generates, from the decoding string, a string with blank spaces removed.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 4, wherein step (c) searches the decoding string with blank spaces removed for a word identical to the first word of the transcription string and replaces the common word portion with a specific character.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 5, wherein step (c) calculates a similarity value between the first word of the transcription string and a phrase of the string obtained by removing blank spaces from the decoding string.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 6, wherein step (c) advances through the first word of the transcription string one character at a time from its beginning and performs matching to determine whether the same character exists in a word whose similarity value is higher than a reference value.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein step (c) finally selects the transcription string when the numbers of word groups remaining in the transcription string and in the decoding string are the same.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein, in step (c), when a word group remains in only one of the transcription string and the decoding string, the sentence of the remaining group is selected.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein step (c) selects the transcription string when word groups remain only in the decoding string.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein, in step (c), when the numbers of remaining word groups differ between the transcription string and the decoding string, the string with the larger number of groups is selected.

The semi-automatic refined speech data extraction and transcription data generation method for speech recognition according to claim 1, wherein, in step (c), when word groups remain only in the last parts of the transcription string and the decoding string, the string with the greater number of characters is selected.