JP2017187797A

JP2017187797A - Text generation device, method, and program

Info

Publication number: JP2017187797A
Application number: JP2017120758A
Authority: JP
Inventors: Taira Ashikawa; Osamu Nishiyama; Tomoo Ikeda; Akitsugu Ueno; Kota Nakata
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2017-10-12
Anticipated expiration: 2033-04-03
Also published as: JP6499228B2

Abstract

PROBLEM TO BE SOLVED: To provide a text generation device, a method, and a program capable of reducing the burden of transcription work. SOLUTION: A text generation device comprises a recognition section, a selection section, and a generation section. The recognition section recognizes acquired voice to obtain, for each recognition unit, a recognized character string and the reliability of that recognized character string. The selection section selects at least one of the recognized character strings to be used for a transcript, based on at least one of a parameter of transcription accuracy and a parameter of the workload required for transcription. The generation section generates the transcript by using the selected recognized character string. SELECTED DRAWING: Figure 2

Description

Embodiments of the present invention relate to a text generation device, a method, and a program.

Transcription work is, for example, the work of turning the content of speech into sentences (transcribing it into text) while listening to recorded voice data. To reduce the burden of this work, devices that support transcription by using a speech recognition system have conventionally been known.

JP 2007-108407 A

However, the conventional devices cannot provide the appropriate speech recognition result that the worker wants, and therefore do not reduce the burden of the transcription work.

The text generation device according to the embodiment includes a recognition unit, a selection unit, and a generation unit. The recognition unit recognizes acquired voice and obtains, for each recognition unit, a recognized character string and the reliability of that recognized character string. The selection unit selects at least one recognized character string to be used in a transcript, based on at least one of a parameter of transcription accuracy and a parameter of the work amount required for transcription. The generation unit generates the transcript using the selected recognized character string.

A diagram showing a usage example of the text generation device according to the first embodiment.
A diagram showing a functional configuration example of the text generation device according to the first embodiment.
A flowchart showing an example of basic processing at the time of text generation according to the first embodiment.
A diagram showing a data example of a speech recognition result according to the first embodiment.
A flowchart showing a processing example (part 1) at the time of recognized character string selection according to the first embodiment.
A diagram showing a setting example of the allowable value of transcription accuracy according to the first embodiment.
A diagram showing a data example (part 1) of a recognized character string selection result according to the first embodiment.
A flowchart showing a processing example (part 2) at the time of recognized character string selection according to the first embodiment.
A diagram showing a setting example of the allowable value of transcription work time according to the first embodiment.
A diagram showing a data example (part 2) of a recognized character string selection result according to the first embodiment.
A flowchart showing a processing example (part 3) at the time of recognized character string selection according to the first embodiment.
A diagram showing a setting example of the allowable value of transcription work cost according to the first embodiment.
A diagram showing a data example (part 3) of a recognized character string selection result according to the first embodiment.
A flowchart showing a processing example at the time of transcript generation according to the first embodiment.
A diagram showing a data format example of a transcript according to the first embodiment.
A diagram showing a display example of a transcript according to the first embodiment.
A flowchart showing a processing example at the time of character insertion position setting according to the first embodiment.
A flowchart showing a processing example at the time of voice position search according to the first embodiment.
A diagram showing a functional configuration example of the text generation device according to the second embodiment.
A flowchart showing an example of basic processing at the time of text generation according to the second embodiment.
A flowchart showing a processing example at the time of recognition result combination according to the second embodiment.
A diagram showing a functional configuration example of the text generation device according to the third embodiment.
A flowchart showing an example of basic processing at the time of text generation according to the third embodiment.
A diagram showing a data example of utterance section information according to the third embodiment.
A flowchart showing a processing example at the time of recognized character string selection according to the third embodiment.
A diagram showing a setting example of the allowable value of transcription accuracy according to the third embodiment.
A diagram showing a configuration example of the text generation device according to the embodiments.

Hereinafter, embodiments of a text generation device, a method, and a program will be described in detail with reference to the accompanying drawings.

[First Embodiment]
<Outline>
The function of the text generation device according to the present embodiment (hereinafter referred to as the "text generation function") will be described. The text generation device according to the present embodiment selects the recognized character strings to be used in a transcript based on the reliability of each recognized character string, calculated from the speech recognition result, and a parameter related to transcription accuracy. Alternatively, the device selects the recognized character strings to be used in the transcript based on the reliability of each recognized character string and a parameter related to the work amount required for transcription. The device then generates the transcript from the selected recognized character strings. This makes it possible to perform transcription work using an appropriate speech recognition result.

Some conventional devices, for example, provide an overview of the speech recognition results for voice data. Such a device obtains a priority for each speech recognition result based on the reliability and importance of the recognized word, and shapes the output information of the speech recognition result according to the priority. However, with the conventional device, the worker can adjust the output only by specifying the display target range. Consequently, the conventional device seldom outputs the appropriate speech recognition result that the worker wants in view of the transcription accuracy or the work amount required for transcription, and the burden of the transcription work on the worker is large. Thus, the conventional device does not reduce the burden of the transcription work on the worker.

Therefore, the text generation device according to the present embodiment adjusts the output of the speech recognition result according to the work conditions specified by the worker (the transcription accuracy or the work amount required for transcription). When the worker adds to or corrects the adjusted output, the device synchronizes the input characters with the voice by using the speech recognition result, so that the transcription work can proceed.

As a result, with the text generation device according to the present embodiment, an appropriate speech recognition result that matches work conditions such as the transcription accuracy and the work amount required for transcription can be used during the transcription work, and characters can easily be added to or corrected in the speech recognition result. The device can thereby reduce the burden of the transcription work on the worker.

The text generation device according to the present embodiment can provide, for example, the following service. FIG. 1 is a diagram illustrating a usage example of the text generation device according to the present embodiment. FIG. 1 shows an example in which the device is used for a service that recognizes the voices of a plurality of speakers, transcribes the content of each speaker's utterance into text, and attaches the name of the originating speaker to each text.

The functional configuration and operation of the text generation device according to the present embodiment are described below.

<<Configuration>>
FIG. 2 is a diagram illustrating a functional configuration example of the text generation device according to the present embodiment. As shown in FIG. 2, the text generation device 100 according to the present embodiment includes an acquisition unit 11, a recognition unit 12, a selection unit 13, a generation unit 14, a setting unit 15, a search unit 16, a reproduction unit 17, a recognition result holding unit 18, and the like.

The acquisition unit 11 accepts voice input through a predetermined input means and acquires the voice. The recognition unit 12 recognizes the voice acquired by the acquisition unit 11, calculates at least a recognized character string for each recognition unit and the reliability of that recognized character string, and stores the calculation result in the recognition result holding unit 18. A recognition unit here corresponds to, for example, a morpheme. The recognition result holding unit 18 corresponds to, for example, a predetermined storage area of a storage device included in the text generation device 100.

The selection unit 13 selects at least one recognized character string to be used in the transcript, based on various parameters related to the work conditions of the transcription work and the reliability of the recognized character strings stored in the recognition result holding unit 18. The values of the parameters related to the work conditions are specified, for example, by accepting an operation from the worker U via a UI (User Interface). The generation unit 14 generates the transcript using the recognized character strings selected by the selection unit 13. The setting unit 15 sets, for the part of the transcript corresponding to a recognized character string not selected by the selection unit 13, the position at which the worker U starts character input (hereinafter referred to as the "character insertion position"). The recognized character string that was not selected is designated, for example, by accepting an operation from the worker U via the UI.

When the worker U starts character input at the character insertion position set by the setting unit 15, the search unit 16 searches for the position in the voice corresponding to the input characters (hereinafter referred to as the "voice position"). The start of the search is instructed, for example, by accepting an operation from the worker U via the UI. The reproduction unit 17 reproduces the voice from the voice position found by the search.

The basic processing at the time of text generation executed by the text generation device 100 according to the present embodiment is described below.
<<Processing>>
FIG. 3 is a flowchart showing an example of basic processing at the time of text generation according to the present embodiment. As illustrated in FIG. 3, the acquisition unit 11 acquires voice (step S101). Next, the recognition unit 12 recognizes the voice acquired by the acquisition unit 11 and calculates a recognized character string for each recognition unit and the reliability of that recognized character string (step S102). The recognized character strings and their reliabilities are then stored in the recognition result holding unit 18.

Next, the selection unit 13 selects at least one recognized character string to be used in the transcript, based on various parameters related to the work conditions of the transcription work (work condition parameters) and the reliability of the recognized character strings stored in the recognition result holding unit 18 (step S103). At this time, the selection unit 13 selects the recognized character strings based on one of two combinations: the parameter related to transcription accuracy together with the reliability of the recognized character strings, or the parameter related to the work amount required for transcription together with the reliability of the recognized character strings. Next, the generation unit 14 generates the transcript using both the recognized character strings selected by the selection unit 13 and the recognized character strings not selected by the selection unit 13 (step S104).

Next, the setting unit 15 sets the character insertion position for the worker U, in accordance with the setting received from the worker U, in the part of the transcript corresponding to a recognized character string not selected by the selection unit 13 (step S105). Next, the search unit 16 searches for the voice position corresponding to the character insertion position set by the setting unit 15, based on the recognition result (step S106).

Next, the reproduction unit 17 reproduces the voice from the voice position found by the search unit 16, in accordance with the designation received from the worker U (step S107). The text generation device 100 then accepts character input (additions and corrections) from the worker U (step S108).

When the text generation device 100 according to the present embodiment receives an instruction to end transcription from the worker U (step S109: Yes), it ends the processing. Until the worker U gives the instruction to end transcription (step S109: No), the text generation device 100 repeats the processing of steps S106 to S108.

<Details>
The details of each of the functional units above are described below.

<<Details of each functional unit>>
(Acquisition unit 11)
The acquisition unit 11 acquires the voice that is to be transcribed into characters.

(Recognition unit 12)
The recognition unit 12 recognizes the voice acquired by the acquisition unit 11 and obtains, as the recognition result, at least a recognized character string for each recognition unit and the reliability of that recognized character string.

FIG. 4 is a diagram illustrating a data example of the speech recognition result D1 according to the present embodiment. FIG. 4 shows an example of the result obtained when the recognition unit 12 performs speech recognition on the utterance "Hello, this is Taro of ABC company." As shown, the recognition unit 12 obtains a speech recognition result D1 that includes, for example, a recognition ID, a recognized character string, and the reliability of the recognized character string. The recognition unit 12 stores the obtained speech recognition result D1 in the recognition result holding unit 18.
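As an illustration of the kind of record described for the speech recognition result D1, the following is a minimal sketch in Python. The field names, the segmentation of the utterance, and every reliability value are hypothetical; the actual contents of FIG. 4 are not reproduced in this text, and a 0 to 100 reliability scale is assumed only for the example.

```python
from dataclasses import dataclass


@dataclass
class RecognitionUnit:
    """One entry of a speech recognition result such as D1 (field names are assumptions)."""
    recognition_id: int  # recognition ID
    text: str            # recognized character string (e.g., one morpheme)
    reliability: float   # reliability of the recognized character string


# Hypothetical segmentation and reliabilities for the example utterance.
result_d1 = [
    RecognitionUnit(1, "Hello", 92.0),
    RecognitionUnit(2, "ABC", 41.0),
    RecognitionUnit(3, "company", 80.0),
    RecognitionUnit(4, "Taro", 67.0),
    RecognitionUnit(5, "is", 88.0),
]

# The recognition result holding unit 18 is modeled here as a plain dict keyed by ID.
recognition_result_store = {unit.recognition_id: unit for unit in result_d1}
```

A plain in-memory mapping stands in for the storage area; the source only says that the holding unit corresponds to a predetermined storage area of a storage device.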

(Selection unit 13)
The selection unit 13 selects at least one recognized character string to be used in the transcript based on one of two combinations: the parameter related to transcription accuracy together with the reliability of the recognized character strings, or the parameter related to the work amount required for transcription together with the reliability of the recognized character strings.

The transcription accuracy and the work amount are explained here. The transcription accuracy is a value indicating the degree of agreement between the transcribed character string and the character string that would result from transcribing the voice correctly (the correct character string). A larger value means that the transcribed character string agrees more closely with the correct character string, that is, the transcription is more accurate. The work amount required for transcription is the amount of work needed to turn the voice into characters, and corresponds to, for example, the time or cost of the transcription work.

The processing by which the selection unit 13 selects recognized character strings is described below. FIG. 5 is a flowchart showing a processing example (part 1) at the time of recognized character string selection according to the present embodiment. FIG. 5 shows a processing example in which the selection unit 13 uses an allowable value of transcription accuracy as the parameter related to transcription accuracy.

As illustrated in FIG. 5, the selection unit 13 first accepts the setting of the allowable value P of transcription accuracy from the worker U (step S201).

FIG. 6 is a diagram showing a setting example of the allowable value P of transcription accuracy according to the present embodiment. As shown in FIG. 6, the worker U sets the allowable value P of transcription accuracy via, for example, a slide-type UI (slide bar) on which one of N levels (N = 5 in the figure) can be designated. The selection unit 13 displays this UI on the screen and accepts the setting from the worker U.

Returning to FIG. 5, the selection unit 13 next sets the first recognized character string among the recognition results obtained by the recognition unit 12 (the recognition results stored in the recognition result holding unit 18) as the target character string w (step S202), and calculates the transcription accuracy wp of the target character string w from the reliability of the target character string w (step S203). When, for example, positive integer values from 1 to N are used as the transcription accuracy, the selection unit 13 calculates the transcription accuracy wp of the target character string w by (Equation 1) below.
Transcription accuracy wp = N × (reliability of target character string w / maximum value of reliability) ... (Equation 1)

Next, the selection unit 13 compares the calculated transcription accuracy wp of the target character string w with the allowable value P of transcription accuracy, and determines whether the transcription accuracy wp is equal to or greater than the allowable value P (step S204). If the selection unit 13 determines that the transcription accuracy wp is equal to or greater than the allowable value P (step S204: Yes), it selects the target character string w (step S205). If it determines that the transcription accuracy wp is less than the allowable value P (step S204: No), it does not select the target character string w.

Next, the selection unit 13 determines whether the recognition results obtained by the recognition unit 12 contain a next recognized character string (step S206). If the selection unit 13 determines that there is a next recognized character string (step S206: Yes), it sets that recognized character string as the target character string w (step S207) and repeats the processing of steps S203 to S206. If it determines that there is no next recognized character string (step S206: No), it ends the processing.
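The selection loop of steps S202 to S207 together with (Equation 1) can be sketched as follows. This is only an illustrative Python sketch: the function names and the sample reliabilities are hypothetical, a maximum reliability of 100 is assumed, and no rounding of wp to an integer level is applied because the text does not specify one.

```python
def transcription_accuracy(reliability, max_reliability, n_levels):
    """Equation 1: wp = N x (reliability of w / maximum value of reliability).

    Written as N * reliability / max so the sample values below stay exact.
    """
    return n_levels * reliability / max_reliability


def select_by_accuracy(units, allowable_p, max_reliability=100.0, n_levels=5):
    """Return (text, wp, selected) for every unit, in recognition order.

    units: iterable of (recognized character string, reliability) pairs.
    """
    results = []
    for text, reliability in units:                 # S202, S206, S207
        wp = transcription_accuracy(reliability, max_reliability, n_levels)  # S203
        results.append((text, wp, wp >= allowable_p))  # S204, S205
    return results


# Hypothetical recognition result and an allowable value P of 4 (out of N = 5).
sample_units = [("Hello", 100.0), ("ABC", 50.0), ("company", 80.0)]
selection = select_by_accuracy(sample_units, allowable_p=4)
selected_flags = [flag for _, _, flag in selection]
```

With these sample values the transcription accuracies are 5.0, 2.5, and 4.0, so only the second string falls below the allowable value and is left unselected.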

FIG. 7 is a diagram showing a data example (part 1) of the recognized character string selection result D2 according to the present embodiment. FIG. 7 shows the result of selecting recognized character strings based on the transcription accuracy wp calculated by (Equation 1), with N = 5, the reliability of the target character string w = 4, and the maximum value of the reliability P = 100. As shown, the selection unit 13 obtains a recognized character string selection result D2 that includes, for example, the recognition ID, the recognized character string, the reliability of the recognized character string, the transcription accuracy wp, and the selection result. The selection unit 13 may also select recognized character strings based on the work amount required for transcription (for example, "work time" or "work cost").

FIG. 8 is a flowchart showing a processing example (part 2) at the time of recognized character string selection according to the present embodiment. FIG. 8 shows a processing example in which the selection unit 13 uses an allowable value of the work time required for transcription as the parameter related to the work amount required for transcription.

As illustrated in FIG. 8, the selection unit 13 first accepts the setting of the allowable value T of the work time required for transcription from the worker U (step S301).

FIG. 9 is a diagram showing a setting example of the allowable value T of transcription work time according to the present embodiment. As shown in FIG. 9, the worker U sets the allowable value T of the work time required for transcription via, for example, a slide-type UI (slide bar) on which a time between 00:00:00 and HH:MM:SS can be designated. The selection unit 13 displays this UI on the screen and accepts the setting from the worker U. The maximum designatable time is, for example, a predetermined value. Alternatively, the maximum may be calculated in one of the following ways. For example, a work time per character may be fixed in advance, and the product of the total number of characters in the recognized character strings obtained by the recognition unit 12 and the work time per character may be used. Or, when the recognition unit 12 outputs the start time and end time of each recognized character string as part of the recognition result, the time obtained by subtracting the start time from the end time of each recognized character string (the utterance time) may be calculated, and the sum of the utterance times of all recognized character strings may be used.
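The two ways of deriving the slider's maximum time described above can be sketched as follows. This is a minimal Python sketch; the function names, the seconds-based units, and the HH:MM:SS formatting helper are assumptions made for illustration.

```python
def max_time_from_chars(texts, seconds_per_char):
    """Maximum time = total number of recognized characters x work time per character."""
    return seconds_per_char * sum(len(text) for text in texts)


def max_time_from_timestamps(spans):
    """Maximum time = sum of utterance times (end time minus start time) of all strings.

    spans: iterable of (start_seconds, end_seconds) pairs.
    """
    return sum(end - start for start, end in spans)


def format_hhmmss(seconds):
    """Format a duration in seconds as HH:MM:SS for display on the slide bar."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical inputs: two recognized strings, 2 seconds of work per character.
upper_by_chars = max_time_from_chars(["Hello", "ABC"], seconds_per_char=2.0)
upper_by_speech = max_time_from_timestamps([(0.0, 1.5), (1.5, 4.0)])
```

Either value could then be passed to `format_hhmmss` to label the upper end of the slide bar.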

Returning to FIG. 8, the selection unit 13 next sorts the recognition results obtained by the recognition unit 12 in descending order of the reliability of the recognized character strings (step S302). The selection unit 13 then initializes the cumulative work time st, which represents the accumulated work time required for transcription (step S303).

Next, the selection unit 13 sets the first recognized character string among the recognition results sorted in descending order as the target character string w (step S304), and calculates the work time t required to transcribe the target character string w (step S305). At this time, the selection unit 13 calculates the work time t by, for example, (Equation 2) below, which uses the number of characters in the target character string w.
Work time t required for transcription = α × (number of characters in target character string w) ... (Equation 2)
Here, α is, for example, the average time required to transcribe one character.

Alternatively, when the recognition unit 12 outputs the start time and end time of each recognized character string as part of the recognition result, the selection unit 13 may calculate the work time t required to transcribe the target character string w by (Equation 3).
Work time t required for transcription = β × (end time of target character string w − start time of target character string w) ... (Equation 3)
Here, β is, for example, the average time required to transcribe one morpheme (one recognition unit).

Next, the selection unit 13 calculates the accumulated work time st required for transcription from the work time t required to transcribe the target character string w (step S306). At this time, the selection unit 13, for example, adds the work time t calculated by (Equation 2) or (Equation 3) to the accumulated work time st.

次に選択部１３は、算出した書き起こしに要する累積作業時間ｓｔと書き起こし作業時間の許容値Ｔを比較し、累積作業時間ｓｔが許容値Ｔ以下か否かを判定する（ステップＳ３０７）。その結果、選択部１３は、累積作業時間ｓｔが許容値Ｔ以下と判定した場合（ステップＳ３０７：Ｙｅｓ）、対象文字列ｗを選択する（ステップＳ３０８）。一方、選択部１３は、累積作業時間ｓｔが許容値Ｔより大きいと判定した場合（ステップＳ３０７：Ｎｏ）、対象文字列ｗを選択しない。 Next, the selection unit 13 compares the calculated cumulative work time st required for the transcription with the allowable value T of the transcription work time, and determines whether the cumulative work time st is equal to or less than the allowable value T (step S307). As a result, when it is determined that the accumulated work time st is equal to or less than the allowable value T (step S307: Yes), the selection unit 13 selects the target character string w (step S308). On the other hand, if the selection unit 13 determines that the accumulated work time st is greater than the allowable value T (step S307: No), the selection unit 13 does not select the target character string w.

Next, the selection unit 13 determines whether the recognition results obtained by the recognition unit 12 contain a next recognized character string (step S309). When it determines that there is a next recognized character string (step S309: Yes), the selection unit 13 sets that string as the target character string w (step S310) and repeats the processing of steps S305 to S309. When it determines that there is no next recognized character string (step S309: No), the selection unit 13 ends the processing.
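The loop of steps S302 to S310 can be sketched as below, assuming the per-character estimate of work time. The function name, the dictionary keys, and the sample values are assumptions made for illustration. Because the accumulated work time st is increased at step S306 before the comparison at step S307, it keeps growing even for strings that are not selected, so once the allowable value T is exceeded no later string is selected.

```python
def select_within_time(recognized, allowed_time, alpha):
    """Select recognized strings by reliability under a work-time budget T.

    recognized: list of dicts with "id", "text", "confidence" (hypothetical keys).
    alpha: assumed average time needed to transcribe one character.
    Returns the set of ids of the selected strings.
    """
    # S302: sort in descending order of reliability
    ordered = sorted(recognized, key=lambda r: r["confidence"], reverse=True)
    cumulative = 0.0  # S303: accumulated work time st
    selected = set()
    for w in ordered:  # S304 / S310: take each target string w in turn
        t = alpha * len(w["text"])  # S305: work time t from the character count
        cumulative += t  # S306: accumulate before comparing
        if cumulative <= allowed_time:  # S307
            selected.add(w["id"])  # S308
    return selected


items = [
    {"id": 1, "text": "abcd", "confidence": 0.9},
    {"id": 2, "text": "abcdefgh", "confidence": 0.8},
    {"id": 3, "text": "ab", "confidence": 0.5},
]
print(select_within_time(items, allowed_time=6.0, alpha=1.0))
```

In this example only the string with id 1 is selected: the string with id 3 would fit into the remaining budget on its own, but the work time of the unselected string with id 2 has already been added to st.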

FIG. 10 is a diagram showing a data example (part 2) of the recognized character string selection result D2 according to the present embodiment. FIG. 10 shows the result of selecting recognized character strings based on the work time t required for transcription calculated by (Equation 3). In this way, the selection unit 13 obtains a recognized character string selection result D2 that includes, for example, the recognition ID, the recognized character string, the reliability of the recognized character string, the work time t required for transcription, the accumulated work time st, and the selection result.

図１１は、本実施形態に係る認識文字列選択時の処理例（その３）を示すフローチャートである。図１１には、選択部１３が、書き起こしに要する作業量に関するパラメータとして、書き起こしに要する作業コストの許容値を用いる場合の処理例が示されている。 FIG. 11 is a flowchart showing a processing example (No. 3) when a recognized character string is selected according to the present embodiment. FIG. 11 shows an example of processing when the selection unit 13 uses an allowable value for the work cost required for transcription as a parameter related to the work amount required for transcription.

図１１に示すように、選択部１３は、まず、作業者Ｕから、書き起こしに要する作業コストの許容値Ｃの設定を受け付ける（ステップＳ４０１）。 As illustrated in FIG. 11, the selection unit 13 first receives a setting of an allowable value C of work cost required for transcription from the worker U (step S401).

FIG. 12 is a diagram showing an example of setting the allowable value C of the transcription work cost according to the present embodiment. As shown in FIG. 12, the worker U sets the allowable value C of the work cost required for transcription via, for example, a slide-type UI (slide bar) on which a value between 0 and a maximum value can be specified. The selection unit 13 thus displays this UI on the screen and accepts the setting from the worker U. The maximum specifiable value is, for example, a predetermined value. Alternatively, it may be calculated as follows. For example, a work time per character may be fixed in advance, and the product of the total number of characters in the recognized character strings obtained by the recognition unit 12 and the work time per character may be used. Alternatively, when the recognition unit 12 outputs the utterance time of each recognized character string (its end time minus its start time) as a recognition result, the product of the sum of the utterance times of the recognized character strings and the work cost per unit time may be used.

Returning to the description of FIG. 11, the selection unit 13 next sorts the recognition results obtained by the recognition unit 12 in descending order of the reliability of the recognized character strings (step S402). Next, the selection unit 13 initializes the accumulated work cost sc, which indicates the accumulated work cost required for transcription (step S403).

Next, the selection unit 13 sets the first recognized character string among the recognition results sorted in descending order as the target character string w (step S404), and calculates the work cost c required to transcribe the target character string w (step S405). At this time, the selection unit 13 calculates the work cost c using, for example, the following (Equation 4), which is based on the number of characters in the target character string w.
Work cost required for transcription c = γ × (number of characters in target character string w) ... (Equation 4)
Here, γ is, for example, the average cost of transcribing one character.

Alternatively, when the recognition unit 12 outputs the start time and end time of each recognized character string as a recognition result, the selection unit 13 may calculate the work cost c required to transcribe the target character string w using (Equation 5).
Work cost required for transcription c = ζ × (end time of target character string w − start time of target character string w) ... (Equation 5)
Here, ζ is, for example, the average cost of transcribing one morpheme (one recognition unit).

Next, the selection unit 13 calculates the accumulated work cost sc required for transcription from the work cost c required to transcribe the target character string w (step S406). At this time, the selection unit 13, for example, adds the work cost c calculated by (Equation 4) or (Equation 5) to the accumulated work cost sc.

次に選択部１３は、算出した書き起こしに要する累積作業コストｓｃと書き起こし作業コストの許容値Ｃを比較し、累積作業コストｓｃが許容値Ｃ以下か否かを判定する（ステップＳ４０７）。その結果、選択部１３は、累積作業コストｓｃが許容値Ｃ以下と判定した場合（ステップＳ４０７：Ｙｅｓ）、対象文字列ｗを選択する（ステップＳ４０８）。一方、選択部１３は、累積作業コストｓｃが許容値Ｃより大きいと判定した場合（ステップＳ４０７：Ｎｏ）、対象文字列ｗを選択しない。 Next, the selection unit 13 compares the calculated cumulative work cost sc required for transcription with the allowable value C of the transcription work cost, and determines whether the cumulative work cost sc is equal to or less than the allowable value C (step S407). As a result, when it is determined that the accumulated work cost sc is equal to or less than the allowable value C (step S407: Yes), the selection unit 13 selects the target character string w (step S408). On the other hand, when the selection unit 13 determines that the accumulated work cost sc is larger than the allowable value C (step S407: No), the selection unit 13 does not select the target character string w.

Next, the selection unit 13 determines whether the recognition results obtained by the recognition unit 12 contain a next recognized character string (step S409). When it determines that there is a next recognized character string (step S409: Yes), the selection unit 13 sets that string as the target character string w (step S410) and repeats the processing of steps S405 to S409. When it determines that there is no next recognized character string (step S409: No), the selection unit 13 ends the processing.

FIG. 13 is a diagram showing a data example (part 3) of the recognized character string selection result D2 according to the present embodiment. FIG. 13 shows the result of selecting recognized character strings based on the work cost c required for transcription calculated by (Equation 5). In this way, the selection unit 13 obtains a recognized character string selection result D2 that includes, for example, the recognition ID, the recognized character string, the reliability of the recognized character string, the work cost c required for transcription, the accumulated work cost sc, and the selection result.
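As a rough illustration of how a selection result such as D2 might be assembled with the duration-based cost of (Equation 5), the sketch below builds one row per recognized character string. The `SelectionRow` class, its field names, the dictionary keys, and the sample numbers are all assumptions; the actual layout of FIG. 13 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class SelectionRow:
    recognition_id: int
    text: str
    confidence: float
    cost: float        # work cost c of this string
    cumulative: float  # accumulated work cost sc after adding c
    selected: bool


def build_selection_result(recognized, allowed_cost, zeta):
    """Build rows resembling D2 using c = zeta * (end - start).

    recognized: list of dicts with "id", "text", "confidence", "start", "end".
    """
    # S402: sort in descending order of reliability
    ordered = sorted(recognized, key=lambda r: r["confidence"], reverse=True)
    rows = []
    cumulative = 0.0  # S403: accumulated work cost sc
    for w in ordered:
        cost = zeta * (w["end"] - w["start"])  # S405: cost from utterance time
        cumulative += cost                     # S406
        rows.append(SelectionRow(w["id"], w["text"], w["confidence"],
                                 cost, cumulative,
                                 cumulative <= allowed_cost))  # S407 / S408
    return rows


data = [
    {"id": 1, "text": "first", "confidence": 0.6, "start": 0.0, "end": 2.0},
    {"id": 2, "text": "second", "confidence": 0.9, "start": 2.0, "end": 3.0},
    {"id": 3, "text": "third", "confidence": 0.8, "start": 3.0, "end": 7.0},
]
for row in build_selection_result(data, allowed_cost=50.0, zeta=10.0):
    print(row)
```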

(Generation unit 14)
The generation unit 14 generates a transcript using the recognized character strings selected by the selection unit 13 and the recognized character strings that were not selected.

The processing by which the generation unit 14 generates a transcript is described below. FIG. 14 is a flowchart showing a processing example when generating a transcript according to the present embodiment. FIG. 15 is a diagram showing an example of the data format of the transcript according to the present embodiment.

図１４に示すように、生成部１４は、まず、書き起こし文ｋを初期化する（ステップＳ５０１）。書き起こし文ｋは、例えば、データ形式がＨＴＭＬ（HyperText Markup Language）の場合、図１５に示すように、ＤＩＶ要素として作成される。 As illustrated in FIG. 14, the generation unit 14 first initializes the transcription sentence k (step S501). For example, when the data format is HTML (HyperText Markup Language), the transcription sentence k is created as a DIV element as shown in FIG.

Next, the generation unit 14 sets the first recognized character string among the recognition results obtained by the recognition unit 12 as the target character string w (step S502), and determines whether the target character string w has been selected by the selection unit 13 (step S503). When it determines that the target character string w has been selected (step S503: Yes), the generation unit 14 creates a selected element s from the target character string w (step S504) and adds the created selected element s to the transcript k (step S505). As shown in FIG. 15, for example, the selected element s is created as a SPAN element whose ID attribute is the identification ID of the target character string w and whose CLASS attribute is a character string indicating a selected element (for example, "selected"). When it determines that the target character string w has not been selected (step S503: No), the generation unit 14 creates a non-selected element ns from the target character string w (step S506) and adds the created non-selected element ns to the transcript k (step S507). As shown in FIG. 15, for example, the non-selected element ns is created as a SPAN element whose ID attribute is the identification ID of the target character string w and whose CLASS attribute is a character string indicating a non-selected element (for example, "not_selected").

Next, the generation unit 14 determines whether the recognition results obtained by the recognition unit 12 contain a next recognized character string (step S508). When it determines that there is a next recognized character string (step S508: Yes), the generation unit 14 sets that string as the target character string w (step S509) and repeats the processing of steps S503 to S508. When it determines that there is no next recognized character string (step S508: No), the generation unit 14 ends the processing.
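A minimal sketch of steps S501 to S509 for the HTML case is given below. It assumes that each recognition result carries an `id` and a `text` key and that the selection is available as a set of ids; these names, and the `id="transcript"` attribute on the DIV element, are illustrative assumptions. The class names `selected` and `not_selected` follow the examples in the description.

```python
from html import escape


def build_transcript_html(recognized, selected_ids):
    """Build a transcript as a DIV element containing one SPAN per string.

    recognized: list of dicts with "id" and "text", in recognition order.
    selected_ids: set of ids chosen by the selection step.
    """
    parts = ['<div id="transcript">']  # S501: initialize the transcript k
    for w in recognized:  # S502 / S509: walk the results in their original order
        # S503: was this string selected?
        css_class = "selected" if w["id"] in selected_ids else "not_selected"
        # S504-S507: create the SPAN element and append it to the transcript
        parts.append(
            f'<span id="{escape(str(w["id"]))}" class="{css_class}">'
            f'{escape(w["text"])}</span>'
        )
    parts.append("</div>")
    return "".join(parts)


print(build_transcript_html(
    [{"id": 1, "text": "good morning"}, {"id": 2, "text": "every one"}],
    selected_ids={1},
))
```

Keeping the selection state in the CLASS attribute means that the display variations (underline, smaller size, shading, and so on) could be applied by styling alone, without regenerating the transcript.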

FIG. 16 is a diagram showing display examples of the transcript k according to the present embodiment. As shown in FIG. 16, the generation unit 14 may generate a transcript k that can be displayed in different manners so that the character strings of the selected elements s are clearly distinguished from those of the non-selected elements ns. For example, FIG. 16(A) shows a display example in which the character strings of the non-selected elements ns are underlined. FIG. 16(B) shows a display example in which the character strings of the non-selected elements ns are displayed in a smaller character size than those of the selected elements s. FIG. 16(C) shows a display example in which the character strings of the non-selected elements ns are shaded. FIG. 16(D) shows a display example in which the character strings of the non-selected elements ns are replaced with a predetermined character (a black circle in the figure). Other display examples vary the character density, color, typeface, background color, and the like. Further, when the recognition unit 12 outputs, for each recognition unit, recognized character strings up to the N-th candidate in descending order of reliability (N is an integer of 1 or more), the generation unit 14 may generate a transcript k in which, for each recognized character string that was not selected, the candidates up to the N-th are displayed so that the worker U can choose among them.

(Setting unit 15)
The setting unit 15 sets the character insertion position (the position at which character input starts) based on the non-selected elements ns of the transcript k generated by the generation unit 14. At this time, the setting unit 15 sets the character insertion position based on the detected current character insertion position and on the positional relationship, within the transcript, between the selected elements, which correspond to the recognized character strings selected by the selection unit 13, and the non-selected elements, which correspond to the recognized character strings not selected by the selection unit 13.

以下に、設定部１５が文字挿入位置を設定する処理について説明する。図１７は、本実施形態に係る文字挿入位置設定時の処理例を示すフローチャートである。 Hereinafter, a process in which the setting unit 15 sets the character insertion position will be described. FIG. 17 is a flowchart showing a processing example when setting the character insertion position according to the present embodiment.

As shown in FIG. 17, the setting unit 15 first receives, from the worker U, an instruction to move to the characters of a non-selected element ns (step S601). At this time, for example, when the setting unit 15 detects that a predetermined key (for example, the Tab key) has been pressed in the displayed transcript, it determines that a move has been instructed and accepts the instruction.
Next, the setting unit 15 detects the current character insertion position cp in the transcript (step S602). The current character insertion position cp is the current insertion position within the character string of the transcript. On a screen displaying the transcript, for example, it corresponds to the cursor position (for example, the position where a vertical bar blinks).

Next, the setting unit 15 determines whether the detected current character insertion position cp is within a selected element (step S603). When it determines that the current character insertion position cp is within a selected element (step S603: Yes), the setting unit 15 detects the non-selected element ns that is behind the character insertion position cp and closest to it (step S604). When it determines that the character insertion position cp is not within a selected element (step S603: No), the setting unit 15 detects the selected element s that is behind the character insertion position cp and closest to it (step S605), and then detects the non-selected element ns that is behind the detected selected element s and closest to it (step S606). Next, the setting unit 15 moves the character insertion position cp to the head position nsp of the detected non-selected element ns (step S607).
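The cursor movement of steps S603 to S607 can be sketched as follows. The representation of the transcript as a list of `(text, selected)` pairs, the use of a character offset for cp, and the behavior when no target element exists (the position is left unchanged) are assumptions made for this sketch and are not stated in the description.

```python
def next_insertion_position(elements, cp):
    """Return the new insertion position after a move instruction.

    elements: list of (text, selected) pairs in document order.
    cp: current insertion position as a character offset into the transcript.
    An offset on a boundary is treated as belonging to the element that starts there.
    """
    # Start offset of every element
    starts, offset = [], 0
    for text, _ in elements:
        starts.append(offset)
        offset += len(text)

    # S602/S603: find the element that contains cp
    current = next((i for i, (text, _) in enumerate(elements)
                    if starts[i] <= cp < starts[i] + len(text)), None)
    if current is None:
        return cp  # assumption: leave the cursor where it is

    anchor = current
    if not elements[current][1]:
        # S605: cp is in a non-selected element; find the nearest selected element behind it
        anchor = next((j for j in range(current + 1, len(elements))
                       if elements[j][1]), None)
        if anchor is None:
            return cp

    # S604 / S606: nearest non-selected element behind the anchor
    target = next((j for j in range(anchor + 1, len(elements))
                   if not elements[j][1]), None)
    return cp if target is None else starts[target]  # S607: move to head position nsp


doc = [("aaa", True), ("bb", False), ("cc", False), ("dddd", True), ("e", False)]
print(next_insertion_position(doc, 1))
print(next_insertion_position(doc, 4))
```

Under this reading, a move from inside a run of consecutive non-selected elements skips the rest of that run and lands on the first non-selected element after the next selected one.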

After moving the character insertion position cp to the head position nsp of the non-selected element ns, if other non-selected elements follow consecutively behind the non-selected element ns, the setting unit 15 may display the character string of the non-selected element ns and the character strings of the other non-selected elements in different manners. For example, the setting unit 15 may highlight the character string of the non-selected element ns and the character strings of the other non-selected elements with different background colors.

(Search unit 16)
When the worker U starts character input at the character insertion position cp, the search unit 16 searches for the voice position corresponding to the input characters.

以下に、探索部１６が音声位置を探索する処理について説明する。図１８は、本実施形態に係る音声位置探索時の処理例を示すフローチャートである。 Hereinafter, a process in which the search unit 16 searches for a voice position will be described. FIG. 18 is a flowchart showing an example of processing at the time of voice position search according to the present embodiment.

図１８に示すように、設定部１５は、まず、作業者Ｕから、現在の文字挿入位置ｃｐに対応する音声位置の探索指示を受け付ける（ステップＳ７０１）。このとき探索部１６は、例えば、表示された書き起こし文内でＥｎｔｅｒキーが押下されたことを検出した場合、探索が指示されたと判断し、指示を受け付ける。 As illustrated in FIG. 18, the setting unit 15 first receives a search instruction for a voice position corresponding to the current character insertion position cp from the worker U (step S701). At this time, for example, when detecting that the Enter key is pressed in the displayed transcript, the search unit 16 determines that the search is instructed, and accepts the instruction.

次に探索部１６は、書き起こし文内の現在の文字挿入位置ｃｐを検出する（ステップＳ７０２）。次に探索部１６は、検出した現在の文字挿入位置ｃｐが選択要素内か否かを判定する（ステップＳ７０３）。 Next, the search unit 16 detects the current character insertion position cp in the transcript (step S702). Next, the search unit 16 determines whether or not the detected current character insertion position cp is within the selected element (step S703).

When it determines that the current character insertion position cp is within a selected element (step S703: Yes), the search unit 16 sets the start time of the selected element s as the voice position p (step S704). When it determines that the current character insertion position cp is not within a selected element (step S703: No), the search unit 16 estimates the voice position p using a predetermined speech recognition technique (for example, forced alignment) (step S705). At this time, the search unit 16 performs the estimation from the transcript k, the start time of the recognized character string corresponding to the non-selected element ns in which the character insertion position cp lies, the current audio playback position, and the like.

(Playback unit 17)
The playback unit 17 plays back the audio from the voice position p found by the search unit 16.

<Summary>
As described above, the text generation device 100 according to the present embodiment selects recognized character strings recognized from speech, and generates a transcript, based on the reliability of the recognized character strings calculated from the speech recognition result and on various parameters concerning the working conditions of the transcription work specified by the worker U (at least one of a parameter concerning transcription accuracy and a parameter concerning the amount of work required for transcription).

In this way, the text generation device 100 according to the present embodiment adjusts the output of the speech recognition result according to the working conditions specified by the worker U. When the worker U adds to or corrects the adjusted output, the text generation device 100 uses the speech recognition result to synchronize the input characters with the audio, thereby providing an environment in which transcription work can be carried out.

As a result, the text generation device 100 according to the present embodiment allows a speech recognition result suited to the working conditions of the transcription to be used during transcription work, and allows characters to be easily added to or corrected in that result. The text generation device 100 according to the present embodiment can thereby reduce the burden of transcription work on the worker U.

[Second Embodiment]
<Outline>
The functions (text generation functions) of the text generation device according to the present embodiment are described below. The text generation device according to the present embodiment differs from the above embodiment in that the recognition results obtained by the recognition unit are combined in sentence units or time units, and the combined results are used for the transcript. More specifically, the text generation device according to the present embodiment combines the recognition results into sentence units based on the sentence-end expressions of the recognized character strings and uses the result for the transcript. Alternatively, it combines the recognition results into predetermined time units based on the start times and end times of the recognized character strings and uses the result for the transcript.

以下に、本実施形態に係るテキスト生成装置が有する機能の構成とその動作について説明する。なお、以下の説明では、上記実施形態と異なる事項について説明し、同じ事項については同一符号を付し、その説明を省略する。 In the following, the functional configuration and operation of the text generating apparatus according to the present embodiment will be described. In the following description, items different from the above embodiment will be described, the same items will be denoted by the same reference numerals, and description thereof will be omitted.

<<Configuration>>
FIG. 19 is a diagram showing a functional configuration example of the text generation device 100 according to the present embodiment. As shown in FIG. 19, the text generation device 100 according to the present embodiment further includes a combining unit 21, a recognition combination result holding unit 22, and the like, in addition to the functional configuration of the first embodiment.

The combining unit 21 combines the recognition results obtained by the recognition unit 12 (the recognition results stored in the recognition result holding unit 18) in sentence units or time units, and stores the combined results in the recognition combination result holding unit 22. The recognition combination result holding unit 22 corresponds to, for example, a predetermined storage area of a storage device included in the text generation device 100. The selection unit 13 and the search unit 16 use the recognition combination results stored in the recognition combination result holding unit 22.

The basic processing at the time of text generation executed by the text generation device 100 according to the present embodiment is described below.
<<Processing>>
FIG. 20 is a flowchart showing an example of basic processing at the time of text generation according to the present embodiment. As shown in FIG. 20, the acquisition unit 11 acquires audio (step S801). Next, the recognition unit 12 recognizes the audio acquired by the acquisition unit 11, and calculates a recognized character string and its reliability for each recognition unit (step S802). The recognized character strings and their reliabilities are stored in the recognition result holding unit 18.

次に結合部２１は、認識部１２の認識結果を、所定の文単位、又は、所定の時間単位で結合する（ステップＳ８０３）。その結果、結合された認識文字列と結合後の認識文字列の信頼度は、認識結合結果として認識結合結果保持部２２に記憶される。次に選択部１３は、書き起こし作業の作業条件に関する各種パラメータ（作業条件パラメータ）と、認識結合結果保持部２２に記憶された認識結合結果の信頼度（結合後の認識文字列の信頼度）とに基づき、書き起こし文に用いる、少なくとも１つの認識文字列を選択する（ステップＳ８０４）。このとき選択部１３は、書き起こし精度に関するパラメータと認識文字列の信頼度、又は、書き起こしに要する作業量に関するパラメータと認識文字列の信頼度の、いずれかのパラメータと信頼度との組み合わせに基づき、書き起こし文に用いる認識文字列を選択する。 Next, the combining unit 21 combines the recognition results of the recognition unit 12 in a predetermined sentence unit or a predetermined time unit (step S803). As a result, the reliability of the combined recognized character string and the combined recognized character string is stored in the recognized combined result holding unit 22 as a recognized combined result. Next, the selection unit 13 performs various parameters (working condition parameters) related to the work conditions of the transcription work and the reliability of the recognition combination result stored in the recognition combination result holding unit 22 (reliability of the recognized character string after combination). Based on the above, at least one recognized character string to be used for the transcript is selected (step S804). At this time, the selection unit 13 selects a parameter relating to the transcription accuracy and the reliability of the recognized character string, or a combination of any parameter and reliability of the parameter relating to the work amount required for the transcription and the reliability of the recognized character string. Based on this, the recognition character string used for the transcript is selected.

次に生成部１４は、選択部１３で選択された認識文字列と、選択部１３で選択されなかった認識文字列とを用いて、書き起こし文を生成する（ステップＳ８０５）。次に設定部１５は、選択部１３で選択されなかった認識文字列に対応する書き起こし文に対して、作業者Ｕから受け付けた設定に従い、作業者Ｕによる文字挿入位置を設定する（ステップＳ８０６）。次に探索部１６は、設定部１５で設定された文字挿入位置に対応する音声位置を、認識結果に基づいて探索する（ステップＳ８０７）。 Next, the generation unit 14 generates a transcript using the recognized character string selected by the selection unit 13 and the recognized character string not selected by the selection unit 13 (step S805). Next, the setting unit 15 sets the character insertion position by the worker U according to the setting received from the worker U for the transcript corresponding to the recognized character string not selected by the selection unit 13 (step S806). ). Next, the search unit 16 searches for a voice position corresponding to the character insertion position set by the setting unit 15 based on the recognition result (step S807).

Next, the reproduction unit 17 plays back the voice from the voice position found by the search unit 16, in accordance with the designation received from the worker U (step S808). The text generation device 100 then accepts character input (addition or correction) from the worker U (step S809).

When the text generation device 100 according to the present embodiment receives an instruction from the worker U to end transcription (step S810: Yes), it ends the processing. Otherwise, until the worker U gives the instruction to end transcription (step S810: No), the text generation device 100 repeats the processing of steps S807 to S809.

<Details>
The following mainly describes the combining unit 21 and the selection unit 13 in detail.

<< Details of each functional unit >>
(Combining unit 21)
The combining unit 21 combines the recognition results into sentence units based on the sentence-ending expressions of the recognized character strings, and obtains a recognition combination result. Alternatively, the combining unit 21 combines the recognition results into predetermined time units based on the start times and end times of the recognized character strings, and obtains a recognition combination result that includes the combined character string (the combined recognized character string) and the reliability of the combination result.

The processing by which the combining unit 21 combines the recognition results is described below. FIG. 21 is a flowchart illustrating an example of the processing performed when combining recognition results according to the present embodiment.

As shown in FIG. 21, the combining unit 21 first initializes a temporary combination result cr for the recognition results obtained by the recognition unit 12 (the recognition results stored in the recognition result holding unit 18) (step S901). Next, the combining unit 21 sets the first of the recognition results obtained by the recognition unit 12 as the target recognition result r (step S902). Next, the combining unit 21 adds the target recognition result r to the temporary combination result cr (step S903).

Next, the combining unit 21 determines whether or not to complete the combination (step S904). The determination processing differs depending on whether the combining unit 21 combines in sentence units or in time units.

(A) Determination processing when combining in sentence units
The combining unit 21 determines whether or not to complete the combination based on whether the recognized character string of the target recognition result r is the end of a sentence. If the recognized character string of the target recognition result r is the end of a sentence, the combining unit 21 determines that the combination is complete (step S904: Yes). If it is not the end of a sentence, the combining unit 21 determines that the combination is not complete (step S904: No). Methods for determining the end of a sentence include, for example, checking whether the recognized character string contains a character or symbol that marks the end of a sentence, such as "。" (Japanese full stop), "．" (period), or "？" (question mark). If no such character or symbol is present, the determination may instead be based on whether the recognized character string contains a predetermined sentence-ending expression such as "です" (desu) or "ます" (masu).
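The sentence-end determination in (A) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `is_sentence_end` and the two tuples of marks and expressions are assumptions chosen to mirror the examples given in the text.

```python
# Hypothetical sketch of the sentence-end check in (A).
# The mark and expression lists are illustrative assumptions, not an exhaustive set.
SENTENCE_END_MARKS = ("。", "．", ".", "？", "?")
SENTENCE_END_EXPRESSIONS = ("です", "ます")


def is_sentence_end(recognized: str) -> bool:
    """Return True if the recognized character string is judged to end a sentence."""
    text = recognized.strip()
    if not text:
        return False
    # First look for a character or symbol that marks the end of a sentence.
    if any(mark in text for mark in SENTENCE_END_MARKS):
        return True
    # Otherwise fall back to predetermined sentence-ending expressions.
    return any(expr in text for expr in SENTENCE_END_EXPRESSIONS)
```

The check uses containment rather than a suffix match because the text speaks of the mark or expression being "contained" in the recognized character string; a stricter suffix test would be an equally plausible reading.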

(B) Determination processing when combining in time units
The combining unit 21 determines whether or not to complete the combination based on the start times and end times of the recognized character strings obtained as recognition results. Specifically, the combining unit 21 determines that the combination is complete (step S904: Yes) when the elapsed time between the start time of the recognized character string of the target recognition result r and the end time of the recognized character string of the recognition result that was added to the temporary combination result cr immediately before the target recognition result r is equal to or longer than a predetermined time. When the elapsed time is shorter than the predetermined time, the combining unit 21 determines that the combination is not complete (step S904: No). Alternatively, the combining unit 21 may determine that the combination is complete when the elapsed time between the start time of the target recognition result r and the start time of the recognized character string of the first recognition result added to the temporary combination result cr is equal to or longer than the predetermined time.
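The two time-based criteria in (B) can be sketched as follows. The `TimedResult` type, its field names, the use of seconds, and both function names are assumptions made for illustration; the text does not specify a data layout or time unit.

```python
from dataclasses import dataclass


@dataclass
class TimedResult:
    """Hypothetical recognition result with timing; times are assumed to be in seconds."""
    text: str
    start: float
    end: float


def pause_reached(previous: TimedResult, target: TimedResult, threshold: float) -> bool:
    """Main criterion: time from the end of the previously added result
    to the start of the target result is at least the threshold."""
    return target.start - previous.end >= threshold


def span_reached(first: TimedResult, target: TimedResult, threshold: float) -> bool:
    """Alternative criterion: time from the start of the first result in cr
    to the start of the target result is at least the threshold."""
    return target.start - first.start >= threshold
```

Under this reading, the main criterion closes a unit at a sufficiently long pause, while the alternative caps the duration of a combined unit.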

When the combining unit 21 determines that the combination is complete (step S904: Yes), it calculates the reliability of the temporary combination result cr (step S905). The reliability of the temporary combination result cr is calculated from the reliabilities of the recognized character strings of the recognition results that have been added to cr. For example, the average of those reliabilities is calculated and used as the reliability of the temporary combination result cr. When the combining unit 21 determines that the combination is not complete (step S904: No), it skips steps S905 to S907 and proceeds to step S908, described later.

Next, the combining unit 21 stores, in the recognition combination result holding unit 22, the character string obtained by concatenating the recognized character strings in the temporary combination result cr (the combined recognized character string) together with the calculated reliability of cr (step S906), and then initializes the temporary combination result cr (step S907).

Next, the combining unit 21 determines whether the recognition results obtained by the recognition unit 12 contain a next recognition result (step S908). If there is a next recognition result (step S908: Yes), the combining unit 21 sets it as the target recognition result r (step S909) and repeats the processing of steps S903 to S908. If there is no next recognition result (step S908: No), the combining unit 21 determines whether any recognition result remains in the temporary combination result cr (step S910). If a recognition result remains in cr (step S910: Yes), the combining unit 21 proceeds to step S905. If no recognition result remains in cr (step S910: No), the combining unit 21 ends the processing.
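The flow of steps S901 to S910 can be sketched as the loop below. This is a minimal sketch under stated assumptions: `RecognitionResult`, `combine`, and the `is_complete` callback are hypothetical names, the completion test is passed in so that either the sentence-unit or the time-unit criterion can be plugged in, and the reliability uses the averaging that the text gives as one example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RecognitionResult:
    """Hypothetical recognition result: a recognized string and its reliability."""
    text: str
    reliability: float


def _flush(cr: List[RecognitionResult]) -> Tuple[str, float]:
    # S905: average the reliabilities; S906: concatenate the recognized strings.
    text = "".join(r.text for r in cr)
    reliability = sum(r.reliability for r in cr) / len(cr)
    return text, reliability


def combine(
    results: List[RecognitionResult],
    is_complete: Callable[[List[RecognitionResult], RecognitionResult], bool],
) -> List[Tuple[str, float]]:
    """Combine recognition results into units, returning (combined string, reliability)."""
    combined: List[Tuple[str, float]] = []
    cr: List[RecognitionResult] = []      # S901: initialize temporary result
    for r in results:                     # S902, S908, S909: iterate over targets
        cr.append(r)                      # S903: add target to cr
        if is_complete(cr, r):            # S904: sentence- or time-based test
            combined.append(_flush(cr))   # S905, S906
            cr = []                       # S907: re-initialize cr
    if cr:                                # S910: leftovers after the last result
        combined.append(_flush(cr))
    return combined
```

For sentence-unit combining, `is_complete` could be as simple as `lambda cr, r: "。" in r.text`; a time-unit variant would compare the timing of `r` with that of earlier entries in `cr`.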

(Selection unit 13)
The selection unit 13 selects at least one recognized character string to be used for the transcript based on one of two combinations: the parameter relating to transcription accuracy and the reliability of the recognition combination result (the reliability of the combined recognized character string), or the parameter relating to the amount of work required for transcription and the reliability of the recognition combination result.

<Summary>
As described above, the text generation device 100 according to the present embodiment selects recognized character strings recognized from the voice and generates a transcript, based on the reliability of the recognized character strings combined in sentence units or predetermined time units and on the parameters relating to the working conditions of the transcription work specified by the worker U (at least one of the transcription accuracy and the amount of work required for transcription).

As a result, as in the first embodiment, the text generation device 100 according to the present embodiment makes it easy to add or correct characters and can reduce the burden of the transcription work on the worker U.

[Third Embodiment]
The function of the text generation device according to the present embodiment (the text generation function) is described below. The text generation device according to the present embodiment differs from the above embodiments in that it selects recognized character strings recognized from the voice and generates a transcript, for each speaker or each utterance section, based on the reliability of the recognized character strings and the parameters relating to the working conditions of the transcription work (the transcription accuracy or the amount of work required for transcription).

<< Configuration >>
FIG. 22 is a diagram illustrating an example of the functional configuration of the text generation device 100 according to the present embodiment. As illustrated in FIG. 22, the text generation device 100 according to the present embodiment further includes an utterance section information generation unit 31, an utterance section information holding unit 32, and the like, in addition to the functional configuration of the first embodiment.

For the voice acquired by the acquisition unit 11, the utterance section information generation unit 31 generates utterance section information that includes an utterance ID identifying each utterance, the time at which the utterance started (hereinafter "utterance start time"), a speaker ID identifying the speaker of the utterance, and the like, and stores the generated utterance section information in the utterance section information holding unit 32. The utterance section information holding unit 32 corresponds to, for example, a predetermined storage area of a storage device included in the text generation device 100. The selection unit 13 and the search unit 16 use the utterance section information stored in the utterance section information holding unit 32.

The basic processing performed by the text generation device 100 according to the present embodiment when generating text is described below.
<< Processing >>
FIG. 23 is a flowchart illustrating an example of the basic processing at the time of text generation according to the present embodiment. As illustrated in FIG. 23, the acquisition unit 11 acquires a voice (step S1001). Next, the recognition unit 12 recognizes the voice acquired by the acquisition unit 11 and calculates, for each recognition unit, a recognized character string and the reliability of that recognized character string (step S1002). The recognized character strings and their reliabilities are then stored in the recognition result holding unit 18.

Next, for the voice acquired by the acquisition unit 11, the utterance section information generation unit 31 generates, for each utterance, utterance section information including the utterance ID, the utterance start time, and the speaker ID (step S1003). The utterance section information is then stored in the utterance section information holding unit 32.

Next, using the utterance section information stored in the utterance section information holding unit 32, the selection unit 13 selects, for each speaker or each utterance section, at least one recognized character string to be used for the transcript, based on the parameters relating to the working conditions of the transcription work (working condition parameters) and the reliability of the recognized character strings stored in the recognition result holding unit 18 (step S1004). Here, the selection unit 13 selects the recognized character string to be used for the transcript based on one of two combinations: the parameter relating to transcription accuracy and the reliability of the recognized character string, or the parameter relating to the amount of work required for transcription and the reliability of the recognized character string. Next, the generation unit 14 generates a transcript using the recognized character strings selected by the selection unit 13 and the recognized character strings not selected by the selection unit 13 (step S1005).

Next, the setting unit 15 sets the character insertion position for the worker U, in accordance with the setting received from the worker U, in the part of the transcript corresponding to the recognized character strings not selected by the selection unit 13 (step S1006). Next, the search unit 16 searches for the voice position corresponding to the character insertion position set by the setting unit 15, based on the recognition results (step S1007).

Next, the reproduction unit 17 plays back the voice from the voice position found by the search unit 16, in accordance with the designation received from the worker U (step S1008). The text generation device 100 then accepts character input (addition or correction) from the worker U (step S1009).

When the text generation device 100 according to the present embodiment receives an instruction from the worker U to end transcription (step S1010: Yes), it ends the processing. Otherwise, until the worker U gives the instruction to end transcription (step S1010: No), the text generation device 100 repeats the processing of steps S1007 to S1009.

<Details>
The following mainly describes the utterance section information generation unit 31 and the selection unit 13 in detail.

<< Details of each functional unit >>
(Utterance section information generation unit 31)
The utterance section information generation unit 31 identifies the speakers and the utterance sections, and generates the utterance section information, by methods such as the following. For example, the utterance section information generation unit 31 accepts an identification result from the worker U, who has identified the speaker and the utterance start time of each utterance while listening to the voice, and generates the utterance section information from the accepted identification result. Alternatively, the utterance section information generation unit 31 may estimate the speakers and the utterance sections using a speaker recognition technique based on acoustic features, and generate the utterance section information from the estimation result.

FIG. 24 is a diagram illustrating an example of the data of the utterance section information D3 according to the present embodiment. FIG. 24 shows an example of the data generated when the utterance section information generation unit 31 identifies (estimates) a plurality of speakers and their utterance sections from the voice acquired by the acquisition unit 11. In this way, the utterance section information generation unit 31 generates utterance section information D3 that includes, for example, the utterance ID, the utterance start time, and the speaker ID. The utterance section information generation unit 31 stores the generated utterance section information D3 in the utterance section information holding unit 32.

(Selection unit 13)
Using the utterance section information D3 generated by the utterance section information generation unit 31, the selection unit 13 selects recognized character strings recognized from the voice, for each speaker or each utterance section, based on the reliability of the recognized character strings and the parameters relating to the working conditions of the transcription work. More specifically, for each speaker or each utterance section, the selection unit 13 selects at least one recognized character string to be used for the transcript based on the parameter relating to transcription accuracy and the reliability of the recognized character strings. Alternatively, for each speaker or each utterance section, the selection unit 13 selects at least one recognized character string to be used for the transcript based on the parameter relating to the amount of work required for transcription and the reliability of the recognized character strings.

The processing by which the selection unit 13 selects recognized character strings is described below. FIG. 25 is a flowchart illustrating an example of the processing performed when selecting recognized character strings according to the present embodiment. FIG. 25 shows an example in which the selection unit 13 uses an allowable value of transcription accuracy as the parameter relating to transcription accuracy for each speaker.

As shown in FIG. 25, the selection unit 13 first accepts from the worker U the setting of the allowable value P(i) of transcription accuracy for each speaker i (i = 1 to M, where M is the number of speakers) (step S1101).

FIG. 26 is a diagram illustrating an example of setting the allowable value P(i) of transcription accuracy according to the present embodiment. As shown in FIG. 26, the worker U sets the allowable value P(i) of transcription accuracy for each speaker through, for example, a slider-type UI (slide bar) on which one allowable level out of N levels (N = 5 in the figure) can be designated. The selection unit 13 displays this UI on the screen and accepts the setting from the worker U.

Returning to the description of FIG. 25, the selection unit 13 next sets the first recognized character string among the recognition results obtained by the recognition unit 12 (the recognition results stored in the recognition result holding unit 18) as the target character string w (step S1102), and calculates the transcription accuracy wp of the target character string w from the reliability of the target character string w (step S1103). Here, the selection unit 13 calculates the transcription accuracy wp of the target character string w using, for example, (Expression 1) described in the first embodiment.

Next, the selection unit 13 identifies the speaker wi of the target character string w based on the utterance section information D3 stored in the utterance section information holding unit 32 (step S1104). Here, the selection unit 13, for example, extracts from the utterance section information D3 the utterance section n for which the start time of the recognized character string lies between the start time of utterance section n and the start time of the next utterance section n+1, and identifies the speaker wi from the speaker ID of utterance section n.
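The lookup in step S1104 can be sketched with a binary search over utterance start times. The `UtteranceSection` type and `find_speaker` function are hypothetical; the sketch also assumes that the sections are sorted by start time, that a string starting exactly at a section boundary belongs to the section that starts there, and that the last section is open-ended, none of which the text specifies.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UtteranceSection:
    """Hypothetical row of utterance section information D3."""
    utterance_id: int
    start_time: float  # utterance start time, assumed to be in seconds
    speaker_id: str


def find_speaker(sections: List[UtteranceSection], string_start: float) -> Optional[str]:
    """Return the speaker ID of the section n whose start time is at or before
    string_start and whose successor n+1 starts after it.

    `sections` must be sorted by start_time. Returns None if the string
    starts before the first section.
    """
    starts = [s.start_time for s in sections]
    n = bisect_right(starts, string_start) - 1  # last section starting <= string_start
    if n < 0:
        return None
    return sections[n].speaker_id
```

Because only start times are stored in D3 as described, each section implicitly ends where the next one begins, which is why a single sorted list of start times is enough for the lookup.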

Next, the selection unit 13 compares the calculated transcription accuracy wp of the target character string w with the allowable value P(wi) of transcription accuracy for the identified speaker wi, and determines whether the transcription accuracy wp is equal to or greater than the allowable value P(wi) (step S1105). If the transcription accuracy wp is equal to or greater than the allowable value P(wi) (step S1105: Yes), the selection unit 13 selects the target character string w (step S1106). If the transcription accuracy wp is less than the allowable value P(wi) (step S1105: No), the selection unit 13 does not select the target character string w.

Next, the selection unit 13 determines whether the recognition results obtained by the recognition unit 12 contain a next recognized character string (step S1107). If there is a next recognized character string (step S1107: Yes), the selection unit 13 sets it as the target character string w (step S1108) and repeats the processing of steps S1103 to S1107. If there is no next recognized character string (step S1107: No), the selection unit 13 ends the processing.
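The selection loop of steps S1102 to S1108 can be sketched as follows. This is an illustration under several assumptions: `RecognizedString` and `select_by_speaker` are hypothetical names, each string is assumed to already carry the speaker ID resolved in step S1104, and because (Expression 1) is defined in the first embodiment and not reproduced here, the accuracy calculation is passed in as a caller-supplied function rather than guessed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RecognizedString:
    """Hypothetical recognized string with its reliability and resolved speaker."""
    text: str
    reliability: float
    speaker_id: str


def select_by_speaker(
    strings: List[RecognizedString],
    allowable: Dict[str, float],
    accuracy_of: Callable[[float], float],
) -> List[RecognizedString]:
    """Select strings whose transcription accuracy meets the speaker's allowable value.

    `allowable` maps speaker ID to P(i) (step S1101). `accuracy_of` stands in
    for (Expression 1), which is not reproduced in this section.
    """
    selected: List[RecognizedString] = []
    for w in strings:                      # S1102, S1107, S1108: iterate targets
        wp = accuracy_of(w.reliability)    # S1103: transcription accuracy of w
        wi = w.speaker_id                  # S1104: speaker of w (assumed resolved)
        if wp >= allowable[wi]:            # S1105: compare with P(wi)
            selected.append(w)             # S1106: select w
    return selected
```

Strings that are not returned correspond to the unselected recognized character strings, which the generation unit 14 still uses when building the transcript and for which the worker U later supplies corrections.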

As noted above, the selection unit 13 may instead select recognized character strings using a parameter relating to the transcription work amount for each speaker. The selection unit 13 may also select recognized character strings using, for each utterance section, either a parameter relating to transcription accuracy or a parameter relating to the transcription work amount.

<Summary>
As described above, the text generation device 100 according to the present embodiment selects recognized character strings recognized from the voice and generates a transcript, for each speaker or each utterance section, based on the parameters relating to the working conditions of the transcription work specified by the worker U (at least one of the transcription accuracy and the amount of work required for transcription) and on the reliability of the recognized character strings.

<Device>
FIG. 27 is a diagram illustrating a configuration example of the text generation device 100 according to the above embodiments. As shown in FIG. 27, the text generation device 100 according to the embodiments includes a CPU (Central Processing Unit) 101, a main storage device 102, and the like. The text generation device 100 also includes an auxiliary storage device 103, a communication IF (interface) 104, an external IF 105, a drive device 107, and the like. In the text generation device 100, these devices are connected to one another via a bus B. The text generation device 100 according to the embodiments thus corresponds to a general information processing device.

The CPU 101 is an arithmetic device that controls the entire device and implements its installed functions. The main storage device 102 is a storage device (memory) that holds programs, data, and the like in predetermined storage areas. The main storage device 102 is, for example, a ROM (Read Only Memory) or a RAM (Random Access Memory). The auxiliary storage device 103 is a storage device with a larger storage capacity than the main storage device 102. The auxiliary storage device 103 is a non-volatile storage device such as an HDD (Hard Disk Drive) or a memory card. The CPU 101 thus controls the entire device and implements its installed functions by, for example, reading programs and data from the auxiliary storage device 103 into the main storage device 102 and executing the processing.

The communication IF 104 is an interface that connects the device to a data transmission path N. This allows the text generation device 100 to perform data communication with other external devices (other information processing devices) connected via the data transmission path N. The external IF 105 is an interface for transmitting and receiving data between the device and an external device 106. Examples of the external device 106 include a display device (for example, a liquid crystal display) that displays various kinds of information such as processing results, and an input device (for example, a numeric keypad, a keyboard, or a touch panel) that accepts operation input. The drive device 107 is a control device that writes to or reads from a storage medium 108. The storage medium 108 is, for example, a flexible disk (FD), a CD (Compact Disk), or a DVD (Digital Versatile Disk).

The text generation function according to the above embodiments is implemented, for example, by the text generation device 100 executing a program so that the functional units described above operate in cooperation. In this case, the program is provided as a file in an installable or executable format, recorded on a storage medium readable by the device (computer) of the execution environment. For example, in the case of the text generation device 100, the program has a module configuration that includes the functional units described above, and the functional units are generated on the RAM of the main storage device 102 when the CPU 101 reads the program from the storage medium 108 and executes it. The method of providing the program is not limited to this. For example, the program may be stored on an external device connected to the Internet or the like and downloaded via the data transmission path N. Alternatively, the program may be provided by being incorporated in advance in the ROM of the main storage device 102, the HDD of the auxiliary storage device 103, or the like. Although an example in which the text generation function is implemented in software has been described here, the implementation is not limited to this. For example, some or all of the functional units of the text generation function may be implemented in hardware.

In the above embodiment, a configuration has been described in which the text generation device 100 includes some or all of the acquisition unit 11, the recognition unit 12, the selection unit 13, the generation unit 14, the setting unit 15, the search unit 16, the reproduction unit 17, the recognition result holding unit 18, the combining unit 21, the recognition combination result holding unit 22, the utterance section information generation unit 31, and the utterance section information holding unit 32, but the configuration is not limited to this. For example, the text generation device 100 may be connected via the communication IF 104 to an external device that has the functions of some of these functional units, and may exchange data with the connected external device so that the functional units operate in cooperation and provide the text generation function. The text generation device 100 according to the present embodiment can thereby also be applied to a cloud environment or the like.

Finally, although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
11 Acquisition unit
12 Recognition unit
13 Selection unit
14 Generation unit
15 Setting unit
16 Search unit
17 Reproduction unit
18 Recognition result holding unit
21 Combining unit
22 Recognition combination result holding unit
31 Utterance section information generation unit
32 Utterance section information holding unit
100 Text generation device

Claims

A text generation device comprising:
a recognition unit that recognizes acquired speech and obtains, for each unit of recognition, a recognized character string and a reliability of the recognized character string;
a selection unit that selects, based on a parameter of the amount of work required for transcription, at least one of the recognized character strings to be used in a transcript; and
a generation unit that generates the transcript using the selected recognized character string.

The text generation device according to claim 1, wherein
the selection unit selects the recognized character string based on a combination of the parameter of the amount of work required for the transcription and the reliability of the recognized character string.

The text generation device according to claim 1, wherein
the selection unit compares an accumulated work amount, accumulated based on the reliability of the recognized character strings, with an allowable value of the parameter, and selects the recognized character string when the accumulated work amount is equal to or less than the allowable value.
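
The comparison of an accumulated work amount with an allowable value can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the claimed implementation. The names `Recognized`, `select_within_allowance`, and `work_amount` are hypothetical, and visiting the recognized character strings in descending order of reliability is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Recognized:
    """One recognized character string and its reliability (hypothetical type)."""
    text: str
    reliability: float


def select_within_allowance(
    results: List[Recognized],
    allowable: float,
    work_amount: Callable[[Recognized], float],
) -> Set[int]:
    """Return the indices of the recognized character strings that are selected.

    Assumption: strings are visited from the most reliable to the least
    reliable, and the work amount of each visited string is added to a
    running total. A string is selected only while the running total is
    equal to or less than the allowable value.
    """
    order = sorted(
        range(len(results)),
        key=lambda i: results[i].reliability,
        reverse=True,
    )
    selected: Set[int] = set()
    accumulated = 0.0
    for i in order:
        accumulated += work_amount(results[i])
        if accumulated > allowable:
            break  # allowable value exceeded; remaining strings stay unselected
        selected.add(i)
    return selected
```

In this sketch the caller supplies `work_amount`, so the same loop works whether the parameter is a work time or a work cost.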

The text generation device according to claim 3, wherein
the selection unit uses a transcription work time as the parameter of the amount of work required for the transcription, and
calculates the transcription work time based on the number of characters in the recognized character string.

The text generation device according to claim 3, wherein
the recognition unit further obtains a start time and an end time of the recognized character string, and
the selection unit uses a transcription work time as the parameter of the amount of work required for the transcription, and
calculates the transcription work time based on the start time and the end time of the recognized character string.

The text generation device according to claim 3, wherein
the selection unit uses a transcription work cost as the parameter of the amount of work required for the transcription,
calculates a transcription work time based on the number of characters in the recognized character string, and calculates the transcription work cost based on the calculated transcription work time and a work cost per unit time.

The text generation device according to claim 3, wherein
the recognition unit further obtains a start time and an end time of the recognized character string, and
the selection unit uses a transcription work cost as the parameter of the amount of work required for the transcription,
calculates a transcription work time based on the start time and the end time of the recognized character string, and calculates the transcription work cost based on the calculated transcription work time and a work cost per unit time.
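
The work time and work cost calculations recited in the four claims above can be sketched as follows. This is a rough illustration only. The function names and the constants `seconds_per_char` and `time_factor` are hypothetical; the claims state only which inputs the calculations are based on, not the formulas.

```python
def work_time_from_chars(text: str, seconds_per_char: float = 0.5) -> float:
    """Estimate work time (seconds) from the number of characters.

    Assumption: a fixed time per character; the default value is illustrative.
    """
    return len(text) * seconds_per_char


def work_time_from_span(start_time: float, end_time: float,
                        time_factor: float = 1.0) -> float:
    """Estimate work time (seconds) from the start and end times of a string.

    Assumption: work time is proportional to the duration of the audio span.
    """
    return (end_time - start_time) * time_factor


def work_cost(work_time: float, cost_per_second: float) -> float:
    """Work cost as work time multiplied by the work cost per unit time."""
    return work_time * cost_per_second
```

Either time estimate can be passed to `work_cost`, which mirrors how the cost-based claims build on the time-based ones.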

The text generation device according to claim 1, wherein
the generation unit generates the transcript such that, among the recognized character strings not selected by the selection unit, the recognized character strings with high reliability, up to the N-th candidate (N being an integer of 1 or more), are displayed in a state in which an operator can select them.

The text generation device according to claim 1, further comprising
a setting unit that sets a character insertion position, corresponding to a start position of character input by an operator, at a position in the transcript corresponding to a recognized character string not selected by the selection unit, wherein
the setting unit sets the character insertion position based on a positional relationship, in the transcript, among a detected current character insertion position, a selected element corresponding to a recognized character string selected by the selection unit, and a non-selected element corresponding to a recognized character string not selected by the selection unit.

The text generation device according to claim 9, wherein
the setting unit determines whether the detected current character insertion position is within a selected element and, when the character insertion position is within a selected element, detects the non-selected element that is behind the character insertion position and closest to the character insertion position, and moves the character insertion position to the head position of the detected non-selected element.

The text generation device according to claim 9, wherein
the setting unit determines whether the detected current character insertion position is within a selected element and, when the character insertion position is not within a selected element, detects the selected element that is behind the character insertion position and closest to the character insertion position, detects the non-selected element that is behind the detected selected element and closest to it, and moves the character insertion position to the head position of the detected non-selected element.
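
The two setting-unit behaviors above (the insertion position inside a selected element, and the insertion position outside one) can be sketched together. This is a hedged illustration. The `Element` type, the half-open character ranges, and the reading of "behind" as "later in the transcript" are assumptions, and `next_insertion_position` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """A span of the transcript (hypothetical type).

    start is the index of the first character, end is one past the last.
    selected is True for a selected element, False for a non-selected one.
    """
    start: int
    end: int
    selected: bool


def next_insertion_position(elements: List[Element], cursor: int) -> Optional[int]:
    """Return the head position of the non-selected element to move to.

    elements are assumed to be ordered by position and non-overlapping.
    Returns None when no suitable element exists after the cursor.
    """
    current = next(
        (i for i, e in enumerate(elements) if e.start <= cursor < e.end), None
    )
    if current is None:
        return None
    if elements[current].selected:
        # Cursor is inside a selected element: look right after it.
        search_from = current + 1
    else:
        # Cursor is outside a selected element: first find the nearest
        # selected element after the cursor, then look after that one.
        nearest_selected = next(
            (i for i in range(current + 1, len(elements)) if elements[i].selected),
            None,
        )
        if nearest_selected is None:
            return None
        search_from = nearest_selected + 1
    target = next(
        (i for i in range(search_from, len(elements)) if not elements[i].selected),
        None,
    )
    return None if target is None else elements[target].start
```

Under these assumptions, the operator's insertion position always lands at the start of a span that still needs manual input.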

The text generation device according to claim 9, further comprising:
a search unit that, when character input by an operator is started at the character insertion position set by the setting unit, searches for a voice position corresponding to the input characters; and
a reproduction unit that reproduces the speech from the voice position found by the search unit, wherein
the search unit searches for the voice position based on a positional relationship, in the transcript, among the current character insertion position detected by the setting unit, the selected element corresponding to a recognized character string selected by the selection unit, and the non-selected element corresponding to a recognized character string not selected by the selection unit.

The text generation device according to claim 12, wherein
the search unit determines whether the detected current character insertion position is within a selected element and, when the character insertion position is within a selected element, sets the start time of the recognized character string corresponding to that selected element as the voice position.
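
The search-unit behavior in the claim above can be sketched for the one case it recites, namely an insertion position inside a selected element. The `TimedElement` type and the `voice_position` function are hypothetical. The sketch returns `None` for every other case, because the claim above does not state what the voice position is when the insertion position is outside a selected element.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedElement:
    """A transcript span with the start time of its recognized string (hypothetical)."""
    start: int          # index of the first character
    end: int            # one past the last character
    selected: bool      # True for a selected element
    start_time: float   # start time of the recognized character string, in seconds


def voice_position(elements: List[TimedElement], cursor: int) -> Optional[float]:
    """Return the playback start time for the current insertion position.

    Only the case recited in the claim is handled: when the cursor is inside
    a selected element, the start time of that element's recognized string
    is used. All other cases return None in this sketch.
    """
    for element in elements:
        if element.start <= cursor < element.end:
            return element.start_time if element.selected else None
    return None
```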

The text generation device according to claim 1, further comprising
a combining unit that combines the recognized character strings obtained by the recognition unit into sentence units or predetermined time units, and obtains the combined recognized character strings and a reliability of each combined recognized character string, wherein
the selection unit selects the recognized character string combined into the sentence unit or the time unit.

The text generation device according to claim 14, wherein
the selection unit selects the recognized character string combined into the sentence unit or the time unit, based on the parameter of the amount of work required for the transcription and the reliability of the combined recognized character string.

The text generation device according to claim 1, further comprising
a generation unit that generates, for the speech, utterance section information including information identifying each utterance, an utterance start time of each utterance, and information identifying the speaker of each utterance, wherein
the selection unit selects the recognized character string for each speaker or for each utterance.

The text generation device according to claim 16, wherein
the selection unit selects, for each speaker or for each utterance, the recognized character string based on the parameter of the amount of work required for the transcription and the reliability of the recognized character string.

A text generation method comprising:
a recognition step of recognizing acquired speech and obtaining, for each unit of recognition, a recognized character string and a reliability of the recognized character string;
a selection step of selecting, based on a parameter of the amount of work required for transcription, at least one of the recognized character strings to be used in a transcript; and
a generation step of generating the transcript using the selected recognized character string.

A text generation program for causing a computer to function as:
means for recognizing acquired speech and obtaining, for each unit of recognition, a recognized character string and a reliability of the recognized character string;
means for selecting, based on a parameter of the amount of work required for transcription, at least one of the recognized character strings to be used in a transcript; and
means for generating the transcript using the selected recognized character string.