JP6780849B2

JP6780849B2 - Information processing system, terminal device, server, information processing method and program

Info

Publication number: JP6780849B2
Application number: JP2016195846A
Authority: JP
Inventors: 清幸鈴木
Original assignee: Advanced Media Inc
Current assignee: Advanced Media Inc
Priority date: 2016-10-03
Filing date: 2016-10-03
Publication date: 2020-11-04
Anticipated expiration: 2036-10-03
Also published as: JP2018059989A

Description

The present invention relates to an information processing system, a terminal device, a server, an information processing method, and a program.

Conventionally, transcription has been performed in which a worker, while listening to the audio, corrects and edits a character string obtained by speech recognition of audio from a meeting or the like, and turns it into written text.
In such transcription, the work is sometimes shared among a plurality of workers, for example to shorten the total working time when the audio to be transcribed is long, or to keep the audio content confidential.
For example, Patent Document 1 discloses a technique in which audio data recording the speech and conversation of speakers is subdivided into a plurality of audio sections, each audio section is transcribed by one of a plurality of workers, and a server combines the character strings produced by the workers to construct text data that documents the entire conversation of the original audio data.

Japanese Unexamined Patent Publication No. 2008-107624

However, when transcription work is shared among a plurality of workers, it is not always easy to divide the entire data to be transcribed at appropriate positions and into appropriate sizes. If the data is divided inappropriately, the efficiency of the transcription work as a whole may decrease, for example because the working time varies from worker to worker. Furthermore, if the data is divided at inappropriate positions, a worker may be unable to judge the context properly, which may also reduce the efficiency of the transcription work.
In addition, a worker who transcribes divided data may be unable to clearly identify the boundaries of the data that he or she is responsible for, so the work may overlap with that of other workers.
Further, a proofreader who aggregates the results of the plurality of workers and performs the final proofreading bears a heavy burden in confirming whether those results have been aggregated properly.
Thus, with the conventional technique in which the target audio is shared among a plurality of workers for transcription, it has been difficult to perform the processing efficiently.

An object of the present invention is to improve the efficiency of processing in which target audio is transcribed by a plurality of workers who share the work.

To achieve the above object, an information processing system according to one aspect of the present invention is
an information processing system including a server that divides audio data to be transcribed and assigns it to a plurality of workers, and a worker terminal device used by a worker who transcribes the audio data, wherein
the server includes:
a transcription target data acquisition means for acquiring the audio data to be transcribed and speech recognition result data for that audio data; and
a data dividing means for dividing the audio data and the speech recognition result data acquired by the transcription target data acquisition means to generate divided data,
the worker terminal device includes:
a transcription interface display means for displaying a transcription interface for transcribing the divided data, the interface including an area representing the audio waveform of the audio data in the divided data and an area representing the character string indicated by the speech recognition result data for that audio data; and
a data correction receiving means for receiving corrections to the speech recognition result data displayed on the transcription interface, and
the transcription interface display means displays, in a distinguishable manner, a connection point indicating a boundary position of the divided data in the audio waveform of the audio data, and a connecting word corresponding to the audio at the connection point in the speech recognition result data.

According to the present invention, it is possible to improve the efficiency of processing in which target audio is transcribed by a plurality of workers who share the work.

A diagram showing the system configuration of the information processing system according to the present invention.
A schematic diagram showing the hardware configuration of the terminal device according to the present embodiment.
A schematic diagram showing the hardware configuration of the server.
A block diagram showing the main functional configuration realized in the server.
A schematic diagram showing an example of the audio data and the character string of the speech recognition result at a boundary of divided data.
A block diagram showing the main functional configuration realized in the terminal device used by a worker.
A schematic diagram showing an example display screen of the transcription interface.
A block diagram showing the main functional configuration realized in the terminal device used by the final reviewer.
A flowchart showing the flow of the transcription target data division processing executed by the server of the information processing system.
A flowchart showing the flow of the divided data transcription processing executed by the terminal device of the information processing system.
A flowchart showing the flow of the data aggregation processing executed by the server of the information processing system.
A flowchart showing the flow of the aggregated data review processing executed by the terminal device of the information processing system.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[Configuration]
[System configuration]
FIG. 1 is a diagram showing the system configuration of the information processing system 1 according to the present invention.
As shown in FIG. 1, the information processing system 1 according to the present invention includes a plurality of terminal devices 10 and a server 20, which can communicate with one another via a network 30 such as the Internet or a LAN (Local Area Network). In the present embodiment, the plurality of terminal devices 10 include terminal devices 10A, used by workers who transcribe divided data, and a terminal device 10B, used by a final reviewer who performs the final review of the transcription result for the audio data to be transcribed. Hereinafter, when the terminal device 10A and the terminal device 10B need not be distinguished, they are simply referred to as the terminal device 10.

In the information processing system 1 of the present embodiment, the audio data that is the source of the transcription and the speech recognition result for that audio data are divided into a plurality of parts, and the divided audio data and speech recognition results are transcribed by a plurality of workers who share the work. In doing so, the information processing system 1 divides the audio data based on conditions such as the reliability of the speech recognition, and thereby adjusts the load on each worker. In the audio data and speech recognition result distributed to each worker, the boundary with the adjacent audio data and speech recognition result is indicated in a distinguishable manner as a boundary position (the connecting word described later), in units of an element such as the word (or morpheme) corresponding to the time of the boundary, and the sentence containing that element is treated as the boundary character string between the divided audio data (the connecting sentence described later). As a result, a worker who transcribes divided audio data can easily identify the boundaries of the data he or she is responsible for. Further, in the information processing system 1, the results of the shared transcription work are aggregated, and the final reviewer confirms whether they have been aggregated properly. Because the boundary position (connecting word) or the character string containing the boundary element (connecting sentence) is indicated in a distinguishable manner in the results of the shared transcription work, the final reviewer can easily identify the boundaries between parts transcribed by different workers and can check those parts with particular care.
Thus, the information processing system 1 according to the present embodiment can improve the efficiency of processing in which target audio is transcribed by a plurality of workers who share the work.

[Hardware configuration]
Next, the hardware configuration of each device constituting the information processing system 1 will be described.
FIG. 2 is a schematic diagram showing the hardware configuration of the terminal device 10 according to the present embodiment.
As shown in FIG. 2, the terminal device 10 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a bus 14, an input unit 15, an output unit 16, a storage unit 17, a communication unit 18, and a drive 19.

The CPU 11 executes various processes according to a program recorded in the ROM 12 or a program loaded from the storage unit 17 into the RAM 13.
The RAM 13 also stores, as appropriate, data and the like that the CPU 11 needs in order to execute the various processes.

The CPU 11, the ROM 12, and the RAM 13 are connected to one another via the bus 14. The input unit 15, the output unit 16, the storage unit 17, the communication unit 18, and the drive 19 are also connected to the bus 14.

The input unit 15 includes a keyboard with various buttons, a microphone for audio input, and the like, and inputs various information in response to instruction operations made with the buttons or by voice.
The output unit 16 includes a display, earphones, and the like, and outputs images and audio.
The storage unit 17 includes a hard disk, a DRAM (Dynamic Random Access Memory), or the like, and stores various data managed by the terminal device 10.
The communication unit 18 controls communication with other devices via the network.

A removable medium 31, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted in the drive 19 as appropriate. Based on data read from the removable medium 31 by the drive 19, a predetermined program is installed in the storage unit 17 as needed.

FIG. 3 is a schematic diagram showing the hardware configuration of the server 20.
The server 20 is an information processing device such as a server computer.
As shown in FIG. 3, the server 20 includes a CPU 211, a ROM 212, a RAM 213, a bus 214, an input unit 215, an output unit 216, a storage unit 217, a communication unit 218, and a drive 219.

The CPU 211 executes various processes (processes for realizing the functions of the server 20) according to a program recorded in the ROM 212 or a program loaded from the storage unit 217 into the RAM 213.
The RAM 213 also stores, as appropriate, data and the like that the CPU 211 needs in order to execute the various processes.

The CPU 211, the ROM 212, and the RAM 213 are connected to one another via the bus 214. The input unit 215, the output unit 216, the storage unit 217, the communication unit 218, and the drive 219 are also connected to the bus 214.

The input unit 215 includes various buttons and the like, and inputs various information in response to instruction operations.
The output unit 216 includes a display, a speaker, and the like, and outputs images and audio.
The storage unit 217 includes a hard disk, a DRAM, or the like, and stores various data managed by each server.
The communication unit 218 controls communication with other devices via the network.

A removable medium 231, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted in the drive 219 as appropriate. Based on data read from the removable medium 231 by the drive 219, a predetermined program is installed in the storage unit 217 as needed.

[Functional configuration]
Next, the main functional configuration realized in the information processing system 1 will be described.
[Functional configuration of server 20]
FIG. 4 is a block diagram showing the main functional configuration realized in the server 20.
As shown in FIG. 4, an audio data acquisition unit 251, a speech recognition result acquisition unit 252, a data division unit 253, a divided data transmission unit 254, a corrected data reception unit 255, a data aggregation unit 256, and an aggregated data transmission unit 257 function in the CPU 211 of the server 20. In addition, a transcription-related data storage unit 271 is formed in the storage unit 217.

The transcription-related data storage unit 271 stores, in association with one another, various data related to the transcription work: the audio data to be transcribed; the data obtained by speech recognition of that audio data; the divided data (described later) of the audio data and its speech recognition result; information identifying the terminal device 10 to which each piece of divided data is sent; the corrected data (described later) sent from the terminal devices 10; and the transcription result data confirmed by the final reviewer.

The audio data acquisition unit 251 acquires the audio data to be transcribed. For example, the audio data acquisition unit 251 acquires the audio data by receiving it from another device via the network 30 or by having it input via the removable medium 231. The audio data acquisition unit 251 also stores the acquired audio data in the transcription-related data storage unit 271.

The speech recognition result acquisition unit 252 acquires data consisting of the character string obtained by speech recognition of the audio data to be transcribed. The speech recognition result acquisition unit 252 can either request speech recognition from an externally installed speech recognition server via the network 30 and acquire the result, or, if the server 20 itself is provided with a speech recognition function, acquire the result produced by that function. The speech recognition result acquisition unit 252 then stores the acquired speech recognition result in the transcription-related data storage unit 271.

The data division unit 253 divides the audio data to be transcribed, acquired by the audio data acquisition unit 251, and the speech recognition result, acquired by the speech recognition result acquisition unit 252, to generate data to be transcribed by a plurality of workers (hereinafter referred to as "divided data"). The data division unit 253 divides the audio data and its speech recognition result so that the workload is equal across workers. In the present embodiment, the division is based on the reliability of the speech recognition result obtained when the audio data was recognized. For example, a part where the reliability of the speech recognition result is low is considered to require more transcription effort, so divided data covering a shorter span of audio is generated for such a part than for a part where the reliability is high.

The specific procedure by which the data division unit 253 generates divided data is described below.
First, the data division unit 253 divides the total duration of the audio data to be transcribed by the number of workers who will share the transcription, and sets the result as the initial value DT0 of the duration of a piece of divided data (the division time). Alternatively, a preset initial value of the division time DT0 (for example, 5 minutes) may be used.
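The choice of DT0 described above can be sketched as follows. This is a minimal illustration only: the function name `initial_division_time` and its parameters are hypothetical and do not appear in the source, and durations are assumed to be expressed in seconds.

```python
from typing import Optional


def initial_division_time(total_seconds: float,
                          num_workers: int,
                          preset_seconds: Optional[float] = None) -> float:
    """Return an initial division time DT0 in seconds.

    Hypothetical helper: if a preset value is supplied it is used as-is,
    otherwise the total audio duration is split evenly across workers.
    """
    if preset_seconds is not None:
        return preset_seconds
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return total_seconds / num_workers
```

For instance, under these assumptions a one-hour recording shared by four workers would start from DT0 = 900 seconds, while passing `preset_seconds=300` would correspond to the 5-minute preset mentioned in the text.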

Next, the data division unit 253 takes the divided data extending from the beginning of the audio data to be transcribed up to the initial division time DT0, and calculates the reliability of the speech recognition result for this divided data. For example, the data division unit 253 calculates the reliability by dividing the sum of the word-level reliabilities in the speech recognition result for the character string contained in the divided data by the number of words.
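The averaging example given above might look like the following sketch. The `RecognizedWord` type and the function name are hypothetical, and returning 0.0 for a segment with no words is an assumption of this sketch; the source does not say how an empty segment is scored.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RecognizedWord:
    """One recognized word with its word-level reliability (hypothetical type)."""
    text: str
    confidence: float


def segment_reliability(words: Sequence[RecognizedWord]) -> float:
    """Mean word-level reliability: sum of confidences divided by word count."""
    if not words:
        # Assumption of this sketch: an empty segment is scored as 0.0.
        return 0.0
    return sum(w.confidence for w in words) / len(words)
```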

Next, the data division unit 253 adjusts the initial division time DT0 based on the calculated reliability to obtain the division time DT of the divided data. For example, using thresholds Tth1 and Tth2 (Tth1 < Tth2) set for the reliability of the speech recognition result, the data division unit 253 calculates the division time DT as follows.

(1) If Tth1 < reliability of the speech recognition result < Tth2, the division time is kept at DT0 (DT = DT0).
(2) If Tth2 < reliability of the speech recognition result, the division time is multiplied by 1.5 (DT = DT0 × 1.5).
(3) If reliability of the speech recognition result < Tth1, the division time is divided by 1.5 (DT = DT0 / 1.5).

The specific values of the thresholds Tth1 and Tth2 and the coefficient applied to DT0 (1.5 or 1/1.5) can be determined as appropriate based on empirical or experimental values. The thresholds and coefficients may also be set in a larger number of steps.
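Rules (1) to (3) can be sketched as a small function. The function and parameter names are hypothetical. The source states the rules with strict inequalities and leaves the case where the reliability equals a threshold unspecified; this sketch assumes DT0 is kept in that case.

```python
def adjust_division_time(dt0: float,
                         reliability: float,
                         tth1: float,
                         tth2: float,
                         factor: float = 1.5) -> float:
    """Return the division time DT derived from DT0 and the segment reliability."""
    if not tth1 < tth2:
        raise ValueError("tth1 must be smaller than tth2")
    if reliability > tth2:
        # Rule (2): reliable recognition, so a longer segment is assigned.
        return dt0 * factor
    if reliability < tth1:
        # Rule (3): unreliable recognition, so a shorter segment is assigned.
        return dt0 / factor
    # Rule (1), plus the equality cases (an assumption of this sketch).
    return dt0
```

With DT0 = 300 seconds and illustrative thresholds of 0.6 and 0.9, a segment scored above 0.9 would be extended to 450 seconds and one scored below 0.6 shortened to 200 seconds. The threshold values here are examples only; the source leaves them to empirical tuning.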

Next, the data division unit 253 sets information indicating the boundary of the divided data.
Specifically, for the audio data in the divided data, the data division unit 253 identifies the word (or morpheme) in the speech recognition result that corresponds to the audio at the end position of the divided data (hereinafter referred to as the "connection point"). This word (or morpheme) is hereinafter referred to as the "temporary connecting word". The data division unit 253 also identifies the clause of the speech recognition result that contains the connecting word (or the clause of the speech recognition result corresponding to the connection point). This clause is hereinafter referred to as the "temporary connecting clause". Further, the data division unit 253 identifies the phrase or sentence of the speech recognition result that contains the connecting word (or the phrase or sentence of the speech recognition result corresponding to the connection point). This phrase or sentence is hereinafter referred to as the "temporary connecting sentence". Here, when the temporary connecting word is contained in a complex sentence, the sentence is split into simple sentences to obtain the connecting sentence.

In the above procedure, when no word is present at the connection point (that is, when the connection point falls on a boundary between words or in a silent section), the next word closest to the connection point is taken as the temporary connecting word. However, if that word does not occur within a time Δt1 of the connection point, there is no temporary connecting word. The time Δt1 is set empirically based on the maximum pause for breath in human speech (for example, about several seconds).
The data division unit 253 then repeats the above generation procedure for the next divided data following the division time DT, and generates divided data up to the end of the audio data to be transcribed.
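The selection of the temporary connecting word, including the Δt1 fallback, could be sketched as below. `TimedWord`, the function name, and the parameter `max_gap` (standing in for Δt1) are hypothetical, and word-level timestamps in seconds are assumed to be available from the speech recognizer.

```python
from typing import NamedTuple, Optional, Sequence


class TimedWord(NamedTuple):
    """A recognized word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def find_temporary_connecting_word(words: Sequence[TimedWord],
                                   connection_point: float,
                                   max_gap: float) -> Optional[TimedWord]:
    """Return the word at the connection point, else the next word within max_gap."""
    # Case 1: a word is being spoken at the connection point.
    for word in words:
        if word.start <= connection_point < word.end:
            return word
    # Case 2: the point falls between words or in silence; take the next word,
    # but only if it starts within max_gap (Δt1 in the text) of the point.
    following = [w for w in words if w.start > connection_point]
    if not following:
        return None
    nearest = min(following, key=lambda w: w.start)
    if nearest.start - connection_point <= max_gap:
        return nearest
    return None
```

Returning `None` corresponds to the case in which the text says there is no temporary connecting word.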

FIG. 5 is a schematic diagram showing an example of the audio data and the character string of the speech recognition result at a boundary of divided data.
In the example shown in FIG. 5, the character string 「気持ち」 ("feeling") is identified as the speech recognition result corresponding to the audio at the connection point of the divided data (that is, the temporary connecting word). The character string 「気持ちと」 is identified as the clause of the speech recognition result containing the temporary connecting word (that is, the temporary connecting clause). Further, the character string 「私の気持ちと同じです」 ("It is the same as my feeling") is identified as the sentence of the speech recognition result containing the temporary connecting word (that is, the temporary connecting sentence).

In accordance with the temporary connecting word (or temporary connecting clause or temporary connecting sentence) identified in this way, the data division unit 253 generates the divided data by adding audio data for a time Δt2s before the connection point at the start of the divided data and audio data for a time Δt2e after the connection point at the end of the divided data (see FIG. 7, described later). The time Δt2s can be calculated mechanically as the length from the beginning of the temporary connecting word (or temporary connecting clause or temporary connecting sentence) identified for the start connection point to that connection point. Likewise, the time Δt2e can be calculated mechanically as the length from the end connection point to the end of the temporary connecting word (or temporary connecting clause or temporary connecting sentence) identified for that connection point.
Further, when no word is present at the connection point (that is, when the connection point falls on a boundary between words or in a silent section), a time Δt2max can be set empirically, based on the typical length of the information indicating the boundary of the divided data (the temporary connecting word, temporary connecting clause, or temporary connecting sentence), so that this information is included, and the audio data and its speech recognition result for that time can be added before the start connection point or after the end connection point. Audio data for an additional extension time α may also be included on top of the times Δt2s, Δt2e, and Δt2max set in this way.
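The padding arithmetic can be illustrated with a short sketch. All names are hypothetical: `start_unit` and `end_unit` stand for the (start, end) time span of the temporary connecting word, clause, or sentence at each connection point, `dt2_max` stands for Δt2max, and `alpha` for the extension time α. Clamping the result to the bounds of the recording is an assumption of this sketch, not something the source states.

```python
from typing import Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def padded_range(start_point: float,
                 end_point: float,
                 start_unit: Optional[Span],
                 end_unit: Optional[Span],
                 dt2_max: float,
                 total_duration: float,
                 alpha: float = 0.0) -> Span:
    """Return the audio range of a divided-data item after boundary padding."""
    # Δt2s: from the beginning of the connecting unit to the start connection point.
    dt2s = start_point - start_unit[0] if start_unit is not None else dt2_max
    # Δt2e: from the end connection point to the end of the connecting unit.
    dt2e = end_unit[1] - end_point if end_unit is not None else dt2_max
    # Assumption: the padded range is clamped to the recording itself.
    padded_start = max(0.0, start_point - dt2s - alpha)
    padded_end = min(total_duration, end_point + dt2e + alpha)
    return (padded_start, padded_end)
```

When a connecting unit is known, the padded range simply extends to cover that unit on each side; when it is not, the empirically chosen Δt2max is used instead.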

Returning to FIG. 4, the divided data transmission unit 254 transmits each piece of divided data produced by the data division unit 253 to the plurality of terminal devices 10.
The divided data transmission unit 254 also stores information identifying the terminal device 10 to which each piece of divided data was sent in the transcription-related data storage unit 271.

The corrected data reception unit 255 receives the divided data on which transcription work has been completed (hereinafter referred to as "corrected data") from each terminal device 10. The corrected data reception unit 255 then stores the received corrected data in the transcription-related data storage unit 271.

The data aggregation unit 256 aggregates the pieces of corrected data received by the corrected data receiving unit 255 in the time order of the voice data and generates aggregated data, which is the collection of the corrected data.
The aggregated data transmission unit 257 transmits the aggregated data generated by the data aggregation unit 256 to the terminal device 10B used by the final reviewer.

[Functional configuration of terminal device 10A]
Next, the functional configuration of the terminal device 10A will be described.
FIG. 6 is a block diagram showing the main functional configuration realized in the terminal device 10A.
As shown in FIG. 6, a divided data receiving unit 51, a transcription interface display unit 52, a divided data correction receiving unit 53, and a corrected data transmission unit 54 function in the CPU 11 of the terminal device 10A. In addition, a divided data storage unit 71 is formed in the storage unit 17.
The divided data storage unit 71 stores the divided data transmitted from the server 20.

The divided data receiving unit 51 receives the divided data transmitted from the server 20. The divided data received here is the portion of the voice data to be transcribed, together with the corresponding portion of its voice recognition result, that the server 20 has assigned to the worker of the terminal device 10A. The divided data receiving unit 51 then stores the received divided data in the divided data storage unit 71.
The transcription interface display unit 52 displays a user interface for transcribing the divided data received by the divided data receiving unit 51 (hereinafter referred to as the "transcription interface").

FIG. 7 is a schematic diagram showing an example display screen of the transcription interface.
As shown in FIG. 7, the transcription interface displays a voice waveform area V, which shows the voice waveform of the divided data, and a character string area C, which shows the character string corresponding to that waveform.
The voice waveform area V shows the time-series waveform of the voice data in the divided data, from the time Δt2s before the start connection point to the time Δt2e after the end connection point. In the voice waveform area V, the positions of the start and end connection points are marked with an identification indicator such as a dividing line. During playback of the voice data, the worker is notified of the position of this indicator by an alarm sound or the like. FIG. 7 shows an example in which the time Δt2s before the start connection point and the time Δt2e after the end connection point each include the extension time α.

The character string area C shows the character string that is the voice recognition result of the voice data in the divided data. As with the voice waveform area V, it shows the recognized character string from the time Δt2s before the start connection point to the time Δt2e after the end connection point. The character string preceding the connection sentence that contains the start connection point (the character string corresponding to the extension time α) is struck through to indicate that it is not part of the work assigned to the worker of the terminal device 10A.
In the character string area C, the temporary connection word "長旅" ("long journey") corresponding to the start connection point, the temporary connection clause "長旅にも" containing this temporary connection word, and the temporary connection sentence "ブラジルからの長旅にも関わらず、" ("Despite the long journey from Brazil,") containing this temporary connection word are each displayed distinguishably. For example, the temporary connection sentence can be displayed in blue, the temporary connection clause within it in green, and the temporary connection word within that clause in red.

Further, in the character string area C, the temporary connection word "メディカルチェック" ("medical check") corresponding to the end connection point, the temporary connection clause "メディカルチェックへと" containing this temporary connection word, and the temporary connection sentence "クラブ関係者の車でメディカルチェックへと向かいました。" (roughly, "headed to the medical check in a club official's car.") containing this temporary connection word are each displayed distinguishably. The temporary connection sentence containing the end temporary connection word, and the character string after it (the character string corresponding to the extension time α), are struck through to indicate that they are not part of the work assigned to the worker of the terminal device 10A.
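The nested marking of sentence, clause, and word described for the character string area C can be sketched as follows. The use of HTML `span` and `s` tags and the function name `mark_connection_units` are assumptions made purely for illustration; the specification only says that the units are displayed distinguishably, for example by color, and that unassigned text is struck through.

```python
def mark_connection_units(sentence: str, clause: str, word: str,
                          struck: bool = False) -> str:
    """Wrap the word (red) inside the clause (green) inside the sentence (blue).

    Only the first occurrence of each unit is marked, which is a simplification.
    """
    if clause not in sentence or word not in clause:
        raise ValueError("expected word within clause within sentence")
    red = f'<span style="color:red">{word}</span>'
    green = f'<span style="color:green">{clause.replace(word, red, 1)}</span>'
    blue = f'<span style="color:blue">{sentence.replace(clause, green, 1)}</span>'
    # The end connection sentence is not assigned to this worker, so it is struck through.
    return f"<s>{blue}</s>" if struck else blue
```

Calling it with the start example from FIG. 7 (`"ブラジルからの長旅にも関わらず、"`, `"長旅にも"`, `"長旅"`) yields a blue sentence containing a green clause containing a red word; passing `struck=True` would correspond to the end connection sentence.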

Returning to FIG. 6, the divided data correction receiving unit 53 accepts the worker's corrections to the divided data on the transcription interface screen shown in FIG. 7. That is, the worker using the terminal device 10A plays back the voice data while viewing the transcription interface screen shown in FIG. 7 and, via the divided data correction receiving unit 53, successively corrects the parts of the recognized character string where the voice recognition result is inappropriate. If no part of the voice recognition result is inappropriate, the divided data becomes the corrected data as is.
The divided data correction receiving unit 53 also accepts, on the same screen, the worker's corrections to the temporary connection word, temporary connection clause, and temporary connection sentence. That is, the worker checks whether each of these has been correctly recognized and set in an appropriate unit, and corrects any that are inappropriate via the divided data correction receiving unit 53. The temporary connection word, temporary connection clause, and temporary connection sentence that the worker has checked and corrected as necessary are referred to as the connection word, connection clause, and connection sentence, respectively.

In the present embodiment, the worker of the terminal device 10A corrects the divided data according to the following policy.
(1) The worker of the terminal device 10A transcribes the connection sentence at the start.
(2) The worker of the terminal device 10A does not transcribe the connection sentence at the end. That is, the start and end connection sentences are each included in adjacent pieces of divided data and are therefore distributed to the workers assigned those pieces, but each worker transcribes only the start connection sentence. This prevents the same portion from being transcribed by more than one worker.
(3) If no connection sentence is displayed in the voice recognition result, the worker identifies the connection sentence from the utterance at the connection point, transcribes the connection sentence corresponding to the start connection point, and does not transcribe the connection sentence corresponding to the end connection point.
(4) If there is no connection word at the start (no voice near the start connection point), transcription begins from the first voice after the connection point.
(5) If there is no connection word at the end (no voice near the end connection point), transcription continues up to the last voice before the connection point.
When the worker follows this policy, the voice transcribed by the worker of the terminal device 10A runs from the voice corresponding to the start connection sentence to immediately before the voice corresponding to the end connection sentence.
In this policy, the boundary unit of the divided data may be the connection clause or the connection word (morpheme) instead of the connection sentence.
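As a rough illustration of policy items (1), (2), (4), and (5), the following Python sketch selects which sentences a worker is responsible for, given the two connection points. The `Sentence` type, the function name, and the assumption that sentences arrive sorted by time with known start and end times are all hypothetical; item (3), in which the worker identifies the sentence by listening, has no counterpart here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    start: float  # seconds
    end: float    # seconds
    text: str


def assigned_sentences(sentences: List[Sentence],
                       start_point: float,
                       end_point: float) -> List[Sentence]:
    """Sentences from the start connection sentence up to just before the end one.

    `sentences` is assumed to be sorted by start time and non-overlapping.
    """
    assigned = []
    for s in sentences:
        # (1)/(4): skip everything that ends at or before the start connection point;
        # the first remaining sentence either contains the point or follows a silence.
        if s.end <= start_point:
            continue
        # (2)/(5): stop at the sentence containing the end connection point
        # (or at the first sentence after it when the point lies in silence).
        if s.end > end_point:
            break
        assigned.append(s)
    return assigned
```

Because the end connection point of one piece is the start connection point of the next, two adjacent workers selected this way cover every sentence exactly once.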

The corrected data transmission unit 54 transmits the divided data on which transcription work has been completed (the corrected data) to the server 20.

[Functional configuration of terminal device 10B]
Next, the functional configuration of the terminal device 10B will be described.
FIG. 8 is a block diagram showing the main functional configuration realized in the terminal device 10B.
As shown in FIG. 8, an aggregated data receiving unit 151, a review interface display unit 152, and an aggregated data correction receiving unit 153 function in the CPU 11 of the terminal device 10B. In addition, a transcription data storage unit 171 is formed in the storage unit 17.
The transcription data storage unit 171 stores the aggregated data transmitted from the server 20 and the transcription data, which is the result of the final reviewer reviewing the aggregated data and confirming it as final.
The aggregated data receiving unit 151 receives the aggregated data transmitted from the server 20. The aggregated data receiving unit 151 then stores the received aggregated data in the transcription data storage unit 171.

The review interface display unit 152 displays a user interface for reviewing the aggregated data received by the aggregated data receiving unit 151 (hereinafter referred to as the "review interface"). The review interface displays the voice waveform and the corrected recognition character string of each piece of corrected data in the aggregated data, arranged in the time-series order of the voice data. For example, the waveform and character string of the first piece of corrected data in the voice data to be transcribed can be displayed horizontally as the first row, in the same way as in the transcription interface shown in FIG. 7, with subsequent pieces of corrected data displayed likewise in the second and later rows. In the review interface, the character string displayed for each piece runs from the beginning of the connection sentence corresponding to the start connection point to immediately before the connection sentence corresponding to the end connection point.
As in the transcription interface shown in FIG. 7, the review interface marks the positions of the start and end connection points in the voice data of each piece of corrected data with an identification indicator such as a dividing line. During playback of the voice data, the final reviewer is notified of the position of this indicator by an alarm sound or the like. Also as in the transcription interface shown in FIG. 7, the connection word corresponding to the start connection point, the connection clause containing this connection word, and the connection sentence containing this connection word are each displayed distinguishably. For example, the connection sentence can be displayed in blue, the connection clause within it in green, and the connection word within that clause in red.

As another display form of the review interface, the voice waveforms and the corrected recognition character strings of the pieces of corrected data in the aggregated data may each be joined in the time-series order of the voice data to produce one voice waveform and one character string, which are then displayed horizontally in the same way as in the transcription interface shown in FIG. 7. In this case, the connection points corresponding to the boundaries between pieces of corrected data can be displayed in the voice waveform area V, and the connection words (or connection clauses or connection sentences) corresponding to those boundaries can be displayed in the character string area C. These connection words (or connection clauses or connection sentences) can also be displayed distinguishably, for example in predetermined colors as described above.

The aggregated data correction receiving unit 153 accepts the final reviewer's corrections to the aggregated data on the review interface screen. That is, the final reviewer using the terminal device 10B plays back the voice data while viewing the review interface screen and, via the aggregated data correction receiving unit 153, successively corrects the parts of the character strings transcribed by the workers where the transcription of the voice data is inappropriate. Because the review interface displays the connection points, connection words, connection clauses, connection sentences, and the like distinguishably, the final reviewer can refer to this identifying information and review the boundaries between pieces of corrected data with particular care.

The aggregated data correction receiving unit 153 stores the data resulting from this review of the aggregated data in the transcription data storage unit 171 as the final transcription data.

[Operation]
Next, the operation of the information processing system 1 will be described.
[Transcription target data division process]
FIG. 9 is a flowchart showing the flow of the transcription target data division process executed by the server 20 of the information processing system 1.
The transcription target data division process is the process by which the server 20 generates divided data so that the voice data to be transcribed can be shared among a plurality of workers.
The transcription target data division process starts when the terminal device 10B or another device requests the server 20 to transcribe voice data.

When the transcription target data division process starts, in step S1 the voice data acquisition unit 251 acquires the voice data to be transcribed.
In step S2, the voice recognition result acquisition unit 252 acquires data consisting of the character string obtained by performing voice recognition processing on the voice data to be transcribed.
In step S3, the data division unit 253 divides the voice data acquired by the voice data acquisition unit 251 and the voice recognition result acquired by the voice recognition result acquisition unit 252, generating divided data to be transcribed by a plurality of workers. Here, the data division unit 253 divides the voice data and its voice recognition result based on the reliability of the voice recognition result obtained when the voice data was recognized.

In step S4, the divided data transmission unit 254 transmits each piece of divided data produced by the data division unit 253 to the plurality of terminal devices 10.
After step S4, the transcription target data division process ends.

[Divided data transcription process]
FIG. 10 is a flowchart showing the flow of the divided data transcription process executed by the terminal device 10A of the information processing system 1.
The divided data transcription process is the process by which the user (worker) of the terminal device 10A transcribes the divided data.
The divided data transcription process starts when the server 20 requests the terminal device 10A to perform transcription work.

When the divided data transcription process starts, in step S11 the divided data receiving unit 51 receives the divided data transmitted from the server 20.
In step S12, the transcription interface display unit 52 displays the transcription interface for transcribing the divided data received by the divided data receiving unit 51.

In step S13, the divided data correction receiving unit 53 accepts the worker's corrections to the divided data on the transcription interface screen. At this time, the divided data correction receiving unit 53 also accepts the worker's corrections to the temporary connection word, temporary connection clause, and temporary connection sentence.
In step S14, the corrected data transmission unit 54 transmits the divided data on which transcription work has been completed (the corrected data) to the server 20.
After step S14, the divided data transcription process ends.

[Data aggregation process]
FIG. 11 is a flowchart showing the flow of the data aggregation process executed by the server 20 of the information processing system 1.
The data aggregation process combines the results of the transcription work by the plurality of workers (the corrected data) into a single piece of data.
The data aggregation process starts when corrected data is transmitted from a terminal device 10A to the server 20.

When the data aggregation process starts, in step S21 the corrected data receiving unit 255 receives the divided data on which transcription work has been completed (the corrected data) from each terminal device 10.

In step S22, the data aggregation unit 256 aggregates the pieces of corrected data received by the corrected data receiving unit 255 in the time order of the voice data and generates aggregated data, which is the collection of the corrected data.
In step S23, the aggregated data transmission unit 257 transmits the aggregated data generated by the data aggregation unit 256 to the terminal device 10B used by the final reviewer.
After step S23, the data aggregation process ends.
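Step S22 amounts to ordering the corrected pieces by their position in the original voice data. A minimal Python sketch follows; the `CorrectedData` fields and the function names are hypothetical, and the real aggregated data also carries voice data and connection-point information that are omitted here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CorrectedData:
    start: float      # start time of the piece in the original voice data (seconds)
    terminal_id: str  # hypothetical identifier of the terminal that returned it
    text: str         # transcribed text after the worker's corrections


def aggregate(pieces: Iterable[CorrectedData]) -> List[CorrectedData]:
    """Order corrected pieces by their position in the original voice data."""
    return sorted(pieces, key=lambda p: p.start)


def joined_text(ordered: List[CorrectedData]) -> str:
    """Concatenate the ordered texts into one character string."""
    return "".join(p.text for p in ordered)
```

Pieces may arrive from the terminals in any order; sorting on the start time restores the order of the recording regardless of which worker finishes first.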

[Aggregated data review process]
FIG. 12 is a flowchart showing the flow of the aggregated data review process executed by the terminal device 10B of the information processing system 1.
The aggregated data review process is the process by which the user (final reviewer) of the terminal device 10B reviews the aggregated data.
The aggregated data review process starts when the server 20 requests the terminal device 10B to perform review work.

When the aggregated data review process starts, in step S31 the aggregated data receiving unit 151 receives the aggregated data transmitted from the server 20.
In step S32, the review interface display unit 152 displays the review interface for reviewing the aggregated data received by the aggregated data receiving unit 151.
In step S33, the aggregated data correction receiving unit 153 accepts the final reviewer's corrections to the aggregated data on the review interface screen.
In step S34, the aggregated data correction receiving unit 153 stores the data resulting from this review of the aggregated data in the transcription data storage unit 171 as the final transcription data.
After step S34, the aggregated data review process ends.

[Effects]
As described above, in the information processing system 1 according to the present embodiment, the connection word or the like corresponding to the boundary time is displayed distinguishably at the boundary with each adjacent piece of divided data.
As a result, a worker transcribing divided data can easily grasp the boundaries of the data they are responsible for and can set the boundary of the divided data at a more appropriate position in units of connection words or the like.
In addition, the boundary positions (connection words and the like) are displayed distinguishably in the aggregated data that the final reviewer reviews.
The final reviewer can therefore easily grasp the boundaries between portions transcribed by different workers and check those portions with particular care.
Thus, the information processing system 1 according to the present embodiment can improve the efficiency of the process in which a plurality of workers share the transcription of the target voice.

Specifically, performing transcription with the information processing system 1 according to the present embodiment is advantageous in the following respects.
(1) When dividing the voice to be transcribed, it is desirable to divide it at the end of the voice corresponding to a sentence end in the voice recognition result. However, when misrecognition has occurred, for example, the actual sentence ends of the utterance cannot be determined accurately, and the voice may be divided at an inappropriate position within a sentence or in the middle of a word.
In contrast, in the information processing system 1 the connection points in the voice data are indicated explicitly. Even when the connection word or the like corresponding to a connection point has been misrecognized, the connection word can be identified through the processing described above, and transcription can be performed properly up to the predetermined boundary (connection sentence, connection clause, or connection word (morpheme)).

(2) In noisy voice and the like, silent sections and breathing pauses are difficult to detect, so it is not easy to segment the utterance accurately by silent-section detection alone. In this case, the voice may be divided at an inappropriate position within a sentence or in the middle of a word, and the surrounding context becomes unclear, which causes transcription errors.
In contrast, the information processing system 1 sets the boundaries of the divided data in units of the connection word, or a sentence or the like containing the connection word, that belongs to the voice recognition result of the voice data spanning from the time Δt2s before the start connection point to the time Δt2e after the end connection point. The voice to be transcribed can therefore be divided at positions where the worker can easily grasp the context, which improves transcription accuracy.

(3) If the voice to be transcribed is divided so that each piece overlaps its neighbors by a fixed time in order to make the context clearer, efficiency suffers because, for example, the transcription work near the boundaries of the divided voice data is performed more than once. When the work is duplicated, it also becomes necessary to decide which transcription result to adopt.
In contrast, in the information processing system 1 the boundary positions, such as the connection word or the sentence containing it, are displayed distinguishably, so a worker transcribing divided voice data can easily grasp the boundaries of the data they are responsible for in easily understood units.
This prevents multiple workers from transcribing the same portion and suppresses the loss of efficiency. It also removes the need to decide which of several duplicated transcription results to adopt.

（４）文字起こしの対象となる音声を同一の時間で分割した場合、音声認識結果の精度（信頼度）が高い部分については、修正する文字が少ないため、文字起こしの作業は短時間で済む一方、雑音等の影響で音声認識結果の精度が低い場合については、修正する文字が多くなり、文字起こし作業に要する時間は長時間となる。
この場合、分割された音声データそれぞれを作業者が処理する時間にばらつきが生じ、文字起こし作業全体の効率が低下する可能性がある。また、各作業者に対する報酬が同一であれば、処理負担が大きく異なることとなり、作業者間に不公平をもたらすこととなる。
これに対し、情報処理システム１では、各作業者による作業負担が均等となるように、文字起こしの対象となる音声データが分割されるため、各作業者の処理時間を均一化できると共に、作業者間に不公平が生じる事態を抑制することができる。 (4) When the voice to be transcribed is divided at the same time, the transcription work can be completed in a short time because there are few characters to be corrected for the part where the accuracy (reliability) of the voice recognition result is high. On the other hand, when the accuracy of the voice recognition result is low due to the influence of noise or the like, the number of characters to be corrected increases, and the time required for the transcription work becomes long.
In this case, the time required for the operator to process each of the divided voice data varies, which may reduce the efficiency of the entire transcription work. In addition, if the remuneration for each worker is the same, the processing load will be significantly different, resulting in unfairness among the workers.
In contrast, in the information processing system 1, the voice data to be transcribed is divided so that the work load on each worker is equal. The processing time of each worker can therefore be made uniform, and unfairness among the workers can be suppressed.
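As a rough illustration of balancing the work load in (4), the following Python sketch partitions consecutive utterances into one group per worker with approximately equal estimated effort. This is a generic greedy partition swapped in for illustration, not the method of the embodiment. The effort model `estimated_effort` (duration multiplied by 1 − CL) and both function names are hypothetical.

```python
from typing import List


def estimated_effort(duration_s: float, cl: float) -> float:
    """Hypothetical effort model: low-confidence audio needs more correction."""
    return duration_s * (1.0 - cl)


def split_by_effort(efforts: List[float], n_workers: int) -> List[List[int]]:
    """Greedy contiguous partition of utterance indices into at most n_workers groups.

    A group is closed once its accumulated effort reaches the per-worker target,
    as long as at least one more group is still allowed.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    target = sum(efforts) / n_workers
    groups: List[List[int]] = []
    current: List[int] = []
    acc = 0.0
    for i, e in enumerate(efforts):
        current.append(i)
        acc += e
        groups_still_needed = n_workers - len(groups) - 1
        if acc >= target and groups_still_needed > 0:
            groups.append(current)
            current, acc = [], 0.0
    if current:
        groups.append(current)
    return groups


# Four easy utterances followed by two hard ones, shared by three workers.
groups = split_by_effort([1.0, 1.0, 1.0, 1.0, 4.0, 4.0], 3)
```

In this example the four easy utterances go to one worker and each hard utterance to a separate worker, so each worker receives an estimated effort of 4.0.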

［変形例１］
上述の実施形態において、分割データの分割時間ＤＴを算出する場合、以下のような算出方法とすることができる。
即ち、分割データの分割時間ＤＴは、音声認識結果の信頼度から算出した係数λで比例計算することができる。
具体的には、音声認識結果の信頼度をＣＬ（０＜ＣＬ＜１）とすると、係数λをＣＬが大きいほど大きくなるＣＬの関数として定義することができ、例えば、λ＝ＣＬ＋０．５と定義することができる。
そして、この係数λを用いて、分割データの分割時間ＤＴを
ＤＴ＝λ×ＤＴ０
と定義することができる。 [Modification 1]
In the above-described embodiment, when calculating the division time DT of the division data, the following calculation method can be used.
That is, the division time DT of the division data can be proportionally calculated by the coefficient λ calculated from the reliability of the voice recognition result.
Specifically, letting the reliability of the voice recognition result be CL (0 < CL < 1), the coefficient λ can be defined as a function of CL that increases as CL increases; for example, λ = CL + 0.5.
Using this coefficient λ, the division time DT of the divided data can then be defined as
DT = λ × DT0.
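The calculation in Modification 1 can be written out as a short Python sketch. It uses the example definition λ = CL + 0.5 given above and assumes that DT0 is the reference division time of the embodiment; the function names are chosen here for illustration.

```python
def coefficient(cl: float) -> float:
    """Example coefficient from Modification 1: lambda = CL + 0.5, with 0 < CL < 1."""
    if not 0.0 < cl < 1.0:
        raise ValueError("CL must satisfy 0 < CL < 1")
    return cl + 0.5


def division_time(cl: float, dt0: float) -> float:
    """Division time DT = lambda * DT0 (dt0 is assumed to be the reference time)."""
    return coefficient(cl) * dt0


# With a reference time of 60 s, a reliable section gets a longer segment.
dt_low = division_time(0.25, 60.0)
dt_high = division_time(0.75, 60.0)
```

With this definition λ ranges between 0.5 and 1.5, so a segment is at most half as long again as, and at least half as long as, the reference time.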

［変形例２］
上述の実施形態において、分割データの分割時間（即ち、文字起こし作業の負荷）を決定するパラメータとして、音声認識結果の信頼度を用いることとしたが、これに限られない。
例えば、音声認識文字数、発話スピード（一定時間における発話モーラ数）、音声の品質（Ｓ／Ｎ比等）、発話の明瞭度（滑舌の良さ、なまりの度合い等）、音割れ（音の歪み）の有無等、音声データの各種属性に基づいて、分割データの分割時間を決定することとしてもよい。 [Modification 2]
In the above-described embodiment, the reliability of the voice recognition result is used as a parameter for determining the division time of the divided data (that is, the load of the transcription work), but the present invention is not limited to this.
For example, the division time of the divided data may be determined based on various attributes of the voice data, such as the number of recognized characters, the speech rate (the number of morae uttered in a fixed period), the voice quality (S/N ratio, etc.), the clarity of speech (articulation, degree of accent, etc.), and the presence or absence of clipping (sound distortion).
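One possible way to turn such attributes into a division time is sketched below in Python. Everything in it is an assumption made for illustration: the choice of three attributes, the normalization ranges (4 to 12 morae per second, 0 to 40 dB), the equal weighting, and the reuse of the form λ = score + 0.5 from Modification 1. The description only states that attributes of this kind may be used.

```python
from dataclasses import dataclass


@dataclass
class AudioAttributes:
    morae_per_second: float  # speech rate
    snr_db: float            # voice quality as S/N ratio
    has_clipping: bool       # presence of sound distortion


def clamp01(x: float) -> float:
    """Limit x to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def ease_score(a: AudioAttributes) -> float:
    """Hypothetical score in [0, 1]; higher means easier to transcribe."""
    rate = clamp01((12.0 - a.morae_per_second) / 8.0)  # slower speech scores higher
    noise = clamp01(a.snr_db / 40.0)                    # cleaner audio scores higher
    clip = 0.0 if a.has_clipping else 1.0
    return (rate + noise + clip) / 3.0


def division_time_from_attributes(a: AudioAttributes, dt0: float) -> float:
    """Scale the reference time dt0 by (score + 0.5), mirroring lambda = CL + 0.5."""
    return (ease_score(a) + 0.5) * dt0


clean = AudioAttributes(morae_per_second=4.0, snr_db=40.0, has_clipping=False)
noisy = AudioAttributes(morae_per_second=12.0, snr_db=0.0, has_clipping=True)
```

Under these assumed ranges, the clean recording receives the longest segment (1.5 × DT0) and the fast, noisy, clipped recording the shortest (0.5 × DT0).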

以上のように構成される情報処理システム１は、サーバ２０と、端末装置１０Ａとを含む。サーバ２０は、音声データ取得部２５１及び音声認識結果取得部２５２（文字起こし対象データ取得手段）と、データ分割部２５３とを備える。端末装置１０Ａは、文字起こしインターフェース表示部５２と、分割データ修正受付部５３とを備える。
音声データ取得部２５１及び音声認識結果取得部２５２は、文字起こしの対象となる音声データ及び当該音声データの音声認識結果のデータを取得する。
データ分割部２５３は、音声データ取得部２５１及び音声認識結果取得部２５２によって取得された音声データ及び音声認識結果のデータを分割して分割データを生成する。
文字起こしインターフェース表示部５２は、分割データにおける音声データの音声波形を表す領域と、当該音声データの音声認識結果のデータが示す文字列を表す領域とを含み、分割データを文字起こしするための文字起こしインターフェースを表示する。
分割データ修正受付部５３は、文字起こしインターフェースに表示された音声認識結果のデータに対する修正を受け付ける。
文字起こしインターフェース表示部５２は、音声データの音声波形において分割データの境界位置を示す接続点と、音声認識結果のデータにおいて接続点の音声に対応する接続語とを識別して表示する。
これにより、隣接する分割データとの境界部分に、境界となる時刻に対応する接続語が識別して示される。
そのため、分割データを文字起こしする作業者は、接続語を単位として、自身が担当すべきデータの境界を容易に把握することができる。
したがって、情報処理システム１によれば、対象となる音声を複数の作業者によって分担して文字起こしを行う処理の効率を向上させることができる。 The information processing system 1 configured as described above includes the server 20 and the terminal device 10A. The server 20 includes a voice data acquisition unit 251 and a voice recognition result acquisition unit 252 (transcription target data acquisition means), and a data division unit 253. The terminal device 10A includes a transcription interface display unit 52 and a divided data correction reception unit 53.
The voice data acquisition unit 251 and the voice recognition result acquisition unit 252 acquire the voice data to be transcribed and the voice recognition result data of the voice data.
The data division unit 253 divides the voice data acquired by the voice data acquisition unit 251 and the voice recognition result acquisition unit 252 and the voice recognition result data to generate the divided data.
The transcription interface display unit 52 displays a transcription interface for transcribing the divided data. The interface includes an area representing the voice waveform of the voice data in the divided data and an area representing the character string indicated by the voice recognition result data of that voice data.
The divided data correction receiving unit 53 receives corrections for the voice recognition result data displayed on the transcription interface.
The transcription interface display unit 52 identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data.
As a result, the connection word corresponding to the time of the boundary is identified and shown at the boundary portion with the adjacent divided data.
Therefore, a worker who transcribes the divided data can easily grasp, in units of connection words, the boundary of the data that he or she is responsible for.
Consequently, the information processing system 1 can improve the efficiency of transcription work in which the target voice is shared among a plurality of workers.

文字起こしインターフェース表示部５２は、文字起こしインターフェースにおいて、接続語を含む接続文節をさらに識別して表示する。
これにより、接続文節を単位として、自身が担当すべきデータの境界を容易に把握することができる。 The transcription interface display unit 52 further identifies and displays the connection phrase including the connection word in the transcription interface.
As a result, the worker can easily grasp, in units of connection clauses, the boundary of the data that he or she is responsible for.

文字起こしインターフェース表示部５２は、文字起こしインターフェースにおいて、接続語を含む接続文をさらに識別して表示する。
これにより、接続文を単位として、自身が担当すべきデータの境界を容易に把握することができる。 The transcription interface display unit 52 further identifies and displays the connection statement including the connection word in the transcription interface.
As a result, the worker can easily grasp, in units of connection statements, the boundary of the data that he or she is responsible for.

データ分割部２５３は、接続点の音声に対応する接続語、接続語を含む接続文節または接続語を含む接続文の少なくともいずれかを単位として、音声データ及び音声認識結果のデータを分割する。
これにより、接続語、接続文節または接続文を単位として、分割データの境界を設定することができる。また、分割データの境界を設定する際に、音声データにおいて接続点に付加する時間を、接続語、接続文節または接続文等の単位の先頭あるいは末尾を区切りとして、機械的に算出することができる。 The data division unit 253 divides the voice data and the voice recognition result data in units of at least one of a connection word corresponding to the voice of the connection point, a connection clause including the connection word, and a connection statement including the connection word.
As a result, the boundaries of the divided data can be set in units of connection words, connection clauses, or connection statements. Further, when setting the boundary of the divided data, the time to be added to the connection point in the voice data can be calculated mechanically, using the beginning or end of a unit such as the connection word, connection clause, or connection statement as the delimiter.
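The mechanical calculation mentioned above can be sketched in Python as follows. The sketch assumes, hypothetically, that the recognition result provides time spans for each unit level (word, clause, statement); the names `Span`, `unit_at`, and `snap_boundary` are illustrative and do not come from the description.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float, str]  # (start time in s, end time in s, text)


def unit_at(spans: List[Span], t: float) -> Span:
    """Return the unit whose time span [start, end) contains time t."""
    for span in spans:
        if span[0] <= t < span[1]:
            return span
    raise ValueError(f"no unit contains t={t}")


def snap_boundary(levels: Dict[str, List[Span]], level: str, t: float, side: str) -> Tuple[float, float]:
    """Move connection point t to the start or end of the enclosing unit.

    Returns (boundary_time, added_time), where added_time is the time added to
    the connection point so that the whole unit is included.
    """
    if side not in ("start", "end"):
        raise ValueError("side must be 'start' or 'end'")
    start, end, _ = unit_at(levels[level], t)
    edge = start if side == "start" else end
    return edge, abs(edge - t)


levels = {
    "word": [(0.0, 0.5, "so"), (0.5, 1.0, "we"), (1.0, 2.0, "agreed")],
    "clause": [(0.0, 2.0, "so we agreed")],
}
word_start = snap_boundary(levels, "word", 0.75, "start")
clause_end = snap_boundary(levels, "clause", 0.75, "end")
```

Choosing a larger unit gives the worker more context at the cost of a longer added time: here the word-level boundary adds 0.25 s, while the clause-level boundary adds 1.25 s.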

接続語、接続文節または接続文の少なくともいずれかは、分割データの文字起こしを行う作業者が担当する範囲の境界を表す。
これにより、作業者は、自身が文字起こしを担当する文字列をわかり易い単位で把握することができる。 At least one of a connection word, a connection clause, or a connection statement represents the boundary of the range in charge of the operator who transcribes the divided data.
As a result, the worker can grasp, in easy-to-understand units, the character string that he or she is in charge of transcribing.

データ分割部２５３は、音声データの属性に基づいて、音声データ及び音声認識結果のデータを分割データとして分割する長さを決定する。
これにより、文字起こしの対象となる音声データの属性を反映させて、分割データの長さを決定することができる。 The data division unit 253 determines the length for dividing the voice data and the voice recognition result data as the division data based on the attributes of the voice data.
As a result, the length of the divided data can be determined by reflecting the attributes of the voice data to be transcribed.

データ分割部２５３は、音声認識結果の信頼度に基づいて、音声データ及び音声認識結果のデータを分割データとして分割する長さを決定する。
これにより、文字起こしの対象となる音声データの信頼度を反映させて、分割データの長さを決定することができる。 The data division unit 253 determines the length of dividing the voice data and the voice recognition result data as the divided data based on the reliability of the voice recognition result.
As a result, the length of the divided data can be determined by reflecting the reliability of the voice data to be transcribed.

サーバ２０は、データ集約部２５６を備える。
データ集約部２５６は、作業者用の端末装置１０Ａにおける分割データの作業結果を集約した集約データを生成する。
これにより、複数の作業者による作業結果を容易に集約することができる。 The server 20 includes a data aggregation unit 256.
The data aggregation unit 256 generates aggregated data that aggregates the work results of the divided data in the terminal device 10A for workers.
As a result, the work results of a plurality of workers can be easily aggregated.
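A minimal Python sketch of the aggregation performed by the data aggregation unit 256 is shown below. The record layout (`CorrectedSegment` with a segment index, a worker identifier, and corrected text) and the plain concatenation of the texts are assumptions for illustration; the description does not define the format of the aggregated data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectedSegment:
    index: int      # position of the divided data in the original voice data
    worker_id: str  # hypothetical identifier of the worker
    text: str       # corrected transcription for this segment


def aggregate(segments: List[CorrectedSegment], expected_count: int) -> List[CorrectedSegment]:
    """Order the returned segments by index and check that none is missing."""
    by_index = {s.index: s for s in segments}
    missing = [i for i in range(expected_count) if i not in by_index]
    if missing:
        raise ValueError(f"missing segments: {missing}")
    return [by_index[i] for i in range(expected_count)]


# Results may arrive from the terminal devices in any order.
received = [
    CorrectedSegment(1, "worker-b", "and then approved the budget."),
    CorrectedSegment(0, "worker-a", "We reviewed the plan "),
]
ordered = aggregate(received, expected_count=2)
full_text = "".join(s.text for s in ordered)
```

Keeping the worker identifier with each segment lets a later review step show which worker produced which part of the text.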

情報処理システム１は、端末装置１０Ｂをさらに含む。
端末装置１０Ｂは、複数の作業者による作業結果から全体の文字起こし結果を生成する校閲者によって使用される。
端末装置１０Ｂは、校閲用インターフェース表示部１５２を備える。
校閲用インターフェース表示部１５２は、作業者用の端末装置における作業結果を集約した集約データについて、当該集約データに含まれる分割データの作業結果のうち、音声データの音声波形を表す領域と、当該音声データを対象として作業者が文字起こしした結果の文字列を表す領域とを含み、集約データを校閲するための校閲用インターフェースを表示する。
これにより、校閲者は、異なる作業者による文字起こし作業の結果を容易に校閲することが可能となる。 The information processing system 1 further includes a terminal device 10B.
The terminal device 10B is used by a reviewer who generates an entire transcription result from work results by a plurality of workers.
The terminal device 10B includes a review interface display unit 152.
The review interface display unit 152 displays a review interface for reviewing aggregated data in which the work results from the terminal devices for workers are aggregated. For the work results of the divided data included in the aggregated data, the review interface includes an area representing the voice waveform of the voice data and an area representing the character string that the worker transcribed from that voice data.
This allows the reviewer to easily review the results of the transcription work by different workers.

校閲用インターフェース表示部１５２は、集約データに含まれる分割データの作業結果のうち、音声データの音声波形において分割データの境界位置を示す接続点と、接続点の音声に対応する接続語とを識別して表示する。
これにより、校閲者は、異なる作業者によって文字起こし作業が行われた部分の境界を容易に把握しながら、当該部分に対して高い注意をもって確認を行うことができる。 The review interface display unit 152 identifies and displays, among the work results of the divided data included in the aggregated data, the connection point indicating the boundary position of the divided data in the voice waveform of the voice data and the connection word corresponding to the voice at that connection point.
As a result, the reviewer can easily grasp the boundaries between portions transcribed by different workers and can check those portions with particular care.
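How the boundaries might be made visible in the text area of a review interface can be sketched in Python as follows. The plain-text markers (a vertical bar at each seam and square brackets around the connection word) and the assumption that each later segment begins with its connection word are illustrative only; an actual interface would more likely use highlighting.

```python
from typing import List


def render_for_review(segments: List[List[str]]) -> str:
    """Join the workers' word lists, marking each seam and its connection word.

    Assumption: the first word of every segment after the first is the
    connection word for the boundary with the preceding segment.
    """
    parts: List[str] = []
    for i, words in enumerate(segments):
        if i > 0 and words:
            parts.append("|")               # seam between two workers' results
            parts.append(f"[{words[0]}]")   # connection word, marked for the reviewer
            parts.extend(words[1:])
        else:
            parts.extend(words)
    return " ".join(parts)


review_text = render_for_review([["we", "met"], ["and", "then", "left"]])
```

The reviewer can then scan for the markers to find the places where results from different workers meet.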

なお、本発明は、上述の実施形態に限定されるものではなく、本発明の目的を達成できる範囲での変形、改良等は本発明に含まれるものである。
例えば、上述の実施形態において、分割データの境界を示す情報として、文字列の各種ブロックを単位として定義することができる。即ち、分割データの境界を示す文字列のブロックとしては、形態素、単語、文節、句、単文等を定義したり、あるいは、複文までを許容して定義したりすることができる。また、上述の実施形態において、日本語の他、英語、中国語、タイ語等、異なる言語体系においても、その言語に応じたブロックを定義して本発明を活用することができる。 The present invention is not limited to the above-described embodiment, and modifications, improvements, and the like within the range in which the object of the present invention can be achieved are included in the present invention.
For example, in the above-described embodiment, various blocks of the character string can be defined as the unit that indicates the boundary of the divided data. That is, the block of the character string indicating the boundary of the divided data may be defined as a morpheme, a word, a clause, a phrase, a simple sentence, or the like, or may be defined so as to allow up to a complex sentence. Further, the present invention can be applied not only to Japanese but also to other language systems such as English, Chinese, and Thai by defining blocks appropriate to the language.

また、上述の実施形態において、サーバ２０の構成は一例として示したものであり、情報処理システム１全体として、サーバ２０の機能が備えられていれば、サーバ２０の機能を複数のサーバに分割して実装したり、端末装置１０にサーバ２０の機能の一部を実装したりすることができる。
さらに、サーバ２０の機能をいずれかの端末装置１０に実装することにより、サーバ２０を介することなく、端末装置１０を使用するユーザ間において、文字起こしの対象となる音声データ及びその音声データの音声認識結果を分割し、文字起こし作業を分担して行うこととしてもよい。この場合、作業者によって使用される複数の端末装置１０Ａから送信される作業済みデータを、最終校閲者が使用する端末装置１０Ｂが受信して集約データを生成し、最終校閲者が集約データを校閲することにより、最終的な文字起こしデータを生成することができる。
また、上述の実施形態及び変形例を適宜組み合わせた構成とすることとしてもよい。 Further, in the above-described embodiment, the configuration of the server 20 is shown only as an example. As long as the information processing system 1 as a whole provides the functions of the server 20, those functions may be divided among and implemented on a plurality of servers, or some of them may be implemented on the terminal device 10.
Further, by implementing the functions of the server 20 in one of the terminal devices 10, the voice data to be transcribed and its voice recognition result may be divided, and the transcription work shared, among the users of the terminal devices 10 without going through the server 20. In this case, the terminal device 10B used by the final reviewer receives the completed work data transmitted from the plurality of terminal devices 10A used by the workers and generates the aggregated data, and the final reviewer reviews the aggregated data to produce the final transcription data.
In addition, the configuration may be a combination of the above-described embodiments and modifications as appropriate.

上述した一連の処理は、ハードウェアにより実行させることもできるし、ソフトウェアにより実行させることもできる。
換言すると、図４，６，８の機能的構成は例示に過ぎず、特に限定されない。即ち、上述した一連の処理を全体として実行できる機能が情報処理システム１に備えられていれば足り、この機能を実現するためにどのような機能ブロックを用いるのかは特に図４，６，８の例に限定されない。
また、１つの機能ブロックは、ハードウェア単体で構成してもよいし、ソフトウェア単体で構成してもよいし、それらの組み合わせで構成してもよい。 The series of processes described above can be executed by hardware or software.
In other words, the functional configurations of FIGS. 4, 6 and 8 are merely examples and are not particularly limiting. That is, it suffices that the information processing system 1 has a function capable of executing the above-described series of processes as a whole, and the functional blocks used to realize this function are not limited to the examples of FIGS. 4, 6 and 8.
Further, one functional block may be configured by a single piece of hardware, a single piece of software, or a combination thereof.

一連の処理をソフトウェアにより実行させる場合には、そのソフトウェアを構成するプログラムが、コンピュータ等にネットワークや記録媒体からインストールされる。
コンピュータは、専用のハードウェアに組み込まれているコンピュータであってもよい。また、コンピュータは、各種のプログラムをインストールすることで、各種の機能を実行することが可能なコンピュータ、例えば汎用のパーソナルコンピュータであってもよい。 When a series of processes are executed by software, the programs constituting the software are installed on a computer or the like from a network or a recording medium.
The computer may be a computer embedded in dedicated hardware. Further, the computer may be a computer capable of executing various functions by installing various programs, for example, a general-purpose personal computer.

このようなプログラムを含む記録媒体は、ユーザにプログラムを提供するために装置本体とは別に配布される図２及び図３のリムーバブルメディア３１，２３１により構成されるだけでなく、装置本体に予め組み込まれた状態でユーザに提供される記録媒体等で構成される。リムーバブルメディア３１，２３１は、例えば、磁気ディスク（フロッピディスクを含む）、光ディスク、または光磁気ディスク等により構成される。光ディスクは、例えば、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｋ−ＲｅａｄＯｎｌｙＭｅｍｏｒｙ），ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｋ）等により構成される。光磁気ディスクは、ＭＤ（Ｍｉｎｉ−Ｄｉｓｋ）等により構成される。また、装置本体に予め組み込まれた状態でユーザに提供される記録媒体は、例えば、プログラムが記録されている図２及び図３のＲＯＭ１２，２１２や、図２及び図３の記憶部１７，２１７に含まれるＤＲＡＭ等で構成される。 The recording medium containing such a program is constituted not only by the removable media 31 and 231 of FIGS. 2 and 3, which are distributed separately from the device main body in order to provide the program to the user, but also by a recording medium or the like that is provided to the user in a state of being incorporated in the device main body in advance. The removable media 31 and 231 are composed of, for example, a magnetic disk (including a floppy disk), an optical disk, a magneto-optical disk, or the like. The optical disk is composed of, for example, a CD-ROM (Compact Disk-Read Only Memory), a DVD (Digital Versatile Disk), or the like. The magneto-optical disk is composed of an MD (Mini-Disk) or the like. The recording medium provided to the user in a state of being incorporated in the device main body in advance is composed of, for example, the ROMs 12 and 212 of FIGS. 2 and 3 in which the program is recorded, or the DRAM or the like included in the storage units 17 and 217 of FIGS. 2 and 3.

なお、本明細書において、記録媒体に記録されるプログラムを記述するステップは、その順序に沿って時系列的に行われる処理はもちろん、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理をも含むものである。
また、本明細書において、システムの用語は、複数の装置や複数の手段等より構成される全体的な装置を意味するものとする。 In the present specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the described order but also processing executed in parallel or individually, without necessarily being processed in time series.
Further, in the present specification, the term of the system shall mean an overall device composed of a plurality of devices, a plurality of means, and the like.

１情報処理システム、１０，１０Ａ，１０Ｂ端末装置、２０サーバ、３０ネットワーク、１１，２１１ＣＰＵ、１２，２１２ＲＯＭ、１３，２１３ＲＡＭ、１４，２１４バス、１５，２１５入力部、１６，２１６出力部、１７，２１７記憶部、１８，２１８通信部、１９，２１９ドライブ、３１，２３１リムーバブルメディア、５１分割データ受信部、５２文字起こしインターフェース表示部、５３分割データ修正受付部、５４修正済みデータ送信部、７１分割データ記憶部、１５１集約データ受信部、１５２校閲用インターフェース表示部、１５３集約データ修正受付部、１７１文字起こしデータ記憶部、２５１音声データ取得部、２５２音声認識結果取得部、２５３データ分割部、２５４分割データ送信部、２５５修正済みデータ受信部、２５６データ集約部、２５７集約データ送信部、２７１文字起こし関連データ記憶部 1: information processing system; 10, 10A, 10B: terminal device; 20: server; 30: network; 11, 211: CPU; 12, 212: ROM; 13, 213: RAM; 14, 214: bus; 15, 215: input unit; 16, 216: output unit; 17, 217: storage unit; 18, 218: communication unit; 19, 219: drive; 31, 231: removable media; 51: divided data receiving unit; 52: transcription interface display unit; 53: divided data correction reception unit; 54: corrected data transmission unit; 71: divided data storage unit; 151: aggregated data receiving unit; 152: review interface display unit; 153: aggregated data correction reception unit; 171: transcription data storage unit; 251: voice data acquisition unit; 252: voice recognition result acquisition unit; 253: data division unit; 254: divided data transmission unit; 255: corrected data receiving unit; 256: data aggregation unit; 257: aggregated data transmission unit; 271: transcription-related data storage unit

Claims

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server comprises:
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for generating divided data by dividing the voice data and the voice recognition result data acquired by the transcription target data acquisition means at a position including the middle of a word or a morpheme in the voice data.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating a boundary position of the divided data, set at a position including the middle of the word or morpheme in the voice waveform of the voice data, and a connection word corresponding to the voice of the connection point in the voice recognition result data.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the transcription interface display means further identifies and displays, in the transcription interface, a connection clause including the connection word.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the transcription interface display means further identifies and displays, in the transcription interface, a connection statement including the connection word.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the data dividing means divides the voice data and the voice recognition result data in units of at least one of a connection word corresponding to the voice of the connection point, a connection clause including the connection word, or a connection statement including the connection word.

The information processing system according to claim 4, wherein at least one of the connection word, the connection clause, and the connection statement represents a boundary of the range that the worker who transcribes the divided data is in charge of.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the data dividing means determines, based on an attribute of the voice data, the length into which the voice data and the voice recognition result data are divided as the divided data.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the data dividing means determines, based on the reliability of the voice recognition result, the length into which the voice data and the voice recognition result data are divided as the divided data.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for the worker is
A transcription interface for transcribing the divided data is displayed, including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data. Transcription interface display means and
It is provided with a data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice of the connection point in the voice recognition result data, and
the server further comprises aggregated data generating means for generating aggregated data that aggregates the work results of the divided data in the terminal device for the worker.

The information processing system according to any one of claims 1 to 8, further comprising a terminal device for reviewers, used by a reviewer who generates the entire transcription result from the work results of a plurality of workers,
wherein the terminal device for the reviewer comprises review interface display means for displaying a review interface for reviewing aggregated data in which the work results from the terminal devices for workers are aggregated, the review interface including, for the work results of the divided data included in the aggregated data, an area representing the voice waveform of the voice data and an area representing a character string resulting from the worker transcribing that voice data.

An information processing system including a server that divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by the worker who transcribes the voice data.
The server
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
The terminal device for workers comprises
Transcription interface display means for displaying a transcription interface for transcribing the divided data, the transcription interface including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data, and
Data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface.
The transcription interface display means identifies and displays a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice at the connection point in the voice recognition result data.
The information processing system further includes a terminal device for reviewers used by a reviewer who generates the entire transcription result from the work results of a plurality of workers.
The terminal device for reviewers comprises
Review interface display means for displaying a review interface for reviewing aggregated data that aggregates the work results from the terminal device for workers, the review interface including, for the work results of the divided data included in the aggregated data, an area representing the voice waveform of the voice data and an area representing a character string resulting from the worker's transcription of the voice data.
An information processing system characterized in that the review interface display means identifies and displays, among the work results of the divided data included in the aggregated data, the connection point indicating the boundary position of the divided data in the voice waveform of the voice data and the connection word corresponding to the voice at the connection point.

A terminal device for workers used by workers who transcribe voice data.
Transcription interface display means for displaying a transcription interface for transcribing divided data, the divided data being obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data at positions including the middle of a word or a morpheme in the voice data, and the transcription interface including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data, and
Data correction receiving means for receiving corrections to the voice recognition result data displayed on the transcription interface are provided.
A terminal device characterized in that the transcription interface display means identifies and displays a connection point indicating a boundary position of the divided data, set at a position including the middle of the word or the morpheme in the voice waveform of the voice data, and a connection word corresponding to the voice at the connection point in the voice recognition result data.
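
As an illustration only, identifying the connection word for a connection point could be sketched as below. This is a minimal sketch under stated assumptions: it supposes that the voice recognition result data carries a start and end time for each recognized word, which the claim does not state, and the names `RecognizedWord`, `find_connection_word`, and `render_with_highlight` are hypothetical. Square brackets stand in for whatever visual emphasis an actual interface would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognizedWord:
    """One word of the recognition result with assumed timing information."""
    text: str
    start: float  # start time in seconds within the voice data
    end: float    # end time in seconds within the voice data


def find_connection_word(
    words: list[RecognizedWord], connection_point: float
) -> Optional[int]:
    """Return the index of the word whose time span contains the connection point.

    Returns None when the connection point falls in a gap between words.
    """
    for index, word in enumerate(words):
        if word.start <= connection_point < word.end:
            return index
    return None


def render_with_highlight(
    words: list[RecognizedWord], connection_point: float
) -> str:
    """Render the character string with the connection word marked by brackets."""
    connection_index = find_connection_word(words, connection_point)
    return " ".join(
        f"[{word.text}]" if index == connection_index else word.text
        for index, word in enumerate(words)
    )
```

For example, with words spanning 0.0 to 0.3, 0.3 to 0.9, and 0.9 to 1.4 seconds, a connection point at 0.6 seconds falls inside the second word, so that word is the one marked.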

A terminal device for reviewers used by reviewers to generate the entire transcription result from the work results of multiple workers.
Review interface display means for displaying a review interface for reviewing aggregated data is provided, the aggregated data being obtained by aggregating the work results of the plurality of workers on divided data obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data, and the review interface including, for the work results included in the aggregated data, an area representing the voice waveform of the voice data and an area representing a character string resulting from the worker's transcription of the voice data.
A terminal device characterized in that the review interface display means identifies and displays, among the work results of the divided data included in the aggregated data, a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice at the connection point in the voice recognition result data.

A server in an information processing system that includes the server, which divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by a worker who transcribes the voice data.
Transcription target data acquisition means for acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data dividing means for dividing the voice data acquired by the transcription target data acquisition means and the voice recognition result data to generate divided data is provided.
A server characterized in that, for the connection point indicating the boundary position of the divided data in the voice waveform of the voice data, the data dividing means divides the voice data and the voice recognition result data in units of at least one of a connection word, a connection clause, or a connection statement corresponding to the voice at the connection point in the voice recognition result data.
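
As an illustration only, one possible reading of dividing "in units of" a connection word is sketched below: each cut is widened to whole-word boundaries so that the connection word is kept intact and appears in both neighboring pieces of divided data. The overlap is an assumption made for this sketch, not something the claim states, and the names `Word`, `Segment`, and `divide_at_connection_points` are hypothetical. Word-level timing in the recognition result is likewise assumed.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with assumed timing information."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    """One piece of divided data: a time range plus the words it fully contains."""
    start: float
    end: float
    words: list[Word]


def _make_segment(words: list[Word], start: float, end: float) -> Segment:
    # Keep only the words that lie entirely inside the time range.
    inside = [w for w in words if w.start >= start and w.end <= end]
    return Segment(start, end, inside)


def divide_at_connection_points(
    words: list[Word], connection_points: list[float]
) -> list[Segment]:
    """Divide at each connection point without cutting the connection word.

    Assumption: the earlier segment is extended to the end of the connection
    word and the later segment starts at its beginning, so both contain it.
    """
    if not words:
        return []
    segments: list[Segment] = []
    segment_start = words[0].start
    for point in sorted(connection_points):
        # The connection word is the word whose time span contains the point.
        connection = next((w for w in words if w.start <= point < w.end), None)
        segment_end = connection.end if connection else point
        segments.append(_make_segment(words, segment_start, segment_end))
        segment_start = connection.start if connection else point
    segments.append(_make_segment(words, segment_start, words[-1].end))
    return segments
```

Dividing in units of a connection clause or connection statement could follow the same pattern with clause or statement spans in place of word spans.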

An information processing method executed by a terminal device for a worker used by a worker who transcribes voice data.
A transcription interface display step of displaying a transcription interface for transcribing divided data, the divided data being obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data at positions including the middle of a word or a morpheme in the voice data, and the transcription interface including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data, and
A data correction acceptance step of accepting corrections to the voice recognition result data displayed on the transcription interface are included.
An information processing method characterized in that, in the transcription interface display step, a connection point indicating the boundary position of the divided data, set at a position including the middle of the word or the morpheme in the voice waveform of the voice data, and a connection word corresponding to the voice at the connection point in the voice recognition result data are identified and displayed.

An information processing method performed by a reviewer's terminal device used by a reviewer to generate the entire transcription result from the work results of multiple workers.
A review interface display step of displaying a review interface for reviewing aggregated data is included, the aggregated data being obtained by aggregating the work results of the plurality of workers on divided data obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data, and the review interface including, for the work results included in the aggregated data, an area representing the voice waveform of the voice data and an area representing a character string resulting from the worker's transcription of the voice data.
An information processing method characterized in that, in the review interface display step, among the work results of the divided data included in the aggregated data, a connection point indicating the boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice at the connection point in the voice recognition result data are identified and displayed.

An information processing method executed by a server in an information processing system that includes the server, which divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by a worker who transcribes the voice data.
A transcription target data acquisition step of acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data division step of dividing the voice data acquired in the transcription target data acquisition step and the voice recognition result data to generate divided data are included.
An information processing method characterized in that, in the data division step, for the connection point indicating the boundary position of the divided data in the voice waveform of the voice data, the voice data and the voice recognition result data are divided in units of at least one of a connection word, a connection clause, or a connection statement corresponding to the voice at the connection point in the voice recognition result data.

A program that causes a computer constituting a terminal device for workers, used by a worker who transcribes voice data, to realize
A transcription interface display function of displaying a transcription interface for transcribing divided data, the divided data being obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data at positions including the middle of a word or a morpheme in the voice data, and the transcription interface including an area representing the voice waveform of the voice data in the divided data and an area representing a character string indicated by the voice recognition result data of the voice data, and
A data correction reception function of accepting corrections to the voice recognition result data displayed on the transcription interface.
A program characterized in that the transcription interface display function identifies and displays a connection point indicating a boundary position of the divided data, set at a position including the middle of the word or the morpheme in the voice waveform of the voice data, and a connection word corresponding to the voice at the connection point in the voice recognition result data.

A program that causes a computer constituting a terminal device for reviewers, used by a reviewer who generates the entire transcription result from the work results of a plurality of workers, to realize
A review interface display function of displaying a review interface for reviewing aggregated data, the aggregated data being obtained by aggregating the work results of the plurality of workers on divided data obtained by dividing the voice data to be transcribed and the voice recognition result data of the voice data, and the review interface including, for the work results included in the aggregated data, an area representing the voice waveform of the voice data and an area representing a character string resulting from the worker's transcription of the voice data.
A program characterized in that the review interface display function identifies and displays, among the work results of the divided data included in the aggregated data, a connection point indicating a boundary position of the divided data in the voice waveform of the voice data and a connection word corresponding to the voice at the connection point in the voice recognition result data.

A program that causes a computer constituting a server, in an information processing system that includes the server, which divides voice data to be transcribed and assigns it to a plurality of workers, and a terminal device for workers used by a worker who transcribes the voice data, to realize
A transcription target data acquisition function of acquiring the voice data to be transcribed and the voice recognition result data of the voice data, and
A data division function of dividing the voice data acquired by the transcription target data acquisition function and the voice recognition result data to generate divided data.
A program characterized in that, for the connection point indicating the boundary position of the divided data in the voice waveform of the voice data, the data division function divides the voice data and the voice recognition result data in units of at least one of a connection word, a connection clause, or a connection statement corresponding to the voice at the connection point in the voice recognition result data.