JP2020071470A

JP2020071470A - Information processing system and transcription method

Info

Publication number: JP2020071470A
Application number: JP2019072482A
Authority: JP
Inventors: Tetsuya Nagase (永瀬 哲也)
Original assignee: Jx Wind Co Ltd
Current assignee: Jx Wind Co Ltd
Priority date: 2019-04-05
Filing date: 2019-04-05
Publication date: 2020-05-07
Anticipated expiration: 2038-10-31
Also published as: JP7106124B2

Abstract

PROBLEM TO BE SOLVED: To enhance confidentiality of contents of transcription target speech.

SOLUTION: A management device 12 divides target speech data, in which a transcription target speech is recorded, into multiple sectional speech data pieces over multiple sections. The management device 12 provides an operator device 16 with a set of at least one of the multiple sectional speech data pieces and dummy speech data. The management device 12 receives text data transcribed from the at least one of the multiple sectional speech data pieces, and/or text data transcribed from the dummy speech data. The management device 12 generates text data transcribed from the target speech by using, among the received text data, the text data transcribed from the at least one of the multiple sectional speech data pieces.

SELECTED DRAWING: Figure 1

Description

The present invention relates to data processing technology, and in particular to an information processing system and a transcription method.

Transcription systems that produce text from conversations recorded as voice have been proposed (see, for example, Patent Document 1). In the transcription system of Patent Document 1, a server divides voice data in which a conversation is recorded into voice data for a plurality of voice sections and transmits the voice data of each voice section to a plurality of information terminals. Each information terminal outputs a character string transcribed from its voice data to the server, and the server combines the individual character strings to construct sentence data that renders the entire conversation of the original voice data as text.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2008-107624 (JP 2008-107624 A)

When the voice to be transcribed contains confidential matters, for example, it may be undesirable for the content of the voice to be conveyed as is to the worker who performs the transcription. The present inventor recognized that there is room for improvement in increasing the confidentiality of the content of the voice to be transcribed.

The present invention is based on the inventor's recognition of the above problem, and one object thereof is to increase the confidentiality of the content of the voice to be transcribed.

To solve the above problem, an information processing system according to one aspect of the present invention includes: a first storage unit that stores target voice data in which a voice to be transcribed is recorded; a second storage unit that stores dummy voice data in which a dummy voice is recorded; a dividing unit that divides the target voice data into a plurality of section voice data corresponding to a plurality of sections; a providing unit that provides an external device with a set of at least one of the plurality of section voice data and the dummy voice data; a receiving unit that receives text data transcribed from the at least one of the plurality of section voice data and text data transcribed from the dummy voice data; and a generating unit that generates text data in which the target voice is transcribed, using, of the text data received by the receiving unit, the text data transcribed from the at least one of the plurality of section voice data.

Another aspect of the present invention is a transcription method. In this method, an information processing system that stores target voice data in which a voice to be transcribed is recorded and dummy voice data in which a dummy voice is recorded executes the steps of: dividing the target voice data into a plurality of section voice data corresponding to a plurality of sections; providing an external device with a set of at least one of the plurality of section voice data and the dummy voice data; receiving both text data transcribed from the at least one of the plurality of section voice data and text data transcribed from the dummy voice data; and generating text data in which the target voice is transcribed, using, of the received text data, the text data transcribed from the at least one of the plurality of section voice data.

Any combination of the above components, and any conversion of the expression of the present invention among a device, a computer program, a recording medium storing a computer program, and the like, are also valid as aspects of the present invention.

According to the present invention, the confidentiality of the content of the voice to be transcribed can be increased.

FIG. 1 is a diagram showing the configuration of the transcription system of the embodiment.
FIG. 2 is a block diagram showing the functional blocks of the management device of FIG. 1.
FIG. 3 is a diagram showing an example of division of voice data.
FIG. 4 is a diagram showing an example of voice data.
FIG. 5 is a diagram showing an example of allocation of section voice data.
FIG. 6 is a diagram showing an example of work results by workers.
FIG. 7 is a diagram showing an example of division of voice data.

The transcription system of the embodiment provides a worker who performs transcription with a set consisting of at least part of the voice to be transcribed (a voice relating to a user, which may contain confidential information) and a dummy voice, and has the worker transcribe both. This reduces the risk that the content of the entire voice to be transcribed is leaked and increases the confidentiality of that content.

FIG. 1 shows the configuration of a transcription system 10 of the embodiment. The transcription system 10 is an information processing system that supports transcription and includes a management device 12, a plurality of user terminals 14, and a plurality of worker devices 16. The devices of the transcription system 10 are connected via a communication network 18 that includes a LAN, a WAN, the Internet, and the like. Transcription means converting the content of a voice into text and is also referred to as tape transcription.

The management device 12 is an information processing device that provides a transcription web service to the plurality of user terminals 14. The detailed functions of the management device 12 are described later.

The plurality of user terminals 14 are information processing devices operated by users of the transcription service. They include a user terminal 14a operated by a user a belonging to Company A, a user terminal 14b operated by a user b belonging to Company B, and a user terminal 14c operated by a user c belonging to Company C. A user terminal 14 may be a PC, a tablet terminal, or a smartphone.

The plurality of worker devices 16 are the information processing devices of the parties that perform transcription. In the embodiment, a human listens to the voice and converts it into text. The plurality of worker devices 16 include a worker device 16x operated by a worker x, a worker device 16y operated by a worker y, and a worker device 16z operated by a worker z. A worker device 16 may be a PC, a tablet terminal, or a smartphone.

FIG. 2 is a block diagram showing the functional blocks of the management device 12 of FIG. 1. Each block shown in the block diagrams of this specification can be realized in hardware by elements such as a computer processor, CPU, and memory, by electronic circuits, or by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The management device 12 includes a control unit 20, a storage unit 22, and a communication unit 24. The control unit 20 executes the various data processing needed to provide the transcription service. The storage unit 22 stores data that the control unit 20 refers to or updates. The communication unit 24 communicates with external devices according to a predetermined communication protocol. The control unit 20 exchanges data with the user terminals 14 and the worker devices 16 via the communication unit 24.

The storage unit 22 includes a target voice storage unit 30, a dummy voice storage unit 32, an allocation rule storage unit 34, a distribution data storage unit 36, a work result storage unit 38, a sentence storage unit 40, a correct answer storage unit 42, and an evaluation storage unit 44. At least part of the data stored in the storage unit 22 may of course be stored in a storage device (not shown) separate from the management device 12, in which case the management device 12 may refer to and update the data stored in that external storage device.

The target voice storage unit 30 stores target voice data, which is voice data received from a user terminal 14 and in which the voice to be transcribed (hereinafter also called the "target voice") is recorded. The dummy voice storage unit 32 stores dummy voice data, which is not voice data received from a user terminal 14 and in which a dummy voice is recorded. The dummy voice is a voice whose content is determined in advance by an administrator of the management device 12 (for example, a person in charge at the company providing the transcription service).

The correct answer storage unit 42 stores text data indicating the content of the dummy voice. In the embodiment, the dummy voice data is divided in advance into voice data for each of a plurality of sections (hereinafter also called "section voice data"), and the dummy voice storage unit 32 stores the plurality of section voice data derived from the dummy voice data. The correct answer storage unit 42 stores text data indicating the content of each of those section voice data (hereinafter also called "correct answer data").

The allocation rule storage unit 34 stores rules (hereinafter also called "allocation rules") for allocating to workers the section voice data obtained by dividing the target voice data and the section voice data obtained by dividing the dummy voice data. The allocation rules are described later in connection with the allocation unit 56.

The distribution data storage unit 36 stores distribution data, which is the data to be distributed to each of the plurality of workers and which contains one or more section voice data. For example, the distribution data storage unit 36 stores distribution data for the worker x (worker device 16x), distribution data for the worker y (worker device 16y), and distribution data for the worker z (worker device 16z).

The work result storage unit 38 stores the text data resulting from transcription by the worker x (worker device 16x), the text data resulting from transcription by the worker y (worker device 16y), and the text data resulting from transcription by the worker z (worker device 16z).

The sentence storage unit 40 stores text data indicating the content of the entire target voice (hereinafter also called "sentence data"), generated by the sentence generation unit 62 described later. The evaluation storage unit 44 stores evaluation results for the plurality of workers, generated by the evaluation unit 66 described later.

The control unit 20 includes a request reception unit 50, a conversion unit 52, a division unit 54, an allocation unit 56, a distribution unit 58, a work result reception unit 60, a sentence generation unit 62, a sentence providing unit 64, and an evaluation unit 66. A computer program implementing the functions of these functional blocks may be stored in the storage unit 22. The processor of the management device 12 may provide the functions of the functional blocks of the control unit 20 by reading that computer program into main memory and executing it.

The request reception unit 50 receives, from the plurality of user terminals 14, a plurality of request data requesting transcription of a voice. The request reception unit 50 stores each received request data in the target voice storage unit 30 in association with the requesting user or user terminal 14. The request data received from the user terminal 14a contains target voice data in which a voice relating to Company A (remarks by the president, audio of a meeting, and so on) is recorded. Likewise, the request data received from the user terminal 14b contains target voice data in which a voice relating to Company B is recorded, and the request data received from the user terminal 14c contains target voice data in which a voice relating to Company C is recorded.

The conversion unit 52 homogenizes the voice quality (pitch, sound pressure, timbre, and so on) of the plurality of target voice data received by the request reception unit 50 by converting at least one of them with a known voice conversion function. This makes it difficult for a worker who listens to several section voice data to tell whether they originate from the same target voice, which increases the confidentiality of the content of the target voice.

In the embodiment, the conversion unit 52 converts the voice quality of the plurality of target voice data received by the request reception unit 50 so that it is identical or similar to the voice quality of the dummy voice data. This makes it difficult for a worker who listens to several section voice data to tell whether they originate from the same target voice, and also whether a given section is a dummy voice, which further increases the confidentiality of the content of the target voice.

The division unit 54 divides the target voice data stored in the target voice storage unit 30 into a plurality of section voice data corresponding to a plurality of sections. FIG. 3 shows an example of division of voice data. The division unit 54 divides target voice data Aa of Company A into three pieces: section voice data Aa-1, Aa-2, and Aa-3. The division unit 54 also divides target voice data Ab of Company A into three pieces: section voice data Ab-1, Ab-2, and Ab-3. Target voice data Ba and target voice data Bb of Company B are divided in the same way.

As already mentioned, in the embodiment the dummy voice data is divided into a plurality of section voice data in advance. In FIG. 3, for example, dummy voice data Ca is divided into two pieces, section voice data Ca-1 and section voice data Ca-2. As a modification, the division unit 54 may divide the dummy voice data into a plurality of section voice data at the time it divides the target voice data.

For each of the plurality of section voice data, the division unit 54 stores in the storage unit 22 position information (for example, the ordinal position from the beginning or the time position) within the target voice data or dummy voice data before division. For example, the division unit 54 may store, for section voice data Aa-1, information indicating that it is the first section of target voice data Aa, and for section voice data Aa-2, information indicating that it is the second section of target voice data Aa.

FIG. 4 shows an example of voice data as a waveform: the horizontal axis is the elapsed time from the start of the voice and the vertical axis is the volume. The division unit 54 holds a predetermined minimum time and maximum time (in other words, a maximum length) that a section may take. In the embodiment, the minimum section time is 10 seconds (end range start point 70 in FIG. 4) and the maximum section time is 20 seconds (end range end point 72 in FIG. 4). The shorter the sections, the higher the confidentiality of the voice content but the lower the accuracy of the transcription. The minimum and maximum times may be set to appropriate values by weighing the confidentiality of the voice content against the accuracy of the transcription.

When determining the end position of a section in the target voice data, the division unit 54 selects a time point that lies between the predetermined minimum time and maximum time (inclusive) from the start of the section and at which the volume is below a predetermined threshold. In the example of FIG. 4, the time point 15.5 seconds after the start of the voice is chosen as the end position of the section (division point 74). For the next section, the division unit 54 takes division point 74 as the start position and chooses as the end position a time point that lies 10 to 20 seconds after division point 74 and at which the volume is below the threshold. The volume threshold may be a value regarded as silence, or a value expected in a quiet room. For example, the volume threshold may be 0.002 pascals (40 decibels).
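The two threshold figures quoted above agree with each other if the decibel value is read as a sound pressure level relative to the conventional reference pressure in air, p_0 = 20 micropascals. That reference is an assumption here; the specification does not state it.

```latex
L_p = 20 \log_{10}\!\left(\frac{p}{p_0}\right)\,\mathrm{dB}
    = 20 \log_{10}\!\left(\frac{0.002\ \mathrm{Pa}}{2 \times 10^{-5}\ \mathrm{Pa}}\right)\,\mathrm{dB}
    = 20 \log_{10}(100)\,\mathrm{dB}
    = 40\ \mathrm{dB}
```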

In the target voice, the volume tends to drop at breaks between words and at breaks in meaning. By taking a position where the volume is below the threshold as the end position of a section, the embodiment can place section boundaries at word or meaning breaks, which improves the accuracy of the transcription.
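The section-boundary search described for the division unit 54 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_points`, the representation of the voice as a list of volume samples taken every `sample_period` seconds, the choice of the first quiet sample in the window, and the fallback to the maximum length when no quiet sample exists are all assumptions that the specification does not fix.

```python
from typing import List, Sequence


def split_points(volume: Sequence[float], sample_period: float,
                 min_len: float = 10.0, max_len: float = 20.0,
                 threshold: float = 0.002) -> List[float]:
    """Return the end time (in seconds) of each section.

    volume[i] is the volume at time i * sample_period (hypothetical input format).
    """
    total = len(volume) * sample_period
    points: List[float] = []
    start = 0.0
    # Keep cutting while the remainder is longer than one maximum-length section.
    while total - start > max_len:
        lo = int(round((start + min_len) / sample_period))
        hi = min(int(round((start + max_len) / sample_period)), len(volume) - 1)
        # Assumption: take the first sample in [start + min_len, start + max_len]
        # whose volume is below the threshold; if none exists, cut at max_len.
        cut = next((i for i in range(lo, hi + 1) if volume[i] < threshold), hi)
        start = cut * sample_period
        points.append(start)
    # Assumption: whatever remains becomes the last section, even if it is
    # shorter than min_len.
    points.append(total)
    return points
```

With a 60-second recording sampled every 0.5 seconds that is quiet only at 15.5 s and 27.5 s, this sketch cuts at 15.5 s, 27.5 s, and then at 47.5 s (the maximum length, because no quiet point falls in that window), and closes the last section at 60 s.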

Returning to FIG. 2, the allocation unit 56 allocates to each of the plurality of workers, according to the allocation rules stored in the allocation rule storage unit 34, a set of section voice data derived from target voice data and section voice data derived from dummy voice data. The allocation unit 56 stores the set allocated to each worker in the distribution data storage unit 36.

The allocation rules of the embodiment are defined so that the section voice data allocated to any one worker are temporally and spatially distant from one another. Specifically, rule (1) provides that, when there are a plurality of worker devices 16, one worker device 16 is allocated only some of the section voice data originating from one target voice data. In other words, the rule prohibits allocating all section voice data derived from one target voice data to a single worker device 16. This increases the confidentiality of the content of the target voice.

Rule (2) prohibits allocating to one worker device 16 a plurality of section voice data that are temporally consecutive within one target voice data. In other words, section voice data that are consecutive within one target voice data must be allocated to different workers. For example, section voice data Aa-1 and section voice data Aa-2 of FIG. 3 may not be allocated to the same worker and must be allocated to different workers. This further increases the confidentiality of the content of the target voice.

Rule (3) provides that, when a plurality of section voice data are provided to one worker device 16, providing section voice data that originate from different target voice data takes priority over providing section voice data that originate from the same target voice data. Section voice data originating from different target voice data are likely to be unrelated in content, so the confidentiality of the content of each target voice is increased still further.

Rule (4) provides that, when section voice data originating from different target voice data are provided to one worker device 16, providing section voice data originating from target voice data of different organizations takes priority over providing section voice data originating from different target voice data of the same organization. Section voice data originating from target voice data of different organizations are even more likely to be unrelated in content, so the confidentiality of the content of each target voice is increased still further.
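Rules (1) to (4) above can be sketched as a simple greedy allocator. This is only one possible reading, under stated assumptions: the names `Section` and `assign`, the treatment of rules (1) and (2) as hard constraints and rules (3) and (4) as ordered preferences, the load-balancing tie-break, and the exemption of single-section recordings from rule (1) are all illustrative choices that the specification does not prescribe.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Section:
    org: str     # organization the recording belongs to, e.g. "A" (hypothetical field)
    source: str  # recording the section was cut from, e.g. "Aa"
    index: int   # 1-based position of the section within its recording


def assign(sections: List[Section], workers: List[str]) -> Dict[str, List[Section]]:
    totals = Counter(s.source for s in sections)
    held: Dict[str, List[Section]] = {w: [] for w in workers}

    for sec in sections:
        candidates = []
        for w in workers:
            same = [h for h in held[w] if h.source == sec.source]
            # Rule (2): never give one worker consecutive sections of a recording.
            if any(abs(h.index - sec.index) == 1 for h in same):
                continue
            # Rule (1): never give one worker every section of a recording
            # (assumption: a recording with a single section is exempt).
            if totals[sec.source] > 1 and len(same) + 1 == totals[sec.source]:
                continue
            same_org = sum(h.org == sec.org for h in held[w])
            # Rules (3) and (4) as preferences: fewest sections from the same
            # recording first, then fewest from the same organization, then
            # the lightest load, then worker order.
            candidates.append((len(same), same_org, len(held[w]), workers.index(w), w))
        if not candidates:
            raise ValueError(f"no worker can take {sec}")
        held[min(candidates)[-1]].append(sec)
    return held
```

A greedy pass like this can fail (raising `ValueError`) when there are too few workers for the hard constraints; a production allocator would need backtracking or a different strategy, which is outside what the text describes.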

FIG. 5 shows an example of allocation of section voice data. The section voice data in FIG. 5 correspond to those shown in FIG. 3. In this example, the allocation unit 56 allocates to the worker x section voice data Aa-1 originating from target voice data Aa, section voice data Ca-1 originating from dummy voice data Ca, and section voice data Bb-2 originating from target voice data Bb. The allocation unit 56 likewise allocates to each of the worker y and the worker z a plurality of section voice data originating from different target voice data of different organizations.

Returning to FIG. 2, the distribution unit 58 provides section voice data to each worker according to the allocation made by the allocation unit 56. Specifically, the distribution unit 58 provides the worker device 16 of each worker with the set of section voice data derived from target voice data and section voice data derived from dummy voice data that is stored in the distribution data storage unit 36.

In the embodiment, the distribution unit 58 transmits a web page for performing the transcription work (hereinafter also called a "work page") to the plurality of worker devices 16 and causes them to display it. The distribution unit 58 provides the work page for the worker x to the worker device 16x, the work page for the worker y to the worker device 16y, and the work page for the worker z to the worker device 16z. The distribution unit 58 may notify the worker device 16 of each worker of the URL of that worker's work page by e-mail or the like.

The distribution unit 58 includes, in the data of the work page for the worker x, the section voice data allocated to the worker x by the allocation unit 56 (in the example of FIG. 5, section voice data Aa-1, Ca-1, and Bb-2). Similarly, the distribution unit 58 includes in the data of the work page for the worker y (or the worker z) the section voice data allocated to the worker y (or the worker z) by the allocation unit 56. On each worker's work page, the distribution unit 58 places a button for playing back each section voice data, an area for entering the text transcribed from each section voice data, and a send button.

The work result reception unit 60 receives the work results of each worker transmitted from that worker's worker device 16. In the embodiment, the work result reception unit 60 receives the transcription result entered by the worker x on the work page for the worker x, the transcription result entered by the worker y on the work page for the worker y, and the transcription result entered by the worker z on the work page for the worker z. The work result reception unit 60 stores the work results of each worker in the work result storage unit 38.

FIG. 6 shows an example of the work results of the workers, based on the allocation of FIG. 5. For example, the work result of worker x includes text data Aa-1, text data Ca-1, and text data Bb-2. Text data Aa-1 is a transcription of the audio of section voice data Aa-1, which originates from target voice data Aa. Text data Ca-1 is a transcription of the audio of section voice data Ca-1, which originates from dummy voice data Ca. Text data Bb-2 is a transcription of the audio of section voice data Bb-2, which originates from target voice data Bb.

Returning to FIG. 2, the sentence generation unit 62 generates sentence data transcribing the entire target voice. To do so, it uses, among the text data received by the work result reception unit 60 and stored in the work result storage unit 38, only the text data transcribed from section voice data that originates from target voice data. When generating the sentence data, the sentence generation unit 62 does not use text data transcribed from section voice data that originates from dummy voice data.

The sentence generation unit 62 generates the sentence data by combining the text data based on the plurality of section voice data in accordance with the position information of each section voice data (that is, its position within the target voice data), which the dividing unit 54 has stored in the storage unit 22. The sentence generation unit 62 stores the sentence data for a given target voice in the sentence storage unit 40 in association with the user (or user terminal 14) who requested transcription of that target voice.

Text data Aa-1, included in the work result of worker x in FIG. 6, is the text of section voice data Aa-1, which corresponds to the first section of target voice data Aa. Text data Aa-2, included in the work result of worker y in FIG. 6, is the text of section voice data Aa-2, which corresponds to the second section of target voice data Aa. Text data Aa-3, included in the work result of worker z in FIG. 6, is the text of section voice data Aa-3, which corresponds to the third section of target voice data Aa. The sentence generation unit 62 combines text data Aa-1, text data Aa-2, and text data Aa-3 in this order to generate sentence data Aa, in which the whole of target voice data Aa is converted to text.
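The combining step can be illustrated with a minimal sketch. It assumes each received transcription carries the identifier of the audio it originates from, its section position, and a dummy flag; the `SegmentText` record, its field names, and `build_sentence` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class SegmentText:
    source_id: str   # hypothetical: "Aa" for a target audio, "Ca" for a dummy audio
    position: int    # 1-based section index within the source audio
    is_dummy: bool   # True when the source is dummy voice data
    text: str        # transcription returned by a worker


def build_sentence(results: list[SegmentText], source_id: str) -> str:
    """Join the transcriptions of one target audio in section order, skipping dummy text."""
    parts = [r for r in results if r.source_id == source_id and not r.is_dummy]
    parts.sort(key=lambda r: r.position)
    return "".join(p.text for p in parts)


# Results arrive per worker, so they are not in section order.
results = [
    SegmentText("Aa", 3, False, "three."),
    SegmentText("Ca", 1, True, "dummy text"),
    SegmentText("Aa", 1, False, "One, "),
    SegmentText("Aa", 2, False, "two, "),
]
print(build_sentence(results, "Aa"))
```

Sorting by the stored position is what lets the sections of one audio be handed to different workers and still be reassembled correctly.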

Returning to FIG. 2, the sentence provision unit 64 transmits the sentence data stored in the sentence storage unit 40 to the user (user terminal 14) who requested the transcription. For example, the sentence provision unit 64 transmits sentence data Aa, the transcription of target voice data Aa shown in FIG. 6, to user a (user terminal 14a), who requested that transcription. Alternatively, upon receiving a request for sentence data from the user terminal 14a, the sentence provision unit 64 may transmit to the user terminal 14a the sentence data associated with user a from among the plurality of sentence data stored in the sentence storage unit 40.

The evaluation unit 66 evaluates each worker by comparing the correct answer data stored in advance in the correct answer storage unit 42 with that worker's transcription of the dummy voice (text data received by the work result reception unit 60 and stored in the work result storage unit 38). For example, the evaluation unit 66 evaluates worker x by comparing the correct answer data for section voice Ca-1, which originates from dummy voice data Ca, with text data Ca-1, which is worker x's transcription of section voice Ca-1.

In the embodiment, the evaluation unit 66 uses morphological analysis to extract the morphemes that make up the correct answer data and the morphemes that make up the text data of the work result. The more morphemes the two have in common, the higher the evaluation unit 66 rates the worker's transcription accuracy, and the higher the evaluation value it assigns to the worker. The evaluation unit 66 may also refer to a synonym dictionary and treat a morpheme in the correct answer data and a morpheme in the work result as matching, even if they differ, provided they are synonyms. In this way, the embodiment allows workers to be evaluated objectively based on their transcriptions of the dummy voice.
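A minimal sketch of this scoring follows. It assumes the morphemes have already been extracted by a morphological analyzer, with whitespace-separated English words standing in for them here. The function name `score_worker`, the form of the synonym dictionary, and the normalization by reference length are illustrative assumptions; the embodiment only states that more matching morphemes yield a higher evaluation value.

```python
from collections import Counter
from typing import Optional


def score_worker(reference: list[str], hypothesis: list[str],
                 synonyms: Optional[dict[str, set[str]]] = None) -> float:
    """Return the fraction of reference morphemes matched by the worker's morphemes.

    `synonyms` maps a worker morpheme to reference morphemes treated as equivalent.
    """
    synonyms = synonyms or {}
    remaining = Counter(reference)  # each reference morpheme can be matched once
    matched = 0
    for token in hypothesis:
        # Try an exact match first, then any listed synonym.
        for candidate in [token, *sorted(synonyms.get(token, ()))]:
            if remaining[candidate] > 0:
                remaining[candidate] -= 1
                matched += 1
                break
    return matched / len(reference) if reference else 0.0


reference = ["the", "meeting", "starts", "at", "noon"]
hypothesis = ["the", "meeting", "begins", "at", "noon"]
print(score_worker(reference, hypothesis))
print(score_worker(reference, hypothesis, {"begins": {"starts"}}))
```

Counting matches as a multiset ignores word order, which mirrors the "number of matching morphemes" criterion but would not penalize scrambled output.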

The evaluation unit 66 stores the evaluation result (evaluation value) of each of the plurality of workers in the evaluation storage unit 44. The management device 12 may further include an evaluation result output unit (not shown) that provides the evaluation results stored in the evaluation storage unit 44 to an external device. The external device in this case may be, for example, the terminal of a person in charge of negotiating fees or contracts with workers.

The operation of the transcription system 10 configured as above will now be described. Each of the plurality of users of the transcription system 10 uploads target voice data from a user terminal 14 to the management device 12. The request reception unit 50 of the management device 12 receives the plurality of target voice data transmitted from the plurality of user terminals 14. The conversion unit 52 of the management device 12 converts the voice quality of the plurality of target voice data into a predetermined reference voice quality (in the embodiment, the same voice quality as the dummy voice data; it may be the voice quality of synthesized speech).

The dividing unit 54 of the management device 12 divides each of the plurality of target voice data into a plurality of section voice data. The allocation unit 56 of the management device 12 assigns to each worker a set consisting of section voice data of target voice data and section voice data of dummy voice data. The distribution unit 58 of the management device 12 presents this set to each worker on that worker's web page.

Each worker plays back the assigned section voice data on his or her own web page and enters text representing the audio content into the predetermined area of the web page. When the worker presses the send button on the web page, the worker device 16 transmits the text data entered in the predetermined area to the management device 12.

The work result reception unit 60 of the management device 12 receives, from each worker's worker device 16, the text data transcribed from the section voice of the target voice data and the text data transcribed from the section voice of the dummy voice data. The sentence generation unit 62 of the management device 12 combines the text data transcribed from the section voices of the target voice data to generate sentence data in which the audio of the entire target voice data is converted to text.

The sentence provision unit 64 of the management device 12 transmits the sentence data corresponding to each target voice data to the user terminal 14 that uploaded it. The user obtains the sentence data corresponding to the uploaded target voice data and proceeds with his or her work. The evaluation unit 66 of the management device 12 determines the evaluation value of each worker based on the transcription results for the dummy voice data, whose content is known in advance.

The present invention has been described above based on an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications of the combinations of its constituent elements and processing steps are possible and that such modifications also fall within the scope of the present invention. Modifications are described below.

A first modification will be described. When the dividing unit 54 of the management device 12 divides the target voice data to generate first section voice data for a first section and second section voice data for a second section immediately following the first section, it may make part of the first section overlap part of the second section. In other words, the dividing unit 54 may provide a time region that serves as an overlap margin between the first section and the second section.

FIG. 7 shows an example of dividing voice data. Here, target voice data 80 is divided into four pieces: section voice data 82a, section voice data 82b, section voice data 82c, and section voice data 82d. The dividing unit 54 provides overlapping periods 84a and 84b between section voice data 82a and section voice data 82b, overlapping periods 84c and 84d between section voice data 82b and section voice data 82c, and overlapping periods 84e and 84f between section voice data 82c and section voice data 82d. Here, each of overlapping periods 84a through 84f is 2.5 seconds long.

In the example of FIG. 7, section voice data 82a is the audio of the 15-second section beginning at the start point of target voice data 80. The end position of a section may be determined by the method described in the embodiment. In this section, the final 5 seconds form the overlapping period (overlapping period 84a plus overlapping period 84b). Section voice data 82b is the audio of the section from 10 seconds to 25 seconds after the start point of target voice data 80. In this section, the first 5 seconds and the final 5 seconds are overlapping periods. Section voice data 82a and section voice data 82b are assigned to different workers, so the audio in overlapping periods 84a and 84b is transcribed by both workers.

Section voice data 82c is the audio of the section from 20 seconds to 35 seconds after the start point of target voice data 80. In this section, the first 5 seconds and the final 5 seconds are overlapping periods. Section voice data 82d is the audio of the section from 30 seconds to 45 seconds after the start point of target voice data 80. In this section, the first 5 seconds are an overlapping period.
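The section boundaries of FIG. 7 can be reproduced with a short sketch. It assumes fixed 15-second sections and a fixed 5-second overlap; the modification itself allows the end position of a section to be determined by the method of the embodiment, so the fixed lengths and the function name `split_with_overlap` are simplifying assumptions.

```python
def split_with_overlap(total: float, length: float = 15.0,
                       overlap: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds of sections that overlap their neighbors."""
    if total <= 0:
        raise ValueError("total duration must be positive")
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and shorter than the section length")
    step = length - overlap          # distance between consecutive section starts
    sections = []
    start = 0.0
    while True:
        end = min(start + length, total)
        sections.append((start, end))
        if end >= total:             # the last section reaches the end of the audio
            break
        start += step
    return sections


print(split_with_overlap(45.0))
```

For a 45-second audio this yields the four sections 0 to 15, 10 to 25, 20 to 35, and 30 to 45 seconds described for section voice data 82a through 82d.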

For the text data of temporally consecutive first section voice data (for example, section voice data 82a) and second section voice data (for example, section voice data 82b), the sentence generation unit 62 combines the two pieces of text data so that a predetermined number of characters (or a predetermined number of morphemes) in the overlapping period coincide.

In addition, as the text data for each overlapping period, the sentence generation unit 62 adopts the text of the section voice data in which that period lies farther from the edge of the section. In other words, the text data corresponding to the edge of each section voice data (for example, a predetermined number of characters or morphemes) is not reflected in the combined sentence data. For example, the sentence generation unit 62 adopts the text data based on section voice data 82a for overlapping period 84a, and the text data based on section voice data 82b for overlapping period 84b. Similarly, the sentence generation unit 62 adopts the text data based on section voice data 82b for overlapping period 84c, and the text data based on section voice data 82c for overlapping period 84d.
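The selection rule can be sketched as follows, under a strong assumption that is not part of the modification: every transcribed word is tagged with its time in seconds within the target voice data. Given such timestamps, cutting each overlap at its midpoint keeps, for every instant, the text of the section whose edge is farther away. The name `merge_sections` and the sample data are hypothetical.

```python
import math


def merge_sections(sections):
    """Merge transcriptions of overlapping sections.

    `sections` is a list of (start, end, tokens) sorted by start time, where
    tokens is a list of (time, text) pairs (timestamps are an assumption).
    """
    merged = []
    for i, (start, end, tokens) in enumerate(sections):
        # Cut each overlap at its midpoint; the outer edges of the audio are open.
        lo = -math.inf if i == 0 else (start + sections[i - 1][1]) / 2
        hi = math.inf if i == len(sections) - 1 else (sections[i + 1][0] + end) / 2
        merged.extend(text for time, text in tokens if lo <= time < hi)
    return merged


sections = [
    # Section 82a (0 to 15 s): the word near its end is unreliable.
    (0.0, 15.0, [(2.0, "good"), (11.0, "morning"), (14.0, "evry")]),
    # Section 82b (10 to 25 s): the word near its start is unreliable.
    (10.0, 25.0, [(10.5, "mourning"), (14.0, "everyone"), (20.0, "welcome")]),
]
print(merge_sections(sections))
```

With the 10 to 15 second overlap cut at 12.5 seconds, the first half (period 84a) comes from section 82a and the second half (period 84b) from section 82b, as in the example above. Without timestamps, the alignment would have to rely on matching characters or morphemes within the overlap.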

While carrying out a proof of concept (PoC) on transcription, the present inventor recognized that the accuracy of text conversion decreases at the beginning and end of section voice data. In this modification, therefore, an overlapping period is provided between temporally consecutive first and second section voice data, and the portions of the text data of the first and second section voice data that are considered to have been transcribed more accurately are reflected in the sentence data. This improves the accuracy of the sentence data.

A second modification will be described. In the embodiment above, a person listens to the target voice and the dummy voice and transcribes them, but as a modification, a computer (at least some of the worker devices 16) may execute the transcription process automatically. In this case, the distribution unit 58 may call a transcription request API that the worker device 16 exposes on the network and transmit one or more section voice data (for example, the distribution data of FIG. 5) to the worker device 16. The work result reception unit 60 may receive the text data of the transcription result as the return value of the transcription request API of the worker device 16.

A third modification, related to the second modification, will be described. Transcription may be performed both by a computer and by a person. Specifically, the distribution unit 58 of the management device 12 may first transmit one or more section voice data (for example, the distribution data of FIG. 5) to a first worker device that executes the transcription process automatically, and the work result reception unit 60 may acquire the result of the transcription process from the first worker device. Next, the distribution unit 58 may transmit the result of the transcription process by the first worker device to a second worker device on which transcription is performed manually, and the work result reception unit 60 may acquire the result of the manual transcription (here, checking and editing) from the second worker device. With this configuration, the person's role is to check and edit the result of the computer's transcription, so the accuracy of the transcription can be improved while labor costs are kept down.

A fourth modification will be described. Although not mentioned in the embodiment above, the allocation unit 56 may assign the transcription of section voice data preferentially to workers with higher evaluation values stored in the evaluation storage unit 44. Likewise, the distribution unit 58 may distribute section voice data preferentially to workers with higher evaluation values stored in the evaluation storage unit 44. In other words, section voice data may be assigned or distributed to workers whose stored evaluation values are relatively high in preference to workers whose evaluation values are relatively low. This makes it easier to improve the accuracy of transcription.
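One possible reading of "preferentially" is sketched below: workers are ranked by evaluation value and sections are dealt out in that order, so higher-rated workers receive the first and any surplus sections. The round-robin policy, the function name `allocate_by_rating`, and the sample values are assumptions; the modification does not prescribe a specific allocation rule.

```python
def allocate_by_rating(segments: list[str],
                       ratings: dict[str, float]) -> dict[str, list[str]]:
    """Deal sections to workers in descending order of evaluation value."""
    if not ratings:
        raise ValueError("at least one worker is required")
    ranked = sorted(ratings, key=lambda worker: ratings[worker], reverse=True)
    allocation: dict[str, list[str]] = {worker: [] for worker in ranked}
    for i, segment in enumerate(segments):
        # Round-robin over the ranked list: the best-rated worker is served first.
        allocation[ranked[i % len(ranked)]].append(segment)
    return allocation


ratings = {"x": 0.9, "y": 0.7, "z": 0.8}          # hypothetical evaluation values
segments = ["Aa-1", "Aa-2", "Aa-3", "Aa-4"]
print(allocate_by_rating(segments, ratings))
```

Dealing in rotation rather than filling one worker at a time also keeps consecutive sections of the same audio with different workers whenever at least two workers are available.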

A fifth modification will be described. Although not mentioned in the embodiment above, the evaluation of workers by the evaluation unit 66 may be executed before the sentence generation unit 62 generates the sentence data. When the evaluation value of a worker is below a predetermined threshold, the sentence generation unit 62 may refrain from generating sentence data using the transcription results (text data) of that worker (hereinafter called a "low-rated worker"). In this case, the allocation unit 56 may reassign the section voice data that had been assigned to the low-rated worker to another worker (a worker whose evaluation value is at or above the threshold). The distribution unit 58 may provide the section voice data that had been provided to the low-rated worker to that other worker and request transcription. This avoids providing the user with inaccurate transcription results and further improves the accuracy of transcription.
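The reassignment can be sketched as follows. The function name `reassign_low_rated`, the rotation among qualified workers, and the sample values are illustrative assumptions; the modification only requires that the sections go to a worker at or above the threshold.

```python
def reassign_low_rated(allocation: dict[str, list[str]],
                       ratings: dict[str, float],
                       threshold: float) -> dict[str, list[str]]:
    """Move the sections of workers rated below the threshold to qualified workers."""
    qualified = [w for w in allocation if ratings[w] >= threshold]
    if not qualified:
        raise ValueError("no worker meets the threshold")
    result = {w: list(allocation[w]) for w in qualified}
    turn = 0
    for worker, segments in allocation.items():
        if ratings[worker] >= threshold:
            continue
        for segment in segments:
            # Spread the orphaned sections over the qualified workers in rotation.
            result[qualified[turn % len(qualified)]].append(segment)
            turn += 1
    return result


allocation = {"x": ["Aa-1"], "y": ["Aa-2"], "z": ["Aa-3"]}
ratings = {"x": 0.9, "y": 0.4, "z": 0.8}          # hypothetical evaluation values
print(reassign_low_rated(allocation, ratings, threshold=0.6))
```

The low-rated worker's earlier text is simply dropped from the result; only the section identifiers are carried over for a fresh transcription request.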

A sixth modification will be described. Although not mentioned in the embodiment above, it is desirable that, in the distribution data for the same worker, the allocation unit 56 provide different dummy voice data (different at least as section voice data) for at least a predetermined period. For example, the allocation unit 56 may store, for each worker, the identification information of the dummy voice data (or its section voice data) already assigned, and change the dummy voice data (section voice data) with each request. This makes it difficult for a worker to recognize the dummy voice and enhances the confidentiality of the content of the target voice.
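A minimal sketch of the per-worker history follows. The class name `DummyPicker` and its method are hypothetical, and the expiry of the history after the "predetermined period" is omitted for brevity.

```python
class DummyPicker:
    """Hand each worker a dummy section that the worker has not received before."""

    def __init__(self, dummy_ids):
        self._dummy_ids = list(dummy_ids)
        self._used = {}              # worker -> set of dummy ids already assigned

    def pick(self, worker: str) -> str:
        used = self._used.setdefault(worker, set())
        for dummy_id in self._dummy_ids:
            if dummy_id not in used:
                used.add(dummy_id)   # remember the assignment for later requests
                return dummy_id
        raise LookupError("no unused dummy section is left for this worker")


picker = DummyPicker(["Ca-1", "Ca-2"])
print(picker.pick("x"), picker.pick("x"), picker.pick("y"))
```

Because the history is kept per worker, the same dummy section can still be reused across different workers.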

A seventh modification will be described. The management device 12 of the embodiment above provides a web page for each worker to the worker device 16 and, on that web page, has the worker play back the section voice data and enter the transcription result. In a modification, the distribution unit 58 of the management device 12 may transmit to the worker device 16 encrypted data obtained by encrypting the section voice data of the target voice and the section voice data of the dummy voice. The work result reception unit 60 of the management device 12 may receive from the worker device 16 encrypted data obtained by encrypting the text data of each worker's transcription result.

An eighth modification will be described. The number of dummy voice section voice data inserted among the target voice section voice data, or the ratio between target voice section voice data and dummy voice section voice data assigned to a worker, may be determined by the level of confidentiality the user requires for the target voice. That is, the allocation rule may be defined so that the higher the required confidentiality level, the more dummy voice section voice data are inserted into the distribution data. Alternatively, the allocation rule may be defined so that the ratio of dummy voice section voice data to target voice section voice data in the distribution data becomes higher. Because a larger insertion count or ratio increases confidentiality, the selling price of the transcription service may be set higher accordingly.
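Such an allocation rule might be expressed as a lookup from confidentiality level to dummy ratio. The level names and ratio values below are invented for illustration; the modification only requires that a higher level lead to more dummy sections.

```python
import math

# Hypothetical levels: dummy sections per target section in the distribution data.
DUMMY_RATIO = {"standard": 0.5, "high": 1.0, "maximum": 2.0}


def dummy_count(target_segments: int, level: str) -> int:
    """Return how many dummy sections to mix in for the requested confidentiality level."""
    if target_segments < 0:
        raise ValueError("target_segments must be non-negative")
    return math.ceil(target_segments * DUMMY_RATIO[level])


for level in DUMMY_RATIO:
    print(level, dummy_count(3, level))
```

Rounding up ensures that even the lowest level inserts at least one dummy section whenever any target section is distributed.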

A ninth modification will be described. The configuration of the transcription system 10 described in the embodiment above is one example, and its physical configuration (number of housings and so on) is of course not limited. For example, the function of dividing the original voice data and providing section voice data to the worker devices 16 and the function of collecting the workers' transcription results, generating sentence data, and providing it to the user may be implemented by separate devices.

Any combination of the embodiment and modifications described above is also useful as an embodiment of the present invention. A new embodiment arising from such a combination has the effects of each of the combined embodiment and modifications. Those skilled in the art will also understand that the function to be fulfilled by each constituent feature recited in the claims is realized by the individual constituent elements shown in the embodiment and modifications, or by their cooperation.

10 transcription system, 12 management device, 14 user terminal, 16 worker device, 52 conversion unit, 54 dividing unit, 58 distribution unit, 60 work result reception unit, 62 sentence generation unit, 64 sentence provision unit, 66 evaluation unit.

Claims

An information processing system comprising:
a first storage unit that stores target voice data in which a voice to be transcribed is recorded;
a second storage unit that stores dummy voice data in which a dummy voice is recorded;
a dividing unit that divides the target voice data into a plurality of section voice data relating to a plurality of sections;
a providing unit that provides an external device with a set of at least one of the plurality of section voice data and the dummy voice data;
a receiving unit that receives text data transcribed from at least one of the plurality of section voice data and text data transcribed from the dummy voice data; and
a generation unit that generates text data transcribing the target voice by using, among the text data received by the receiving unit, the text data transcribed from at least one of the plurality of section voice data.