JP2019090917A - Voice-to-text conversion device, method and computer program

Info

Publication number: JP2019090917A
Application number: JP2017219292A
Authority: JP
Inventors: Shojiro Shiraishi (白石昌二朗)
Original assignee: Research Institute of Information-Environment Design
Current assignee: Research Institute of Information-Environment Design
Priority date: 2017-11-14
Filing date: 2017-11-14
Publication date: 2019-06-13

Abstract

To provide accurate text data, in processing that converts voice data into text, without imposing a burden on the engine that converts the voice data into text data.

SOLUTION: A voice-to-text device 1 divides voice data at a prescribed unit time with a division processing unit 12, and a compression processing unit 13 generates compressed data by deleting the silent sections from the divided data. After the compressed data has been transmitted to a voice-to-text engine and pre-combination text data, obtained by converting the compressed data into text, has been received from that engine, a combination processing unit 15 generates post-combination text data by joining the pre-combination text data in the order it had before division.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for converting voice data into text.

In recent years, demand has grown for services that convert voice data into text. Voice data is recorded when, for example, meeting minutes are prepared, but text data is more convenient than voice data for later review.
In this regard, Patent Document 1, for example, proposes a method of transcribing call audio and distributing it as data, simultaneously, to a plurality of parties engaged in a voice call.
Patent Documents 2 and 3 propose devices with which an operator converts voice data into text data.

Patent Document 1: JP 2010-41301 A
Patent Document 2: JP 2008-9693 A
Patent Document 3: JP 2013-182353 A

However, the techniques described in the above patent documents convert the voice data into text with its silent sections included, which places a needless processing load on the device. In other words, silent sections yield no text, so running text conversion over them is wasted processing.

On the other hand, when long recordings or other voice data whose size tends to be large are divided and converted into text piece by piece, misconversion may occur at the division positions, and accurate text can no longer be expected.

Accordingly, an object of the present invention is to obtain accurate text data, in processing that converts voice data into text, without burdening the engine that converts the voice data into text data.

To achieve the above object, a voice-to-text device according to the present invention is a device for converting voice data into text that is configured to communicate, via a network, with a voice-to-text engine that converts the voice data into text, and is characterized by comprising: division processing means for dividing the voice data at a predetermined unit time to generate divided data; compression processing means for generating compressed data by deleting silent sections from the divided data; compressed data transmission means for transmitting the compressed data to the voice-to-text engine; pre-combination text data receiving means for receiving, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and combination processing means for generating post-combination text data by joining the pre-combination text data in the order it had before division.

The division means may generate the divided data as follows. When a break position, set at every predetermined unit time from the start position of the voice data, falls in a silent section, the voice data is divided at that silent section. When the break position falls in a voiced section, the voice data is divided at a silent section lying within a fixed time before that voiced section.

The device may further comprise volume adjustment means for adjusting the volume of the voice data to a predetermined volume, in which case the division means divides the volume-adjusted voice data at the predetermined unit time to generate the divided data.

The device may further comprise identification information issuing means for issuing identification information for the compressed data, in which case the combination processing means generates, on the basis of the identification information, the post-combination text data by joining the pre-combination text data in the order it had before division.

The device may also be configured to communicate, via the network, with a user terminal used by a user, and may further comprise voice data receiving means for receiving the voice data from the user terminal and post-combination text data transmission means for transmitting the post-combination text data to the user terminal.

A voice-to-text method according to another aspect of the present invention is a method for converting voice data into text, characterized in that a computer configured to communicate, via a network, with a voice-to-text engine that converts the voice data into text executes: division processing that divides the voice data at a predetermined unit time to generate divided data; compression processing that generates compressed data by deleting silent sections from the divided data; compressed data transmission processing that transmits the compressed data to the voice-to-text engine; pre-combination text data reception processing that receives, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and combination processing that generates post-combination text data by joining the pre-combination text data in the order it had before division.

A computer program according to another aspect of the present invention is a computer program for converting voice data into text that causes a computer configured to communicate, via a network, with a voice-to-text engine that converts the voice data into text to execute: division processing that divides the voice data at a predetermined unit time to generate divided data; compression processing that generates compressed data by deleting silent sections from the divided data; compressed data transmission processing that transmits the compressed data to the voice-to-text engine; pre-combination text data reception processing that receives, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and combination processing that generates post-combination text data by joining the pre-combination text data in the order it had before division.

With the voice-to-text device according to the present invention, accurate text data can be obtained, in processing that converts voice data into text, without burdening the engine that converts the voice data into text data.

FIG. 1 is a functional block diagram showing the functions of a voice-to-text device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the data stored in the identification information storage unit of the voice-to-text device according to the embodiment.
FIG. 3 is a conceptual diagram explaining the processing executed by the voice-to-text device according to the embodiment, showing (a) the case where the break position falls in a silent section and (b) the case where the break position falls in a voiced section.
FIG. 4 is a conceptual diagram explaining the processing executed by the voice-to-text device according to the embodiment, showing the voice data divided at a silent section.
FIG. 5 is a conceptual diagram explaining the processing executed by the voice-to-text device according to the embodiment, showing the state in which compressed data has been generated.
FIG. 6 is a sequence diagram showing the flow of a series of processes executed by the voice-to-text device according to the embodiment.
FIG. 7 is a functional block diagram showing the functions of a user terminal provided with a voice-to-text processing unit according to another embodiment of the present invention.
FIG. 8 is a sequence diagram showing the flow of a series of processes executed by the user terminal provided with the voice-to-text processing unit according to that embodiment.
FIG. 9 is a conceptual diagram explaining the processing executed by the user terminal provided with the voice-to-text processing unit according to that embodiment, showing (a) the state in which a break position is set, (b) the state in which the voice data is divided at a silent section, and (c) the state in which compressed data has been generated.
FIG. 10 is a sequence diagram showing the flow of processing, in another embodiment of the present invention, when voice data is converted into text as it is produced.

A voice-to-text device according to an embodiment of the present invention is described below with reference to the drawings.
The voice-to-text device is a device for converting voice data into text: it performs predetermined processing on the voice data and then sends the processed voice data to a voice-to-text engine, which converts it into text.

The voice-to-text device 1 shown in FIG. 1 is configured to communicate with a voice-to-text engine 2 and a user terminal 3 via a network NW such as the Internet.

Using an arithmetic unit such as a CPU (Central Processing Unit) and storage devices such as RAM (Random Access Memory) and ROM (Read Only Memory), the voice-to-text device 1 implements functional blocks consisting of an identification information storage unit 1A, a volume adjustment unit 11, a division processing unit 12, a compression processing unit 13, an identification information issuing unit 14, a combination processing unit 15, and a communication processing unit 16.

The identification information storage unit 1A is a storage unit that stores the identification information that the identification information issuing unit 14 has issued and attached to the compressed data, together with information about that compressed data.
As shown in FIG. 2, for example, the identification information storage unit 1A stores, in addition to the identification information issued and attached to each piece of compressed data (described later), the file name, file size, creation date, and so on of the compressed data.
Here, the identification information identifies the individual pieces of compressed data and makes their combination order ascertainable; it is referred to during the combination processing executed by the combination processing unit 15.

The volume adjustment unit 11 executes processing that adjusts the volume of the voice data to a predetermined volume.
This processing, for example, holds the voice data to -10 dB. As a result, even if the voice data contains portions of differing volume before processing, misconversion in the text conversion processing can be prevented.
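The volume adjustment can be illustrated with a minimal Python sketch. The text states only the -10 dB figure, so the choices here are assumptions: the level is taken relative to digital full scale, the adjustment is a peak normalization, and samples are floats in the range -1.0 to 1.0. The function name `normalize_peak` is hypothetical.

```python
def normalize_peak(samples, target_db=-10.0):
    """Scale float samples so that the loudest one sits at target_db (dB re full scale)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # entirely silent input: nothing to scale
    target_amplitude = 10.0 ** (target_db / 20.0)  # -10 dB is roughly 0.316
    gain = target_amplitude / peak
    return [s * gain for s in samples]
```

Because every recording ends up with the same peak level, a single fixed threshold can then be used to separate voiced and silent sections in the later steps.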

The division processing unit 12 divides the voice data at a predetermined unit time to generate divided data.
In this processing, the voice data is marked off at every predetermined unit time from the start position. When a break position falls in a silent section, the voice data is divided at that silent section; when a break position falls in a voiced section, the voice data is divided at a silent section lying within a fixed time before the break position, and divided data is generated.
Note that the division processing unit 12 in this example generates the divided data by dividing the voice data that the volume adjustment unit 11 has adjusted to the predetermined volume.

A specific example of the division of voice data by the division processing unit 12 is now described with reference to FIG. 3.
The voice data 100 consists of voiced sections 101a, 101b, 101c, 101d, and 101e and silent sections 102a, 102b, 102c, 102d, and 102e. For convenience, the voiced sections 101a to 101e are collectively called the voiced sections 101 below when the individual sections are not at issue. Likewise, the silent sections 102a to 102e may be called the silent sections 102.
A voiced section 101 is a portion whose volume is at or above a certain level. A silent section 102 is a portion that is silent, or whose volume is below that level and is therefore regarded as silent. The volume threshold that separates the voiced sections 101 from the silent sections 102 can be set arbitrarily.
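The threshold-based distinction between voiced and silent sections can be sketched as follows. This is an illustration only: it assumes the audio has already been reduced to one level value per fixed-length frame, and the function name `find_silent_sections`, the frame length, and the threshold value are all hypothetical.

```python
def find_silent_sections(levels, threshold, frame_sec=1.0):
    """Return (start_sec, end_sec) pairs for runs of frames whose level is below threshold."""
    sections = []
    run_start = None
    for i, level in enumerate(levels):
        if level < threshold:
            if run_start is None:
                run_start = i  # a silent run begins at this frame
        elif run_start is not None:
            sections.append((run_start * frame_sec, i * frame_sec))
            run_start = None
    if run_start is not None:  # the data ends while still silent
        sections.append((run_start * frame_sec, len(levels) * frame_sec))
    return sections
```

Everything outside the returned intervals counts as voiced, which matches the description of the threshold as a freely adjustable parameter.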

For such voice data, the division processing unit 12 sets break positions one after another, taking 50 sec from the start position of the voice data as the unit time.
The example of FIG. 3(a) shows the case where the break position 100a falls in the silent section 102b. In this case, the voice data 100 is divided with the break position 100a, or the silent section 102b, as the split position.
The example of FIG. 3(b), on the other hand, shows the case where the break position 100b falls in the voiced section 101c. In this case, the voice data 100 is divided with the silent section 102b, which lies within 10 sec before the break position 100b, as the split position.
When the division processing unit 12 executes the division processing on the voice data 100 in this way, the voice data 100 becomes, as shown in FIG. 4, a plurality of pieces of divided data 110a and 110b divided at the chosen split position (the silent section 102b in the example of FIG. 3). Note that the divided data 110b includes the voice data for the time between the break position 100b and the split position in the silent section 102b.
Dividing the voice data 100 in this way prevents a word that carries a single coherent meaning from being split unnaturally. As a result, misconversion in the text conversion processing by the voice-to-text engine 2 can be prevented.
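The choice of split positions described above can be sketched in Python. The 50 sec unit time and 10 sec look-back come from the example in the text; everything else is an assumption made for illustration: silent sections are given as (start, end) pairs in seconds, the cut is placed at the midpoint of the usable silence, and the data is cut at the break position itself when no silence is found (the text does not say what happens in that case). The function name `choose_split_positions` is hypothetical.

```python
def choose_split_positions(total_sec, silent_sections, unit_sec=50.0, lookback_sec=10.0):
    """Return one split time per break position (break positions lie every unit_sec from the start)."""
    splits = []
    brk = unit_sec
    while brk < total_sec:
        split = None
        for start, end in silent_sections:
            if start <= brk <= end:
                split = brk  # FIG. 3(a): the break position itself is silent
                break
        if split is None:
            # FIG. 3(b): break position is voiced, so look back for the latest silence.
            candidates = [(s, e) for s, e in silent_sections
                          if brk - lookback_sec <= e <= brk]
            if candidates:
                s, e = max(candidates, key=lambda se: se[1])
                split = (max(s, brk - lookback_sec) + e) / 2.0  # midpoint (assumption)
        if split is None:
            split = brk  # fallback (assumption): no silence nearby
        splits.append(split)
        brk += unit_sec
    return splits
```

For a 120 sec recording with silences at 20 to 22 sec, 45 to 47 sec, and 98 to 103 sec, the first break position (50 sec) is voiced and the cut moves back to 46 sec, while the second (100 sec) is already silent and is used directly.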

The compression processing unit 13 generates compressed data by deleting the silent sections from the divided data generated by the division processing unit 12.
A specific example of this processing is shown in FIG. 4 and FIG. 5.
FIG. 4 shows the state in which, through the processing continuing from the example of FIG. 3, the voice data 100 has been divided between the voiced section 101b and the voiced section 101c. The silent section 102b has been split into silent sections 102b_1 and 102b_2.
From this state, the compression processing unit 13 deletes the silent sections 102a, 102b_1, 102b_2, 102c, and 102d, as shown in FIG. 5. The voice data 100 thereby becomes a plurality of pieces of compressed data 120a and 120b in which the voiced sections 101 follow one another without gaps.
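As a rough sketch of this compression step, assuming each piece of divided data is held as a list of fixed-length frames of float samples, the silent frames can simply be filtered out. The helper names `frame_level` and `compress_chunk`, the use of the peak sample as the frame level, and the threshold value are hypothetical choices, not details given in the text.

```python
def frame_level(frame):
    """Level of one frame, taken here as its largest absolute sample value."""
    return max((abs(s) for s in frame), default=0.0)


def compress_chunk(frames, threshold=0.05):
    """Drop every frame regarded as silent, keeping the voiced frames in their original order."""
    return [frame for frame in frames if frame_level(frame) >= threshold]
```

The result contains only voiced material placed back to back, which corresponds to the compressed data 120a and 120b of FIG. 5.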

The identification information issuing unit 14 issues identification information for, and attaches it to, each piece of compressed data generated by the compression processing unit 13.
The identification information identifies the individual pieces of compressed data and makes their order ascertainable; it is stored in the identification information storage unit 1A and is referred to during the combination processing executed by the combination processing unit 15.
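The issuing and storage of identification information might look like the following sketch. The record fields mirror those listed for FIG. 2 (identification information, file name, file size, creation date); the ID format with a zero-padded sequence number, and the names `CompressedDataRecord` and `IdentificationStore`, are assumptions chosen so that the order can be read off the ID.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CompressedDataRecord:
    identification: str  # issued identification information
    file_name: str
    file_size: int       # bytes
    created: date


class IdentificationStore:
    """In-memory stand-in for the identification information storage unit 1A."""

    def __init__(self):
        self._records = {}
        self._next_sequence = 1

    def issue(self, job_name, file_name, file_size, created):
        # The sequence number records the position of the chunk before division.
        identification = f"{job_name}-{self._next_sequence:04d}"
        self._next_sequence += 1
        self._records[identification] = CompressedDataRecord(
            identification, file_name, file_size, created)
        return identification

    def lookup(self, identification):
        return self._records[identification]

    def issued_order(self):
        # Dicts preserve insertion order, so this is the order of issue.
        return list(self._records)
```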

The combination processing unit 15 executes processing that generates post-combination text data by joining the pre-combination text data in the order it had before division.
When the voice-to-text device 1 has generated compressed data, it transmits that compressed data to the voice-to-text engine 2 together with a request to convert it into text. When the voice-to-text engine 2 has generated pre-combination text data by converting the compressed data into text, the voice-to-text device 1 receives that pre-combination text data from the voice-to-text engine 2. The pre-combination text data retains the identification information that was attached to the compressed data before conversion, and the combination processing unit 15 refers to the identification information storage unit 1A to arrange the pre-combination text data in the order it had in the voice data and join it.
Post-combination text data, which is the voice data converted into text, is thereby generated.
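The combination step can be sketched as a lookup keyed by identification information. This assumes the engine's replies arrive as (identification, text) pairs in arbitrary order and that the order of issue is available from the store; the function name `combine_texts` and the `separator` parameter are hypothetical.

```python
def combine_texts(received, issued_ids, separator=""):
    """Join pre-combination texts in the order their identification information was issued.

    received:   (identification, text) pairs in arrival order
    issued_ids: identification information in pre-division order
    """
    by_id = dict(received)
    missing = [i for i in issued_ids if i not in by_id]
    if missing:
        raise KeyError(f"no text received yet for: {missing}")
    return separator.join(by_id[i] for i in issued_ids)
```

For example, `combine_texts([("c2", "world"), ("c1", "hello ")], ["c1", "c2"])` returns `"hello world"` even though the second chunk arrived first.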

The communication processing unit 16 is a processing unit that exchanges various data with the voice-to-text engine 2 and the user terminal 3 via the network NW, such as the Internet.
For example, the communication processing unit 16 transmits compressed data to, and receives pre-combination text data from, the voice-to-text engine 2. It also receives voice data from, and transmits post-combination text data to, the user terminal 3.

The voice-to-text engine 2 is a data processing engine that converts voice data into text.
The voice-to-text engine 2 has functional units such as identification means that identifies words in the voice data, and dictionary means in which the voice data and text data of the words to be identified are associated with each other and which serves as the reference for the conversion processing.

The user terminal 3 is a terminal used by a user of the voice-to-text system formed by the voice-to-text device and the voice-to-text engine. Through the user terminal 3, the user obtains text data into which given voice data has been converted.

The user terminal 3 is, for example, a smartphone, tablet, personal computer, or similar terminal. It can exchange data with the voice-to-text device 1 via the network NW, such as the Internet, and can input and output various data.

Next, the flow of the series of processes executed by the voice-to-text device 1 according to this embodiment is described with reference to FIG. 6.
First, the user uses the user terminal 3 to transmit the desired voice data to the voice-to-text device 1 together with a request to convert that voice data into text (S101).

On receiving the voice data, the voice-to-text device 1 adjusts the volume of the voice data to a predetermined volume with the volume adjustment unit 11 (S102).
Once the volume has been adjusted, the division processing unit 12 divides the volume-adjusted voice data at a predetermined unit time to generate divided data (S103).
In the division processing that generates the divided data, the voice data is marked off at break positions set at every predetermined unit time from the start position. When a break position falls in a silent section, the voice data is divided at that silent section; when a break position falls in a voiced section, the voice data is divided at a silent section lying within a fixed time before that voiced section, and divided data is generated.

The compression processing unit 13 then generates compressed data by deleting the silent sections from the divided data generated by the division processing unit 12 (S104).
Once the compressed data has been generated, the identification information issuing unit 14 issues identification information for, and attaches it to, each piece of compressed data (S105).
The identification information identifies the individual pieces of compressed data and makes their order ascertainable; on being issued, it is registered in the identification information storage unit 1A (S106).

The compressed data is transmitted from the voice-to-text device 1 to the voice-to-text engine 2 together with a text conversion request (S107).
On receiving the compressed data, the voice-to-text engine 2 executes text conversion processing on it (S108).
When the text conversion processing has converted the compressed data into text and pre-combination text data has been generated, that pre-combination text data is transmitted to the voice-to-text device 1 (S109).

On receiving the pre-combination text data from the voice-to-text engine 2, the voice-to-text device 1 uses the combination processing unit 15 to generate post-combination text data by joining the pre-combination text data in the order it had before division (S110). In this processing, the combination processing unit 15 refers to the identification information storage unit 1A on the basis of the identification information retained by the pre-combination text data, arranges the pre-combination text data in the order it had in the voice data, and joins it.

The generated post-combination text data is transmitted to the user terminal 3 by the communication processing unit 16 (S111).
The user thereby obtains data in which the desired voice data has been converted into text.
With the voice-to-text device 1 according to this embodiment, the silent sections are deleted in advance within the voice-to-text device, so the voice-to-text engine is never made to process silent sections, and its processing load is reduced.
In addition, the voice data is divided at a silent section located at a break position of the predetermined unit time, or at a silent section preceding that break position, so words are not split partway through and misconversion is prevented.
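Steps S105 to S110 can be tied together in a small sketch. It assumes, beyond what the text states, that the compressed data is sent to the engine in parallel, which is one situation where replies come back out of order and the identification information becomes necessary. The engine is replaced by a stub, `fake_engine`, that upper-cases a string; both that stub and `transcribe_all` are hypothetical names, and no real speech-recognition API is implied.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_engine(compressed):
    """Stand-in for the voice-to-text engine 2: 'transcribes' a string by upper-casing it."""
    return compressed.upper()


def transcribe_all(compressed_chunks, engine=fake_engine, workers=4):
    ids = list(range(len(compressed_chunks)))  # S105: one ID per piece of compressed data
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # S107: send every piece of compressed data together with its ID.
        futures = {pool.submit(engine, chunk): i
                   for i, chunk in zip(ids, compressed_chunks)}
        for future in as_completed(futures):   # S109: replies may arrive in any order
            results[futures[future]] = future.result()
    return "".join(results[i] for i in ids)    # S110: join in pre-division order
```

Calling `transcribe_all(["ab", "cd", "ef"])` yields `"ABCDEF"` regardless of which request completes first.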

Next, a modification of the voice-to-text device according to the embodiment described above is explained with reference to FIG. 7.
In this example, the voice-to-text device 1 described above is incorporated, as a voice-to-text processing unit 41, into a user terminal 4 such as a smartphone or personal computer, and functions as application software.
In the following description, functional units that perform the same functions as those of the voice-to-text device 1 described above are given the same reference numerals as above.

In this embodiment, the user terminal 4 has functional units consisting of the voice-to-text processing unit 41, an input/output processing unit 42, and a communication processing unit 43, and is configured to communicate with the voice-to-text engine 2 via the network NW, such as the Internet.
The voice-to-text processing unit 41 in turn consists of the identification information storage unit 1A, the volume adjustment unit 11, the division processing unit 12, the compression processing unit 13, the identification information issuing unit 14, and the combination processing unit 15.

The input/output processing unit 42 is a functional unit that inputs and outputs various data, and consists of a display, a touch panel, and the like.
The communication processing unit 43 is a functional unit for exchanging data with the voice-to-text engine 2 via the network NW, and is implemented by a browser program or the like.

Next, FIG. 8 shows the flow of processing in which voice data is converted into text in the user terminal according to this example. Processes that are the same as those in the example described above with reference to FIG. 6 are given the same reference signs.
First, when the user requests the user terminal 4 to execute text conversion processing for desired voice data stored in the terminal, the volume adjustment unit 11 adjusts the volume of the voice data to a predetermined volume (S102).
Once the volume has been adjusted, the division processing unit 12 divides the volume-adjusted voice data by a predetermined unit time to generate divided data (S103), and the compression processing unit 13 generates compressed data by deleting the silent parts from the divided data generated by the division processing unit 12 (S104).
The identification information issuing unit 14 issues and attaches identification information to each piece of compressed data (S105), and the identification information is registered in the identification information storage unit 1A as information that identifies the individual pieces of compressed data and makes their order ascertainable (S106).
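As an illustration of steps S105 and S106, the following minimal Python sketch issues an identifier for each piece of compressed data and records its position before division. The class name `IdentificationStore`, the random hexadecimal identifiers, and the dictionary layout are assumptions made for this example; the specification does not state what form the identification information takes.

```python
from dataclasses import dataclass, field
from itertools import count
from uuid import uuid4


@dataclass
class IdentificationStore:
    """Hypothetical stand-in for the identification information storage unit 1A."""

    # Maps each issued identifier to the chunk's position before division.
    order: dict = field(default_factory=dict)
    _counter: count = field(default_factory=count, repr=False)

    def issue(self) -> str:
        """Issue an identifier for one piece of compressed data (S105)
        and register its position (S106)."""
        chunk_id = uuid4().hex                 # assumed identifier format
        self.order[chunk_id] = next(self._counter)
        return chunk_id


store = IdentificationStore()
ids = [store.issue() for _ in range(3)]
positions = [store.order[i] for i in ids]      # positions follow issue order
```

Keeping the position separate from the identifier means the order can be recovered even if the engine returns results out of sequence.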

The compressed data is transmitted from the user terminal 4 to the voice-to-text engine 2 together with a text conversion request (S107).
When the voice-to-text engine 2 executes text conversion of the compressed data (S108), the pre-combination text data is transmitted to the user terminal 4 (S109).

The user terminal 4, having received the pre-combination text data from the voice-to-text engine 2, uses the combination processing unit 15 to generate post-combination text data by combining the pre-combination text data in the pre-division order (S110).
In this way, if the voice-to-text processing unit 41 is installed on the user terminal 4 as application software and made executable, voice data can be transmitted directly from the user terminal 4 to the voice-to-text engine 2 to obtain text data.
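The combination step S110 can be sketched as follows. This is only an illustration: the function name `combine_texts`, the two dictionaries, and the empty default separator are assumptions for the example, not details given in the specification.

```python
def combine_texts(received, order, separator=""):
    """Join pre-combination texts in the pre-division order (S110).

    received: dict mapping chunk identifier -> text returned by the engine
    order:    dict mapping chunk identifier -> position before division
    """
    missing = [cid for cid in order if cid not in received]
    if missing:
        # Assumed behavior: refuse to combine until every chunk has arrived.
        raise ValueError(f"text not yet received for: {missing}")
    # Sort identifiers by their registered position, then concatenate.
    return separator.join(received[cid] for cid in sorted(order, key=order.get))


order = {"a": 0, "b": 1, "c": 2}
received = {"c": "third.", "a": "first.", "b": "second."}  # arrival order differs
combined = combine_texts(received, order)
```

Because the order is looked up from the registered identification information, the result does not depend on the order in which the engine returns the texts.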

In the embodiment described above, not only voice data completed as a single file but also voice data that is still being created can be converted to text as needed.
The processing in this case is described with reference to FIG. 9.
FIG. 9(a) shows voice data in the middle of being generated, such as voice data being recorded. In the voice data 100, break positions are set sequentially from the start position with 50 [sec] as the unit time; in FIG. 9(a), the break position 100b falls in the voiced part 101c that is being created.

In this example, the volume is first adjusted for the 50 [sec] section delimited by the break position 100b, and then the division and subsequent processing described above are executed.
That is, in the example of FIG. 9(a), as described above, the voice data 100 is divided using the silent part 102b, which lies within 10 [sec] before the break position 100b, as the division position, and the divided data 110a is generated as shown in FIG. 9(b).
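The choice of division position can be sketched as below. The function name `choose_split`, the representation of silent parts as `(start, end)` pairs in seconds, and the decision to cut at the midpoint of the chosen silent part are all assumptions for illustration. What happens when no silent part lies within the lookback window is not covered by this passage, so the sketch simply returns `None` in that case.

```python
def choose_split(break_pos, silences, lookback=10.0):
    """Return the time in seconds at which to divide near break_pos.

    silences: list of (start, end) silent intervals, sorted by start time.
    lookback: how far before the break position a silent part may end.
    """
    # If the break position already falls in a silent part, divide there.
    for start, end in silences:
        if start <= break_pos <= end:
            return break_pos
    # Otherwise the break position is in a voiced part: take the latest
    # silent part that ends within `lookback` seconds before it.
    candidates = [(s, e) for s, e in silences
                  if break_pos - lookback <= e <= break_pos]
    if not candidates:
        return None
    start, end = candidates[-1]
    return (start + end) / 2   # assumed: cut in the middle of the silence


silences = [(12.0, 13.0), (44.0, 46.0)]
split_at = choose_split(50.0, silences)   # uses the silence at 44-46 s
```

Cutting inside the silent part leaves a short silent stretch at the end of one chunk and the start of the next, both of which the compression step can then remove.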

From this state, the compression processing unit 13 deletes the silent parts 102a and 102b_1, as shown in FIG. 9(c), and generates the compressed data 120a. The compressed data 120a is converted to text by the voice-to-text engine and transmitted to the user terminal.
Meanwhile, the divided data 110b that follows the divided data 110a is converted to text after waiting for voice data to be generated for the predetermined unit time (50 [sec] in this example) from the break position 100b.
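Deleting silent parts from a chunk could look like the following sketch. It assumes the audio is a list of floating-point samples and treats a run of low-amplitude samples as a silent part. The function name `delete_silence`, the amplitude `threshold`, and the minimum run length `min_run` are illustrative assumptions; this passage does not specify how silence is detected.

```python
def delete_silence(samples, threshold=0.02, min_run=3):
    """Drop every run of at least `min_run` consecutive samples whose
    absolute amplitude is below `threshold`; keep everything else."""
    kept, run = [], []
    for x in samples:
        if abs(x) < threshold:
            run.append(x)          # possible silence, decide when the run ends
            continue
        if len(run) < min_run:
            kept.extend(run)       # too short to count as a silent part
        run = []
        kept.append(x)
    if len(run) < min_run:         # handle a quiet run at the very end
        kept.extend(run)
    return kept


chunk = [0.5, 0.0, 0.0, 0.0, 0.0, -0.4, 0.01, 0.3, 0.0, 0.0, 0.0]
compressed = delete_silence(chunk)
# compressed == [0.5, -0.4, 0.01, 0.3]
```

The minimum run length keeps brief quiet samples inside speech from being removed, so only sustained pauses shorten the data sent to the engine.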

The flow of processing in this example is described with reference to FIG. 10, taking the case of the user terminal 4 described above as an example.
First, the user uses the user terminal 4 to request execution of text conversion processing for, for example, voice data that is being recorded.
In response, the voice-to-text processing unit 41 determines whether the voice data has reached the predetermined length of time (S202). When the voice data has reached the predetermined length of time, the voice data is delimited at that time, and the volume adjustment unit 11 adjusts the volume of the delimited voice data to a predetermined volume (S102).
Once the volume has been adjusted, the division processing unit 12 divides the volume-adjusted voice data by a predetermined unit time to generate divided data (S103), and the compression processing unit 13 generates compressed data by deleting the silent parts from the divided data generated by the division processing unit 12 (S104).
The identification information issuing unit 14 issues and attaches identification information to each piece of compressed data (S105), and the identification information is registered in the identification information storage unit 1A as information that identifies the individual pieces of compressed data and makes their order ascertainable (S106).

The compressed data is transmitted from the user terminal 4 to the voice-to-text engine 2 together with a text conversion request (S107).
When the voice-to-text engine 2 executes text conversion of the compressed data (S108), the pre-combination text data is transmitted to the user terminal 4 (S109).
The user terminal 4, having received the pre-combination text data from the voice-to-text engine 2, uses the combination processing unit 15 to generate post-combination text data by combining the pre-combination text data in the pre-division order (S110).
It is then determined whether the voice data still continues (S202). If voice data that has not yet been converted to text follows, the process returns to S201 and text conversion of the following voice data continues. When the voice data to be converted to text has ended, the series of processes ends.
As a result, voice data that is still being created is also converted to text as needed.
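The loop described for voice data in progress can be sketched as a generator that waits until one unit time of audio has accumulated, converts it, and yields the text combined so far. The names `stream_to_text`, `blocks`, `unit_len`, and `transcribe` are assumptions for this example, and `transcribe` stands in for the whole round trip to the voice-to-text engine. The sketch omits silence-based division and compression, and converting the leftover audio at the end of the recording is an assumed behavior.

```python
def stream_to_text(blocks, unit_len, transcribe):
    """Yield the combined text each time another unit of audio is converted.

    blocks:     iterable of sample lists arriving from a recorder
    unit_len:   number of samples that make up one unit time
    transcribe: callable that converts one unit of samples into text
    """
    buffer, text = [], ""
    for block in blocks:
        buffer.extend(block)
        # Has a full unit time of audio accumulated yet?
        while len(buffer) >= unit_len:
            unit, buffer = buffer[:unit_len], buffer[unit_len:]
            text += transcribe(unit)
            yield text
    if buffer:
        # Recording ended: convert what is left (assumed behavior).
        text += transcribe(buffer)
        yield text


blocks = [["a", "b"], ["c", "d", "e"], ["f"], ["g"]]
results = list(stream_to_text(blocks, 3, lambda unit: "".join(unit)))
# results == ["abc", "abcdef", "abcdefg"]
```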

In the embodiments of the present invention described above, when a break position falls in a voiced part, the voice data is divided at a silent part within a fixed time before that break position. However, the division is not limited to this and may instead be made at a silent part within a fixed time after the break position.

Further, the volume of the voice data need not be adjusted before the divided data is generated; it may instead be adjusted when the divided data is generated or when the compressed data is generated.
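Since the adjustment can be applied at any of these stages, it can be written as a function that works on whatever sample sequence it is given. The sketch below uses peak normalization, which is an assumption: the specification only says that the volume is adjusted to a predetermined volume and does not name a method. The function name `adjust_volume` and the parameter `target_peak` are likewise hypothetical.

```python
def adjust_volume(samples, target_peak=1.0):
    """Scale samples so that the largest absolute amplitude equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)       # empty or all-zero input: nothing to scale
    gain = target_peak / peak
    return [x * gain for x in samples]


whole_file = [0.25, -0.5, 0.125]
adjusted = adjust_volume(whole_file)
# adjusted == [0.5, -1.0, 0.25]
```

The same function could be called on the full recording, on each piece of divided data, or on each piece of compressed data.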

DESCRIPTION OF SYMBOLS
1 Voice-to-text device
11 Volume adjustment unit
12 Division processing unit
13 Compression processing unit
14 Identification information issuing unit
15 Combination processing unit
16 Communication processing unit
1A Identification information storage unit
2 Voice-to-text engine
3 User terminal
4 User terminal
41 Voice-to-text processing unit
42 Input/output processing unit
43 Communication processing unit
NW Network

Claims

A voice-to-text device for converting voice data into text,
the device being configured to be communicable, via a network, with a voice-to-text engine that converts the voice data into text, and comprising:
division processing means for dividing the voice data by a predetermined unit time to generate divided data;
compression processing means for generating compressed data by deleting silent parts from the divided data;
compressed data transmitting means for transmitting the compressed data to the voice-to-text engine;
pre-combination text data receiving means for receiving, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and
combination processing means for generating post-combination text data by combining the pre-combination text data in the pre-division order.

The voice-to-text device according to claim 1, wherein
the division processing means, when a break position at each predetermined unit time from the start position of the voice data falls in a silent part, divides the voice data at that silent part to generate divided data, and, when a break position at each predetermined unit time from the start position of the voice data falls in a voiced part, divides the voice data at a silent part within a predetermined time before that voiced part to generate divided data.

The voice-to-text device according to claim 1 or 2, further comprising
volume adjustment means for adjusting the volume of the voice data to a predetermined volume, wherein
the division processing means divides the voice data adjusted to the predetermined volume by the predetermined unit time to generate divided data.

The voice-to-text device according to any one of claims 1 to 3, further comprising
identification information issuing means for issuing identification information to the compressed data, wherein
the combination processing means generates the post-combination text data by combining the pre-combination text data in the pre-division order based on the identification information.

The voice-to-text device according to any one of claims 1 to 4,
the device being configured to be communicable, via the network, with a user terminal used by a user, and further comprising:
voice data receiving means for receiving the voice data from the user terminal; and
post-combination text data transmitting means for transmitting the post-combination text data to the user terminal.

A voice-to-text method for converting voice data into text, wherein
a computer configured to be communicable, via a network, with a voice-to-text engine that converts the voice data into text executes:
a division process of dividing the voice data by a predetermined unit time to generate divided data;
a compression process of generating compressed data by deleting silent parts from the divided data;
a compressed data transmission process of transmitting the compressed data to the voice-to-text engine;
a pre-combination text data reception process of receiving, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and
a combination process of generating post-combination text data by combining the pre-combination text data in the pre-division order.

A computer program for converting voice data into text,
the computer program causing a computer configured to be communicable, via a network, with a voice-to-text engine that converts the voice data into text to execute:
a division process of dividing the voice data by a predetermined unit time to generate divided data;
a compression process of generating compressed data by deleting silent parts from the divided data;
a compressed data transmission process of transmitting the compressed data to the voice-to-text engine;
a pre-combination text data reception process of receiving, from the voice-to-text engine, pre-combination text data obtained by converting the compressed data into text; and
a combination process of generating post-combination text data by combining the pre-combination text data in the pre-division order.