JP7088645B2 - Data converter

Info

Publication number: JP7088645B2
Application number: JP2017179920A
Authority: JP
Inventors: 知優志田
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2017-09-20
Filing date: 2017-09-20
Publication date: 2022-06-21
Anticipated expiration: 2037-09-20
Also published as: JP2019056746A

Description

The present invention relates to a data conversion device.

Conventionally, techniques for converting sound data, including human speech, into text data have been studied. With such techniques, conversations and speeches conducted in a given language can be recorded and their content converted into text. These techniques have been applied to the automatic creation of minutes and to preprocessing for translation.

Regarding the automatic creation of minutes, Patent Document 1 below describes an automatic minutes creation system that creates minutes by converting voice data, in which the voices of a plurality of speakers are encoded, into text information. The system calculates a processing priority for each task from the conversion progress, which is the proportion of the whole voice data that has already been converted into text information, and from the desired completion date. It then adjusts the accuracy of the minutes by selecting at least one field-specific dictionary and its vocabulary size based on the conversion progress and the processing priority.

Japanese Patent No. 4703385

In recent years, language models capable of converting sound data into text data with high accuracy have come to be provided as cloud services, and the technology for converting sound data into text data is becoming easy to use. Such a service accepts sound data over the Internet, generates text data from the input sound data using a language model stored on a server, and returns the resulting text data to the client over the Internet.

Cloud services that convert sound data into text data are highly convenient, but because they may be available to arbitrary users, they can be difficult to use when the sound data to be converted contains confidential information. For example, when a conversation that must not leak outside is held in a meeting, converting the recording of the meeting into text data through a cloud service may not be recommended from a security standpoint.

Therefore, an object of the present invention is to provide a data conversion device capable of converting sound data into text data while achieving both convenience and security.

A data conversion device according to one aspect of the present invention includes: a determination unit that determines whether a predetermined sound is included in a series of input sounds; a storage unit that stores at least the sound data of a section of the series of sounds that is specified based on the predetermined sound; a transmission unit that, when the determination unit determines that the series of sounds includes the predetermined sound, transmits the sound data of the section to a server that generates text data based on sound data; and a reception unit that receives, from the server, text data generated based on the sound data of the section.

According to this aspect, the sound data of the section specified based on the predetermined sound is transmitted to the server, and the input sound data as a whole is not. Thus, even when sound that must not leak outside is input, the section to be converted into text data can be limited, and sound data can be converted into text data while achieving both the convenience of a cloud service and security.

In the above aspect, when the determination unit determines that the series of sounds includes the predetermined sound, the storage unit may store at least part of the series of sounds input after the predetermined sound as the sound data of the section.

According to this aspect, storing at least part of the series of sounds input after the predetermined sound as the sound data of the section reduces the volume of sound data that the storage unit must hold. Because there is no need to check whether the stored sound data contains the predetermined sound, the computational load can also be reduced.

In the above aspect, the storage unit may store the data of the series of sounds, and the device may further include an extraction unit that extracts, from the data of the series of sounds stored in the storage unit, at least part of the series of sounds input after the predetermined sound as the sound data of the section.

According to this aspect, because the data of the input series of sounds is stored and at least part of the series of sounds input after the predetermined sound is extracted, data other than the sound data of the extracted section can also be selected afterward, transmitted to the server, and converted into text.

In the above aspect, the device may further include a division unit that divides the sound data of the section into a plurality of pieces of sound data. The transmission unit may rearrange the order of the plurality of pieces of sound data and transmit them to the server, and the reception unit may receive a plurality of pieces of text data generated based on the plurality of pieces of sound data. The device may further include a synthesis unit that combines the plurality of pieces of text data into a single piece of text data based on the rearrangement of the order performed by the transmission unit.

According to this aspect, the sound data of the section specified based on the predetermined sound is divided into a plurality of pieces of sound data, which are transmitted to the server in a rearranged order. This can prevent the content of the transmitted sound data from being read by a third party.

In the above aspect, the transmission unit may distribute the plurality of pieces of sound data among a plurality of servers that generate text data based on sound data.

According to this aspect, the sound data of the section specified based on the predetermined sound is divided into a plurality of pieces of sound data, which are distributed among a plurality of servers. This makes it difficult to reconstruct the whole content from the sound data sent to any one server, further reducing the risk that the content of the sound data is read by a third party.

According to the present invention, it is possible to provide a data conversion device capable of converting sound data into text data while achieving both convenience and security.

A diagram showing the network configuration of the data conversion device according to an embodiment of the present invention.
A diagram showing the physical configuration of the data conversion device according to the present embodiment.
A diagram showing the functional blocks of the data conversion device according to the present embodiment.
A diagram showing an example of a section of sound data specified by the data conversion device according to the present embodiment.
A flowchart of the first process executed by the data conversion device according to the present embodiment.
A diagram showing an example of minutes updated by the data conversion device according to the present embodiment.
A flowchart of the second process executed by the data conversion device according to the present embodiment.
A diagram showing another example of a section of sound data specified by the data conversion device according to the present embodiment.
A diagram showing an example in which a section of sound data is designated using the data conversion device according to the present embodiment.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the figures, elements with the same reference numerals have the same or similar configurations.

FIG. 1 is a diagram showing the network configuration of a data conversion device 10 according to an embodiment of the present invention. The data conversion device 10 cuts out the sound data of a predetermined section from a series of sounds input through an input unit such as a microphone, and transmits the sound data of that section via a communication network N to at least one of a first voice recognition server 20, a second voice recognition server 30, and a third voice recognition server 40. The first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40 are servers that generate text data based on received sound data, and they return the generated text data to the data conversion device 10.

Here, the communication network N is a wired or wireless communication network and may be, for example, the Internet. The first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40 may be servers that provide, via the communication network N and in the form of a so-called public cloud, a service for converting voice data into text data. That is, they may be servers that provide such a service without restricting who may use it. Although this example shows three servers connected to the communication network N, the number of voice recognition servers available as a public cloud is not limited to three and may be any number. The data conversion device 10 may also be connected to a server that provides a service for converting voice data into text data in the form of a private cloud (that is, a form in which users are restricted), not only a public cloud.

The data conversion device 10 cuts out, for example, from the sound data recorded at a meeting, the sound data of a section that contains content to be recorded in the minutes. The data conversion device 10 does not transmit the entire sound data recorded at the meeting to the first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40. Instead, it cuts out the sound data of the section that contains content to be recorded in the minutes and transmits the sound data of that section to at least one of the three servers. Because only the sound data of the section to be converted into text is sent to a server, rather than the entire recording, the section converted into text data can be limited even when a conversation that must not leak outside takes place in the meeting. Sound data can thus be converted into text data while achieving both the convenience of a cloud service and security.

FIG. 2 is a diagram showing the physical configuration of the data conversion device 10 according to the embodiment of the present invention. The data conversion device 10 includes a CPU (Central Processing Unit) 10a corresponding to a hardware processor, a RAM (Random Access Memory) 10b corresponding to memory, a ROM (Read Only Memory) 10c corresponding to memory, a communication unit 10d, an input unit 10e, and a display unit 10f. These components are connected to one another via a bus so that they can exchange data. Although this example describes the case where the data conversion device 10 consists of a single computer, the data conversion device 10 may be realized using a plurality of computers.

The CPU 10a is a control unit that controls the execution of programs stored in the RAM 10b or the ROM 10c and that computes and processes data. The CPU 10a is an arithmetic unit that executes a program for controlling the conversion from sound data to text data (a data conversion program). The CPU 10a receives various input data from the input unit 10e and the communication unit 10d, and displays the results of computation on the input data on the display unit 10f or stores them in the RAM 10b or the ROM 10c.

The RAM 10b is a rewritable storage unit and consists of, for example, a semiconductor memory element. The RAM 10b stores programs, such as applications executed by the CPU 10a, and data.

The ROM 10c is a read-only storage unit and consists of, for example, a semiconductor memory element. The ROM 10c stores programs, such as firmware, and data.

The communication unit 10d is an interface that connects the data conversion device 10 to the communication network N. It is connected to a communication network N such as a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet, made up of wired or wireless data transmission paths.

The input unit 10e accepts data input from the user and includes, for example, a microphone, a keyboard, a mouse, and a touch panel.

The display unit 10f visually displays the results of computation by the CPU 10a and consists of, for example, an LCD (Liquid Crystal Display).

The data conversion program may be provided stored on a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via a communication network connected through the communication unit 10d. In the data conversion device 10, the CPU 10a executes the data conversion program to realize the various functions described with reference to the next figure. These physical components are examples and need not be independent components. For example, the data conversion device 10 may include an LSI (Large-Scale Integration) in which the CPU 10a is integrated with the RAM 10b and the ROM 10c.

FIG. 3 is a diagram showing the functional blocks of the data conversion device 10 according to the present embodiment. The data conversion device 10 includes a determination unit 11, a sound data storage unit 12, a transmission unit 13, an extraction unit 14, a division unit 15, a reception unit 16, a synthesis unit 17, a correction unit 18, and a minutes storage unit 19. Although this example describes the case where these functional units are realized by a single computer, they may be realized by a plurality of computers.

The determination unit 11 determines whether a predetermined sound is included in the series of sounds input through the input unit 10e. The predetermined sound may be any sound that is set in advance, for example the sound of a physical bell being rung or an electronically synthesized sound. The determination unit 11 may be a trained model, such as an RNN (Recurrent Neural Network), trained in advance to recognize the predetermined sound. A user of the data conversion device 10 in a meeting can sound the predetermined sound before a statement that should be recorded in the minutes is made, thereby designating that what is said afterward is to be converted into text.

The predetermined sound may also be an utterance that follows a predetermined rule. For example, the determination unit 11 may determine whether the predetermined sound is included according to whether the phrase "Minutes, please" is spoken after about two seconds of silence. In this case as well, the determination unit 11 may be an RNN or the like trained in advance to recognize the predetermined sound. The predetermined sound may be configurable for each user: it may be set per speaker or per system user so that each user can designate content for the minutes at any time. Likewise, in the example described later in which text obtained by voice recognition is inserted into the minutes, minutes may be stored for each user. Voice recognition is then performed for the user who uttered the predetermined sound, and the recognition result for that user is inserted into that user's minutes, so that each user holds individual minutes. It is also possible for only the person in charge of the minutes to edit the minutes he or she holds and then upload them to a file server for sharing.

The sound data storage unit 12 stores at least the sound data of a section, specified based on the predetermined sound, of the series of sounds input through the input unit 10e. When the determination unit 11 determines that the series of sounds includes the predetermined sound, the sound data storage unit 12 may store at least part of the series of sounds input after the predetermined sound as the sound data of the section specified based on the predetermined sound. For example, the sound data storage unit 12 may store the sound data of the section that begins after the predetermined sound and lasts until the predetermined sound is input again. In this case, the sound data storage unit 12 may store the sound data so that the predetermined sound itself is not included in the recorded section. Storing at least part of the series of sounds input after the predetermined sound as the sound data of the section reduces the volume of sound data that the storage unit must hold, and because there is no need to determine afterward whether the stored sound data contains the predetermined sound, the computational load can be reduced. The sound that marks the start of the section and the sound that marks its end may be the same sound or different sounds. The end of the section may also be determined by the time elapsed since the predetermined sound marking the start of the section was input.
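The storage behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the audio arrives as fixed-size chunks and that a separate detector has already labeled each chunk with a hypothetical `is_trigger` flag. The same trigger both opens and closes a section, and trigger chunks are never stored.

```python
from typing import Iterable, List, Tuple

# (audio chunk, is_trigger); is_trigger is the output of a hypothetical detector.
Frame = Tuple[bytes, bool]


def capture_sections(frames: Iterable[Frame]) -> List[List[bytes]]:
    """Keep only the audio between a start trigger and the next trigger."""
    sections: List[List[bytes]] = []
    current: List[bytes] = []
    recording = False
    prev_trigger = False
    for chunk, is_trigger in frames:
        if is_trigger:
            # Toggle only on the rising edge so a multi-chunk bell counts once.
            if not prev_trigger:
                if recording and current:
                    sections.append(current)
                current = []
                recording = not recording
        elif recording:
            current.append(chunk)
        prev_trigger = is_trigger
    if recording and current:
        # Input ended before a closing trigger; keep what was captured.
        sections.append(current)
    return sections
```

Only the chunks inside a section are retained, so nothing spoken before the first trigger or after the closing trigger is ever held for transmission.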

Alternatively, the sound data storage unit 12 may store the data of the entire series of sounds input through the input unit 10e. The extraction unit 14 extracts, from the data of the series of sounds stored in the sound data storage unit 12, at least part of the series of sounds input after the predetermined sound as the sound data of the section specified based on the predetermined sound. The extraction unit 14 may extract the sound data of the section that begins after the predetermined sound and lasts until the predetermined sound is input again. In this case, the extraction unit 14 may extract the sound data so that the predetermined sound itself is not included in the extracted section. Because the whole series of input sounds is stored and only part of it is extracted, data other than the sound data of the extracted section also remains in the storage unit and can be converted into text afterward, which allows more flexible conversion of sound data into text.
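As a rough sketch of this alternative, the full recording can be kept and sections sliced out of it after the fact. The representation below is an assumption made for illustration: `samples` is the stored recording, and `trigger_spans` holds the hypothetical `(start, end)` sample indices at which the predetermined sound was detected, with `end` exclusive.

```python
from typing import List, Sequence, Tuple

# (start, end) sample indices of one detected trigger, end exclusive.
Span = Tuple[int, int]


def extract_sections(samples: Sequence, trigger_spans: List[Span]) -> List[Sequence]:
    """Slice out the audio between each opening trigger and the next trigger."""
    sections = []
    spans = sorted(trigger_spans)
    for i in range(0, len(spans), 2):
        open_end = spans[i][1]
        # An unpaired final trigger opens a section that runs to the end.
        close_start = spans[i + 1][0] if i + 1 < len(spans) else len(samples)
        sections.append(samples[open_end:close_start])
    return sections
```

Because `samples` is left untouched, any other range of the recording can still be selected and sent for conversion later.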

When the determination unit 11 determines that the series of sounds includes the predetermined sound, the transmission unit 13 transmits the sound data of the section specified based on the predetermined sound to a server that generates text data based on sound data (the first voice recognition server 20, the second voice recognition server 30, or the third voice recognition server 40). Note that FIG. 3 does not show the servers and shows the communication network N instead.

The reception unit 16 receives, from the server, the text data generated based on the sound data of the section specified based on the predetermined sound.

The division unit 15 divides the sound data of the section specified based on the predetermined sound into a plurality of pieces of sound data. The transmission unit 13 may rearrange the order of the plurality of pieces of sound data before transmitting them to the server. In this case, the reception unit 16 receives a plurality of pieces of text data generated based on the plurality of pieces of sound data. The synthesis unit 17 then combines the received pieces of text data into a single piece of text data based on the rearrangement of the order performed by the transmission unit 13. Dividing the sound data of the section into a plurality of pieces and transmitting them to the server in a rearranged order can prevent the content of the transmitted sound data from being read by a third party.
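The divide, shuffle, and recombine flow can be sketched as below. This is an illustrative sketch under stated assumptions: `recognize` stands in for a call to a voice recognition server and is hypothetical, the pieces are cut at fixed positions (a real system would more likely cut at pauses so that words are not split), and the recognized texts are joined with spaces.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_section(samples: Sequence, n_parts: int) -> List[Sequence]:
    """Divide a section into at most n_parts contiguous pieces."""
    if not samples or n_parts < 1:
        return []
    size = -(-len(samples) // n_parts)  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def transcribe_shuffled(
    parts: List[Sequence],
    recognize: Callable[[Sequence], str],
    rng: random.Random,
) -> Tuple[str, List[int]]:
    """Send pieces in a shuffled order, then restore the original order."""
    send_order = list(range(len(parts)))
    rng.shuffle(send_order)
    texts = [""] * len(parts)
    for original_index in send_order:
        # The server sees pieces in send_order; the result is filed under
        # the piece's original position so the text can be reassembled.
        texts[original_index] = recognize(parts[original_index])
    return " ".join(texts), send_order
```

Keeping the permutation on the device is the key design point: the server never learns the original order, while the device can always undo the shuffle.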

The transmission unit 13 may also distribute the plurality of pieces of sound data obtained by the division unit 15 among a plurality of servers that generate text data based on sound data. In the present embodiment, the transmission unit 13 may distribute the divided pieces of sound data among the first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40. Dividing the sound data of the section specified based on the predetermined sound into a plurality of pieces and distributing them among a plurality of servers makes it difficult to reconstruct the whole content from the portion of the sound data sent to any one server, further reducing the risk that the content of the sound data is read by a third party. The divided sound data may be distributed among the servers at random, or preferentially to the server that has produced good-quality results when recognizing past voice data. When distributing preferentially based on quality, the server may be chosen for each user who is a speaker. Specifically, if server α has produced the best past voice recognition results for a user A, sound data relating to user A is preferentially distributed to server α, and if server β has produced the best past voice recognition results for another user B, sound data relating to user B may be preferentially distributed to server β.
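One way to picture the per-user preference is the sketch below. The `quality` table, its score scale, and the `explore` parameter are all assumptions introduced for illustration; the source only says that distribution may be random or may favor the server with the best past results for each speaker.

```python
import random
from typing import Dict, List, Tuple


def choose_server(
    user: str,
    quality: Dict[Tuple[str, str], float],
    servers: List[str],
    rng: random.Random,
    explore: float = 0.0,
) -> str:
    """Pick the server with the best past score for this user.

    quality maps (user, server) to a hypothetical past-accuracy score.
    With probability `explore`, or when no history exists for the user,
    a server is chosen at random instead.
    """
    scores = {s: quality[(user, s)] for s in servers if (user, s) in quality}
    if not scores or rng.random() < explore:
        return rng.choice(servers)
    return max(scores, key=scores.get)
```

Note the trade-off: always sending one speaker's pieces to the same server improves recognition quality but concentrates that speaker's audio on one provider, so a nonzero `explore` value is one hedge between the two goals.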

The correction unit 18 corrects words in the obtained text data based on the confidence score that the server that performed the conversion outputs for each word. The first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40 can convert sound data into text data for general vocabulary, but when the sound data contains words that are not in general use, such as in-house terms, they may have difficulty converting it correctly. The correction unit 18 includes a language model, such as an RNN, trained using the past minutes stored in the minutes storage unit 19 as training data, and corrects words with low confidence scores to the correct words. As a result, minutes with more accurate content can be created even when the sound data contains words that are not in general use, such as in-house terms, and the servers have difficulty converting it into text correctly.
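The confidence-gated correction can be illustrated with a much simpler stand-in for the language model. The sketch below is not the RNN described above: it swaps in fuzzy string matching from Python's standard `difflib` against a vocabulary that, in this illustration, is assumed to have been collected from past minutes. The threshold value and the `(word, confidence)` input format are also assumptions.

```python
import difflib
from typing import List, Sequence, Tuple


def correct_words(
    words: Sequence[Tuple[str, float]],
    vocabulary: List[str],
    threshold: float = 0.6,
) -> List[str]:
    """Replace low-confidence words with the closest known vocabulary entry.

    words holds (word, confidence) pairs as reported by the recognizer.
    Words at or above the threshold are trusted and left unchanged.
    """
    corrected = []
    for word, confidence in words:
        if confidence < threshold:
            matches = difflib.get_close_matches(word, vocabulary, n=1)
            corrected.append(matches[0] if matches else word)
        else:
            corrected.append(word)
    return corrected
```

Gating on the server's confidence keeps the correction step from rewriting words the recognizer was already sure about.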

The minutes storage unit 19 stores the text data obtained by converting the sound data in the form of minutes. The data conversion device 10 detects characters or symbols representing an agenda item in the text data obtained by converting the sound data, and appends the new text data at the appropriate place in the minutes stored in the minutes storage unit 19. The data conversion device 10 also detects characters representing a person's name in the text data, and adds the name of the person in charge to the appropriate agenda item of the minutes stored in the minutes storage unit 19. These processes are described in detail later with reference to FIG. 6.

FIG. 4 is a diagram showing an example of sections of sound data specified by the data conversion device 10 according to the present embodiment. The figure shows the waveform of sound data recorded at a meeting. The sound data in this example includes a first section A1, a second section A2, and a third section A3. The first section A1 corresponds to remarks that do not need to be recorded in the minutes, the second section A2 corresponds to the sound of a bell preset as the predetermined sound, and the third section A3 corresponds to remarks that need to be recorded in the minutes.

In the data conversion device 10, the determination unit 11 determines whether or not the input series of sounds includes the predetermined sound. Whether the predetermined sound is included may be determined by whether the waveform of the predetermined sound is included. In this example, the determination unit 11 determines that the predetermined sound is included when the waveform of the second section A2 is input.
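One way to check whether a known waveform appears in an input signal is normalized cross-correlation, sketched below in plain Python. The text does not say which matching method the determination unit 11 uses, so the technique, the function name `find_predetermined_sound`, the threshold, and the toy sample values are all assumptions.

```python
import math

def find_predetermined_sound(signal, pattern, threshold=0.9):
    """Return the first index where pattern matches signal, or -1 if absent.

    Each window of the signal is compared with the pattern using normalized
    cross-correlation, which lies in [-1, 1] and ignores overall loudness.
    """
    n = len(pattern)
    p_mean = sum(pattern) / n
    p = [x - p_mean for x in pattern]
    p_norm = math.sqrt(sum(x * x for x in p))
    if p_norm == 0:
        return -1
    for start in range(len(signal) - n + 1):
        window = signal[start:start + n]
        w_mean = sum(window) / n
        w = [x - w_mean for x in window]
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0:
            continue  # a flat window cannot match
        score = sum(a * b for a, b in zip(w, p)) / (w_norm * p_norm)
        if score >= threshold:
            return start
    return -1

# Toy data: a short "bell" pattern embedded after four samples of other sound.
bell = [0.0, 1.0, -1.0, 0.5, -0.5, 0.25]
signal = [0.0, 0.1, 0.0, -0.1] + bell + [0.0, 0.1]
print(find_predetermined_sound(signal, bell))
```

Real audio would need far longer windows and a faster method (for example FFT-based correlation); the loop above is quadratic and only meant to show the idea.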

When the determination unit 11 determines that the series of sounds includes the predetermined sound, the sound data storage unit 12 stores the sound of the third section A3, which is input after the predetermined sound, as the sound data of the section specified based on the predetermined sound. The data conversion device 10 then transmits the sound data of the third section A3 to the first voice recognition server 20 or the like and receives text data corresponding to its content.

Before transmitting the sound data of the third section A3 to the first voice recognition server 20 or the like, the data conversion device 10 may also divide it into a plurality of sound data and either send them to a server in a changed order or distribute them among a plurality of servers. In this case, several division methods may be tried, and the one that gives good reliability for the conversion to text data by the first voice recognition server 20 or the like may be adopted. For example, the reliability of the text conversion may be compared between the case where the sound data of the third section A3 is divided into 3 equal parts and each part is recognized by the first voice recognition server 20 or the like, and the case where it is divided into 10 equal parts and each part is recognized, and the number of divisions with the higher reliability may be adopted. This improves the accuracy of voice recognition by the servers and yields more accurate text, so that both accuracy of text conversion and security can be achieved.
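The comparison between division counts can be sketched as follows. The `recognize` callable stands in for a voice recognition server and is hypothetical, as is the use of the mean reliability over the pieces as the comparison score; the text only says that the reliabilities are compared.

```python
def split_evenly(samples, parts):
    """Split samples into `parts` contiguous pieces of near-equal length."""
    size, extra = divmod(len(samples), parts)
    pieces, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        pieces.append(samples[start:end])
        start = end
    return pieces

def best_division(samples, candidates, recognize):
    """Return the candidate division count with the highest mean reliability.

    recognize(piece) must return a (text, reliability) pair.
    """
    def mean_reliability(parts):
        pieces = split_evenly(samples, parts)
        return sum(recognize(piece)[1] for piece in pieces) / len(pieces)
    return max(candidates, key=mean_reliability)

# Stand-in recognizer: pretends that very short pieces are recognized poorly.
def fake_recognize(piece):
    return ("...", min(1.0, len(piece) / 40))

samples = [0.0] * 120
print(best_division(samples, [3, 10], fake_recognize))
```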

The points at which the sound data is divided may also be determined according to the amplitude of the waveform. For example, the sound data may be divided at the center of a section in which its amplitude is at or below a predetermined value. Compared with dividing the sound data into a predetermined number of sections or into sections of a predetermined width, this improves the accuracy of voice recognition by the servers and yields more accurate text, so that both accuracy of text conversion and security can be achieved.
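Splitting at the center of low-amplitude stretches can be sketched as below. The threshold, the minimum run length `min_run`, and the choice to ignore quiet runs at the very start or end of the data are assumptions made for this illustration.

```python
def quiet_split_points(samples, threshold, min_run=1):
    """Return split indices at the centre of each quiet run.

    A quiet run is a stretch of consecutive samples with |amplitude| <= threshold.
    Runs touching the start or end of the data are ignored, since splitting
    there would only cut off silence.
    """
    points = []
    run_start = None
    for i, value in enumerate(samples):
        if abs(value) <= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if run_start > 0 and i - run_start >= min_run:
                points.append((run_start + i) // 2)
            run_start = None
    return points

def split_at(samples, points):
    """Cut samples into contiguous pieces at the given indices."""
    pieces, start = [], 0
    for point in points:
        pieces.append(samples[start:point])
        start = point
    pieces.append(samples[start:])
    return pieces

samples = [0.5, 0.6, 0.0, 0.0, 0.0, 0.0, 0.7, 0.8, 0.01, 0.02, 0.9]
points = quiet_split_points(samples, threshold=0.05, min_run=2)
print(points)
print(split_at(samples, points))
```

Because the cuts fall in pauses rather than in the middle of words, each piece is more likely to contain whole words for the server to recognize.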

FIG. 5 is a flowchart of the first process executed by the data conversion device 10 according to the present embodiment. In the first process, when the input series of sounds is determined to include the predetermined sound, the sound data of the section specified based on the predetermined sound is stored and converted into text, and the minutes are updated.

The data conversion device 10 acquires input sound data through the input unit 10e (S10). The determination unit 11 determines whether or not the input series of sounds includes the predetermined sound (S11). If it does not (S11: No), the device continues acquiring sound data and determining whether the predetermined sound is included.

On the other hand, if the input series of sounds includes the predetermined sound (S11: Yes), the sound data storage unit 12 stores the series of sounds input after the predetermined sound as the sound data of the section specified based on the predetermined sound (S12).
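Steps S10 to S12 amount to a small state machine: discard input until the predetermined sound is detected, then store what follows. A minimal sketch, assuming the input arrives as a sequence of frames and that a hypothetical `is_trigger` predicate plays the role of the determination unit 11:

```python
def record_after_trigger(frames, is_trigger):
    """Return the frames that arrive after the first trigger frame.

    Frames before the trigger, and the trigger frame itself, are not stored.
    """
    stored = []
    triggered = False
    for frame in frames:            # S10: acquire input sound
        if not triggered:
            triggered = is_trigger(frame)   # S11: predetermined sound?
            continue
        stored.append(frame)        # S12: store the section after the trigger
    return stored

# Frames are labelled strings here purely for readability.
frames = ["chat", "chat", "BELL", "decision", "owner"]
print(record_after_trigger(frames, lambda frame: frame == "BELL"))
```

The text does not say how the end of the section is decided, so this sketch simply stores everything until the input ends.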

The division unit 15 divides the sound data of the specified section into a plurality of sound data (S13). The transmission unit 13 rearranges the order of the plurality of sound data and transmits them to one or more of the first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40 (S14). That is, the transmission unit 13 may both rearrange the order of the plurality of sound data and distribute them among a plurality of servers.

The reception unit 16 receives, from the one or more servers, a plurality of text data obtained by converting the plurality of sound data into text (S15). The synthesis unit 17 rearranges the plurality of text data, based on how the sound data was reordered and distributed among the servers, and synthesizes them into one text data (S16).
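Steps S14 to S16 only work if the device remembers how it shuffled the pieces. The sketch below keeps a transmission plan of (original index, server) pairs and uses it to put the returned texts back in order. The function names, the random server choice, and the stand-in recognizer are hypothetical.

```python
import random

def plan_transmission(pieces, servers, rng):
    """Shuffle the piece order and assign each piece to a server.

    Returns a list of (original_index, server) pairs in sending order.
    """
    order = list(range(len(pieces)))
    rng.shuffle(order)
    return [(index, rng.choice(servers)) for index in order]

def merge_texts(plan, texts, sep=" "):
    """texts[i] is the result for plan[i]; restore the original order and join."""
    by_index = {index: text for (index, _server), text in zip(plan, texts)}
    return sep.join(by_index[i] for i in range(len(plan)))

def fake_recognize(server, piece):
    # Stand-in for a voice recognition server: "recognizes" by upper-casing.
    return piece.upper()

pieces = ["abc", "parameter", "is", "1000"]
plan = plan_transmission(pieces, ["server20", "server30", "server40"], random.Random(0))
texts = [fake_recognize(server, pieces[index]) for index, server in plan]
merged = merge_texts(plan, texts)
print(merged)
```

For Japanese text the separator would normally be an empty string, which is why `sep` is a parameter.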

The correction unit 18 replaces words with low reliability with words estimated to be appropriate, based on the reliability of the text conversion by the one or more servers (S17). The correction by the correction unit 18 is preferably performed after the synthesis unit 17 has synthesized the one text data. If words were corrected while the text was still split into the plurality of text data corresponding to the divided sound data, the context of the sentences would be unclear and appropriate correction could be difficult.

The data conversion device 10 identifies where to add to the minutes based on predetermined characters in the obtained text data (S18). For example, it recognizes a character string or symbol representing a specific agenda item and appends the obtained text data to the part of the minutes where that agenda item is described. The data conversion device 10 also identifies the person in charge based on a person's name in the obtained text data (S19), and may add the identified name to the minutes as the person in charge of the corresponding agenda item.

Finally, the data conversion device 10 adds the current date to the minutes as the entry date and updates the minutes (S20). The time at which the meeting was held may be added as well as the date. This completes the first process.

FIG. 6 is a diagram showing an example of minutes D updated by the data conversion device 10 according to the present embodiment. The minutes D in this example contain entries written on July 1 and July 3, and were most recently updated on July 4. The minutes D include entries for a first agenda item D1 named "#1000" and a second agenda item D2 named "#2517 Verification of certificates in the correct manner".

The first agenda item D1 includes the entries "→ First, write it in the design document (noted 7/1)" and "→ To be discussed tomorrow (noted 7/3)". This shows that, as of July 3, it had been decided to discuss the first agenda item D1 on the following day, July 4. The entry "→ The ABC parameter is set to 1000 (noted 7/4)" has then been added to the first agenda item D1.

Such an entry is added, for example, as follows. Suppose that various discussions take place at the meeting and a conclusion is reached on what value the "ABC parameter" should take. When the bell corresponding to the predetermined sound is rung at that point, the data conversion device 10 stores the sound data of the statement made after the predetermined sound was input, "Sharp 1000, the ABC parameter is set to 1000", transmits it to the first voice recognition server 20 or the like, and receives text data of its content. Based on the character string "Sharp 1000", the device then appends the text data "The ABC parameter is set to 1000" to the place where the first agenda item D1, named "#1000", is described. At this time it adds July 4 (7/4), the date on which the meeting was held.

The second agenda item D2 includes the entries "→ Release scheduled for 7/12. The procedure has been shared with app T. (noted 7/1)", "→ [Mr. A's homework] Write it in the quality control committee's release schedule (noted 7/3)", and "→ Written; please carry it out today. (added 7/4)".

Such an entry is added, for example, as follows. Suppose that various discussions take place at the meeting, and it is decided that the task "write it in the quality control committee's release schedule" needs to be carried out and that "Mr. A" will be the person in charge. When the bell corresponding to the predetermined sound is rung at that point, the data conversion device 10 stores the sound data of the statement made after the predetermined sound was input, "Sharp 2517, Mr. A's homework, write it in the quality control committee's release schedule", transmits it to the first voice recognition server 20 or the like, and receives text data of its content. Based on the character string "Sharp 2517", the device then appends the text data "Write it in the quality control committee's release schedule" to the place where the second agenda item D2, named "#2517", is described. Based on the character string "Mr. A's homework", it also adds "[Mr. A's homework]" to make clear who is in charge of the task. It then adds July 3 (7/3), the date on which the meeting was held.
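The two worked examples above can be sketched with regular expressions. Everything concrete here is an assumption for illustration: the English phrasing "Sharp NNNN, ..." and "NAME homework, ...", the dictionary used to hold the minutes, the entry format, and the year in the dates.

```python
import re
from datetime import date

# Hypothetical spoken formats: "Sharp <number>, <body>" and "<name> homework, <task>".
AGENDA = re.compile(r"^sharp\s*(\d+)\s*,\s*(.*)$", re.IGNORECASE)
OWNER = re.compile(r"^(.+?)\s+homework\s*,\s*(.*)$", re.IGNORECASE)

def append_to_minutes(minutes, text, today):
    """Append recognized text under the agenda item it names.

    minutes maps agenda ids such as "#1000" to lists of entries.
    Returns False if the text does not start with an agenda marker.
    """
    match = AGENDA.match(text.strip())
    if match is None:
        return False
    agenda_id, body = "#" + match.group(1), match.group(2)
    owner = OWNER.match(body)
    if owner:
        # Make the person in charge visible at the front of the entry.
        body = "[{}'s homework] {}".format(owner.group(1), owner.group(2))
    entry = "-> {} (noted {}/{})".format(body, today.month, today.day)
    minutes.setdefault(agenda_id, []).append(entry)
    return True

minutes = {"#1000": [], "#2517": []}
append_to_minutes(minutes, "Sharp 2517, Mr. A homework, write it in the release schedule",
                  date(2017, 7, 3))
append_to_minutes(minutes, "Sharp 1000, ABC parameter is 1000", date(2017, 7, 4))
print(minutes)
```

Recognizing real person names in free text is much harder than matching a fixed phrase; the fixed "homework" keyword is only a convenient stand-in.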

In this way, the data conversion device 10 according to the present embodiment can identify which agenda item the meeting relates to and append to the appropriate place in the minutes. This reduces the workload of the person preparing the minutes. It can also identify the name of the person in charge and add it at the appropriate place in the minutes, which helps the work proceed smoothly.

FIG. 7 is a flowchart of the second process executed by the data conversion device 10 according to the present embodiment. In the second process, the input series of sounds is stored, and when the stored series of sounds is determined to include the predetermined sound, the sound data of the section specified based on the predetermined sound is extracted and converted into text, and the minutes are updated.

The data conversion device 10 stores the sound data input through the input unit 10e in the sound data storage unit 12 (S30). The sound data may be stored in the sound data storage unit 12 continuously during the meeting. The determination unit 11 determines whether or not the stored series of sounds includes the predetermined sound (S31). If it does not (S31: No), the second process ends.

On the other hand, if the series of sounds includes the predetermined sound (S31: Yes), the extraction unit 14 extracts the series of sounds input after the predetermined sound as the sound data of the section specified based on the predetermined sound (S32).

The division unit 15 divides the sound data of the specified section into a plurality of sound data (S33). The transmission unit 13 rearranges the order of the plurality of sound data and transmits them to one or more of the first voice recognition server 20, the second voice recognition server 30, and the third voice recognition server 40 (S34). That is, the transmission unit 13 may both rearrange the order of the plurality of sound data and distribute them among a plurality of servers.

The reception unit 16 receives, from the one or more servers, a plurality of text data obtained by converting the plurality of sound data into text (S35). The synthesis unit 17 rearranges the plurality of text data, based on how the sound data was reordered and distributed among the servers, and synthesizes them into one text data (S36).

The correction unit 18 replaces words with low reliability with words estimated to be appropriate, based on the reliability of the text conversion by the one or more servers (S37).

The data conversion device 10 identifies where to add to the minutes based on predetermined characters in the obtained text data (S38). For example, it recognizes a character string or symbol representing a specific agenda item and appends the obtained text data to the part of the minutes where that agenda item is described. The data conversion device 10 also identifies the person in charge based on a person's name in the obtained text data (S39), and may add the identified name to the minutes as the person in charge of the corresponding agenda item.

Finally, the data conversion device 10 adds the current date to the minutes as the entry date and updates the minutes (S40). The time at which the meeting was held may be added as well as the date. This completes the second process.

FIG. 8 is a diagram showing another example of sections of sound data specified by the data conversion device 10 according to the present embodiment. The figure shows another example of the waveform of sound data recorded at a meeting. The sound data in this example includes a fifth section A5, a sixth section A6, a seventh section A7, and an eighth section A8. The fifth section A5 corresponds to remarks that do not need to be recorded in the minutes, the sixth section A6 is a nearly silent section of about 2 seconds, the seventh section A7 corresponds to the utterance "Minutes, please", which is set as the predetermined utterance, and the eighth section A8 corresponds to remarks that need to be recorded in the minutes.

In the data conversion device 10, the determination unit 11 determines whether or not the input series of sounds includes the predetermined sound. Whether the predetermined sound is included may be determined by whether the waveform of the predetermined sound is included. In this example, the determination unit 11 determines that the predetermined sound is included when the waveforms of the sixth section A6 and the seventh section A7 are input. That is, the determination unit 11 determines whether the series of sounds includes the predetermined sound by whether "Minutes, please" was said after about 2 seconds of silence.
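The "silence followed by a fixed phrase" rule can be sketched on already-segmented input. This is a simplification: the text describes matching waveforms, whereas the sketch assumes the audio has been cut into labelled segments with durations. The segment format, the "silence" label, and the 2.0-second minimum are hypothetical.

```python
def section_start_after_spoken_trigger(segments, phrase, min_silence_sec=2.0):
    """Return the index of the first segment after the spoken trigger, or -1.

    segments is a list of (label, duration_sec) pairs. The trigger is a
    "silence" segment of at least min_silence_sec immediately followed by
    a segment whose label equals the trigger phrase.
    """
    for i in range(len(segments) - 1):
        label, duration = segments[i]
        next_label, _next_duration = segments[i + 1]
        if label == "silence" and duration >= min_silence_sec and next_label == phrase:
            return i + 2
    return -1

segments = [
    ("casual talk", 30.0),            # like section A5
    ("silence", 2.1),                 # like section A6
    ("minutes please", 1.2),          # like section A7
    ("ABC parameter is 1000", 3.0),   # like section A8
]
print(section_start_after_spoken_trigger(segments, "minutes please"))
```

Requiring the pause makes it less likely that the phrase triggers recording when it merely comes up in ordinary conversation.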

When the determination unit 11 determines that the series of sounds includes the predetermined sound, the sound data storage unit 12 stores the sound of the eighth section A8, which is input after the predetermined sound, as the sound data of the section specified based on the predetermined sound. The data conversion device 10 then transmits the sound data of the eighth section A8 to the first voice recognition server 20 or the like and receives text data corresponding to its content.

In this way, by determining whether the input series of sounds includes the predetermined sound according to whether an utterance following a predetermined rule was made, there is no need to prepare a means of sounding a special sound (for example, a physical bell or an electronically synthesized bell) as the predetermined sound, and the instruction to create minutes can be given more easily.

FIG. 9 is a diagram showing an example in which sections of sound data are designated with the data conversion device 10 according to the present embodiment. The figure shows an example in which a ninth section A9 and a tenth section A10 are designated in the sound data of FIG. 8, which includes the fifth section A5, the sixth section A6, the seventh section A7, and the eighth section A8.

The data conversion device 10 may display the waveform of the recorded sound data and the recognized sections of the sound data (in this example, the fifth section A5, the sixth section A6, the seventh section A7, and the eighth section A8) on the display unit 10f, and accept corrections or additions of sections from the user via a pointing device or the like included in the input unit 10e. For example, a user may initially have considered it sufficient to record in the minutes only the remarks made after the predetermined sound, that is, the remarks in the eighth section A8, and later wish to record part of what was discussed in the fifth section A5 as well. In such a case, the user can designate the sections of sound data to be extracted with a pointer PT or the like. In this example, the user has designated the ninth section A9 and the tenth section A10 as new sections to be extracted.

The data conversion device 10 transmits the sound data of the newly designated ninth section A9 and tenth section A10 to the first voice recognition server 20 or the like, receives the resulting text data, and appends that text data at the appropriate place in the minutes.
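Handling user-designated sections comes down to turning time ranges into slices of the stored recording. The sketch below also merges ranges that overlap so the same audio is not sent twice; that merging step, the time values, and the sample rate are assumptions not found in the text.

```python
def merge_sections(sections):
    """Merge overlapping (start_sec, end_sec) ranges and sort them by start."""
    merged = []
    for start, end in sorted(sections):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def extract_sections(samples, sections, sample_rate):
    """Slice the recording into one list of samples per section."""
    return [samples[int(start * sample_rate):int(end * sample_rate)]
            for start, end in sections]

# Hypothetical times: one detected section plus two drawn by the user.
detected = [(40.0, 55.0)]
user_added = [(5.0, 12.0), (10.0, 18.0)]
sections = merge_sections(detected + user_added)
print(sections)
```

Each extracted slice could then go through the same divide, send, and merge steps as an automatically detected section.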

In this way, displaying the sections of sound data to be converted into text so that they can be checked visually, and allowing them to be corrected or added to, lets the sound data to be converted into text be selected more flexibly and improves the convenience of the data conversion device 10.

The embodiments described above are intended to facilitate understanding of the present invention, not to limit its interpretation. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and can be changed as appropriate. Configurations shown in different embodiments can also be partially replaced or combined.

10 … data conversion device, 10a … CPU, 10b … RAM, 10c … ROM, 10d … communication unit, 10e … input unit, 10f … display unit, 11 … determination unit, 12 … sound data storage unit, 13 … transmission unit, 14 … extraction unit, 15 … division unit, 16 … reception unit, 17 … synthesis unit, 18 … correction unit, 19 … minutes storage unit, 20 … first voice recognition server, 30 … second voice recognition server, 40 … third voice recognition server, N … communication network

Claims

A data conversion device comprising:
a determination unit that determines whether or not a predetermined sound is included in a series of input sounds;
a storage unit that stores at least sound data of a section of the series of sounds, the section being specified based on the predetermined sound;
a division unit that divides the sound data of the section into a plurality of sound data;
a transmission unit that, when the determination unit determines that the series of sounds includes the predetermined sound, rearranges the order of the plurality of sound data and distributes and transmits each of the plurality of sound data, among a plurality of servers that generate text data based on sound data, to at least one server selected based on information on the speaker of that sound data;
a reception unit that receives, from the at least one server, a plurality of text data generated from the respective plurality of sound data, each including a reliability for each word;
a synthesis unit that synthesizes the plurality of text data into one text data based on the rearrangement of the order of the plurality of sound data by the transmission unit; and
a correction unit that corrects words whose reliability is at or below a certain value, based on a language model trained with past documents as training data.

The data conversion device according to claim 1, wherein, when the determination unit determines that the series of sounds includes the predetermined sound, the storage unit stores at least a part of the series of sounds input after the predetermined sound as the sound data of the section.

The data conversion device according to claim 1, wherein the storage unit stores data of the series of sounds, and
the device further comprises an extraction unit that extracts, from the data of the series of sounds stored in the storage unit, at least a part of the series of sounds input after the predetermined sound as the sound data of the section.

The data conversion device according to any one of claims 1 to 3, wherein the correction unit corrects, among the words included in the one text data synthesized by the synthesis unit, the words whose reliability is at or below the certain value.

The data conversion device according to any one of claims 1 to 4, wherein the division unit divides the sound data of the section into the plurality of sound data based on the amplitude of the waveform of the sound data of the section.