JP7406921B2

JP7406921B2 - Information processing device, information processing method and program

Info

Publication number: JP7406921B2
Application number: JP2019056140A
Authority: JP
Inventors: 大樹石浦; 光平武田
Original assignee: 株式会社Ｎｔｔデータグループ
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2023-12-28
Anticipated expiration: 2039-03-25
Also published as: JP2020160118A

Description

The present invention relates to an information processing device, an information processing method, and a program.

In a speech recognition device intended for unspecified speakers, a speech recognition dictionary centered on general-purpose, common vocabulary is registered in advance, and the device recognizes speech based on that registered dictionary. When the vocabulary to be recognized can be specified at design time, a dictionary prepared in advance is used. When the vocabulary cannot be specified, or should change dynamically, it is typically entered manually, or vocabulary for speech recognition is generated automatically from character string information and registered in the dictionary.

In addition, recent speech recognition devices register paraphrased expressions such as abbreviations in the speech recognition dictionary, so they can handle not only the formal utterance of a word but also arbitrary abbreviated utterances by the user.

For example, Patent Document 1 discloses a speech recognition device that can recognize even abbreviated paraphrases of words with a high recognition rate.

Japanese Patent No. 3724649

However, with the speech recognition device disclosed in Patent Document 1, registering new words that are neither general-purpose nor common (special terms), such as company-specific in-house terminology or special terms that come up in a particular meeting or lecture, in the speech recognition dictionary requires manual input. The human workload, such as selecting and entering the words to register, is heavy. From the standpoint of suitably generating a speech recognition dictionary, the device is therefore still insufficient.

The present invention has been made in view of the above circumstances, and its object is to provide an information processing device, an information processing method, and a program that can suitably generate a dictionary for speech recognition.

To achieve the above object, an information processing device according to a first aspect of the present invention comprises:
speech recognition result receiving means for receiving a first speech recognition result based on a first dictionary, and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
confidence level receiving means for receiving a first confidence level for the first speech recognition result, calculated based on a predetermined calculation, and a second confidence level for the second speech recognition result, calculated based on the same calculation;
word information storage means for storing, when a comparison of the first confidence level with the second confidence level shows that the difference between them is larger than a predetermined value, the word information included in the second speech recognition result as an update list for the first dictionary; and
transmitting means for transmitting an update instruction to add the word information included in the update list stored by the word information storage means to the first dictionary.

The device may further comprise extraction means for extracting, according to a predetermined criterion, the word information to be stored from the second speech recognition result when the difference between the confidence levels is large,
and the word information storage means may store the word information extracted by the extraction means.

The device may further comprise classification means for classifying the word information extracted by the extraction means into one of a plurality of classifications predetermined according to appearance frequency,
and the word information storage means may store the word information classified by the classification means for each classification.

The word information may include audio information and text information,
the device may further comprise first dictionary updating means for updating the first dictionary by adding to it the word information stored as the update list by the word information storage means,
and the second dictionary may be newly stored by a user operation each time the first dictionary is updated.

To achieve the above object, an information processing method according to a second aspect of the present invention comprises:
a speech recognition result receiving step of receiving a first speech recognition result based on a first dictionary, and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
a confidence level receiving step of receiving a first confidence level for the first speech recognition result, calculated based on a predetermined calculation, and a second confidence level for the second speech recognition result, calculated based on the same calculation;
a word information storing step of storing, when a comparison of the first confidence level with the second confidence level shows that the difference between them is larger than a predetermined value, the word information included in the second speech recognition result as an update list for the first dictionary; and
a transmitting step of transmitting an update instruction to add the word information included in the update list stored in the word information storing step to the first dictionary.

To achieve the above object, a program according to a third aspect of the present invention causes a computer to function as:
speech recognition result receiving means for receiving a first speech recognition result based on a first dictionary, and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
confidence level receiving means for receiving a first confidence level for the first speech recognition result, calculated based on a predetermined calculation, and a second confidence level for the second speech recognition result, calculated based on the same calculation;
word information storage means for storing, when a comparison of the first confidence level with the second confidence level shows that the difference between them is larger than a predetermined value, the word information included in the second speech recognition result as an update list for the first dictionary; and
transmitting means for transmitting an update instruction to add the word information included in the update list stored by the word information storage means to the first dictionary.

According to the present invention, a dictionary for speech recognition can be suitably generated.

FIG. 1 is a block diagram showing an example of an information processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of an information processing device according to the embodiment of the present invention.
FIG. 3 is a block diagram showing an example of a speech recognition server according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram for explaining the overall processing of the information processing system.
FIG. 5 is a flowchart showing an example of the term registration process.
FIG. 6 is a diagram showing an example of speech recognition results.
FIG. 7 is a diagram showing an example of the morphemes and parts of speech of speech recognition results.

The information processing device 100 according to the present invention is described using an example in which it is applied to the information processing system 1 shown in FIG. 1. In the information processing system 1, as shown in FIG. 1, information processing devices 100A and 100B and a speech recognition server 200 are communicably connected via a network 510. For ease of understanding, this embodiment is described using an example in which the user of the information processing device 100A and the user of the information processing device 100B converse with each other. The information processing device 100A and the information processing device 100B are also referred to simply as the information processing device 100.

The information processing device 100 is an information terminal (a so-called computer) such as a mobile phone, a smartphone, a tablet, or a PC (Personal Computer), and forms part of a distributed network 510 such as a P2P (Peer to Peer) network. The information processing system 1 is not limited to a P2P system and may be, for example, a cloud computing system.

The information processing device 100 has a function of outputting the audio data and text data (speech recognition results) of the conversation of the user of another information processing device 100, as received from the speech recognition server 200. The information processing device 100 also has a function of extracting, based on the confidence levels received from the speech recognition server 200, the word information to be registered from the speech recognition results and registering it in the speech recognition dictionary.

The speech recognition server 200 is any computer device, such as a mainframe, a workstation, or a PC (Personal Computer). The speech recognition server 200 has a function of recognizing the speech (conversation content) transmitted from an information processing device 100 based on a speech recognition dictionary stored in advance, and transmitting the recognized audio data together with text data (as a speech recognition result) to the other information processing device 100. The speech recognition server 200 also has a function of calculating a confidence level, which indicates the probability that the vocabulary obtained as the speech recognition result matches the vocabulary actually uttered, and transmitting it to the other information processing device 100.

Next, the configuration of the information processing device 100 in this embodiment (the information processing device 100A and the information processing device 100B shown in FIG. 1) is described with reference to FIG. 2. Although not illustrated, the device is assumed to include a functional unit that converts the user's conversation (speech) into audio data for transmission (analog to digital) and performs the reverse conversion.

As shown in FIG. 2, the information processing device 100 includes a storage unit 110, a control unit 120, an input/output unit 130, a communication unit 140, and a system bus (not shown) that interconnects them.

The storage unit 110 includes a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The ROM stores the programs executed by the CPU (Central Processing Unit) of the control unit 120 and the data required in advance to execute them (not shown).

Specifically, the storage unit 110 in this embodiment stores, as a registered term list 111, the audio data and the text data of words to be registered in the speech recognition dictionary. Audio data and its corresponding text data are together referred to as word information. The registered term list 111 is a list of the word information to be registered and contains multiple items of word information. The word information in the registered term list 111 is stored in the storage unit 110 by classification through the term registration process described later. The storage unit 110 also stores, as a registration classification 112, a list of the classifications specified by the user and their classification criteria. Examples of registered classifications are "universally used in-house terms" and "organization-specific terms used within a particular organization". The classification criterion can be set freely by the user. For example, the frequency with which the candidate word information appears in a conversation is recorded; word information that appears five or more times is classified as a "universally used in-house term", and word information that appears fewer than five times is classified as an "organization-specific term used within a particular organization".
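As an illustration only, the frequency-based criterion in the example above could be sketched in Python as follows. The function names, the category labels, and the idea of passing terms as a flat list of strings are assumptions made for this sketch; the five-occurrence boundary comes from the example in the text and is user-configurable.

```python
from collections import Counter

# Category labels mirror the two example classifications in the text.
UNIVERSAL = "universally used in-house term"
ORG_SPECIFIC = "organization-specific term"


def classify_by_frequency(occurrences, min_count=5):
    """Return the classification for a term that appeared `occurrences` times."""
    return UNIVERSAL if occurrences >= min_count else ORG_SPECIFIC


def group_terms_by_classification(candidate_terms, min_count=5):
    """Count each candidate term and group the terms by classification.

    `candidate_terms` is a hypothetical flat list with one entry per
    appearance of a term in the conversation.
    """
    counts = Counter(candidate_terms)
    grouped = {UNIVERSAL: [], ORG_SPECIFIC: []}
    for term, n in counts.items():
        grouped[classify_by_frequency(n, min_count)].append(term)
    return grouped
```

A term that appears exactly five times falls into the first classification, because the example states "five or more".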

The control unit 120 includes a CPU, an ASIC (Application Specific Integrated Circuit), and the like. The control unit 120 operates according to the programs stored in the storage unit 110 and executes the processing they specify. As the main functional units provided by the programs stored in the storage unit 110, the control unit 120 includes a confidence comparison unit 121, a morpheme extraction unit 122, a part-of-speech estimation unit 123, a term classification unit 124, and a term registration unit 125.

The confidence comparison unit 121 is a functional unit that compares the confidence levels transmitted from the speech recognition server 200. As described in detail later, the speech recognition server 200 transmits the speech recognition result obtained when the first registered term list 211 is used as the speech recognition dictionary (first dictionary), that is, text data based on the first registered term list 211 and its audio data, together with its confidence level A (first confidence level). It also transmits the speech recognition result obtained when the second registered term list 212 is used as the speech recognition dictionary (second dictionary), that is, text data based on the second registered term list 212 and its audio data, together with its confidence level B (second confidence level). The confidence comparison unit 121 compares confidence level A with confidence level B. Specifically, it determines whether the value obtained by subtracting confidence level A from confidence level B is greater than or equal to a predetermined threshold. The user may set different threshold values depending on, for example, the content of the meeting or the language used.
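The comparison described above reduces to a single inequality, B - A >= threshold. A minimal Python sketch follows; the function name and the use of floating-point confidence values are assumptions, since the text does not state how confidence levels are represented.

```python
def exceeds_threshold(confidence_a, confidence_b, threshold):
    """Return True when confidence B exceeds confidence A by at least `threshold`.

    confidence_a: confidence of the result based on the first dictionary.
    confidence_b: confidence of the result based on the second dictionary.
    """
    return (confidence_b - confidence_a) >= threshold
```

Only a result for which the second dictionary is clearly more confident than the first passes the check, which is what makes it a candidate source of terms missing from the first dictionary.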

The morpheme extraction unit 122 is a functional unit that divides each of the speech recognition result obtained with the first registered term list 211 as the dictionary (first speech recognition result) and the speech recognition result obtained with the second registered term list 212 as the dictionary (second speech recognition result) into morphemes, for example by morphological analysis, and extracts the morphemes that differ. Specifically, the morpheme extraction unit 122 extracts the differing morphemes by removing, from the morphemes of the second speech recognition result, those that it has in common with the morphemes of the first speech recognition result.
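The subtraction of common morphemes can be sketched as follows, assuming both recognition results have already been split into lists of morpheme strings by some morphological analyzer (the analyzer itself is outside this sketch, and the function name is hypothetical).

```python
def extract_differing_morphemes(first_morphemes, second_morphemes):
    """Return the morphemes of the second result that do not occur in the first.

    Order of appearance in the second result is preserved.
    """
    common = set(first_morphemes) & set(second_morphemes)
    return [m for m in second_morphemes if m not in common]
```

For instance, if the first result was split into `["we", "use", "day", "tab", "ace", "daily"]` and the second into `["we", "use", "database", "daily"]`, only `"database"` remains after the common morphemes are removed.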

The part-of-speech estimation unit 123 is a functional unit that extracts morphemes with differing parts of speech by comparing the parts of speech of the morphemes in the first and second speech recognition results. Specifically, the part-of-speech estimation unit 123 compares the morphemes of the first speech recognition result with those of the second, and extracts each morpheme whose part of speech in the second speech recognition result is a noun while the corresponding morpheme in the first speech recognition result is not a noun. That is, the morpheme extraction unit 122 extracts from the second speech recognition result the morphemes whose words (character strings) differ from the first speech recognition result, whereas the part-of-speech estimation unit 123 extracts from the second speech recognition result the morphemes whose parts of speech differ from the first. In other words, the morpheme extraction unit 122 extracts morphemes from the viewpoint of character strings, and the part-of-speech estimation unit 123 extracts morphemes from the viewpoint of parts of speech.
Special terms such as "universally used in-house terms" and "organization-specific terms used within a particular organization" are usually nouns. For this reason, the part-of-speech estimation unit 123 in this embodiment extracts morphemes that are nouns in the second speech recognition result but not nouns in the first. Alternatively, morphemes with differing parts of speech may simply be output to the input/output unit 130 so that the user can choose whether to extract them.
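A rough sketch of the part-of-speech check follows. The text does not say how morphemes of the two results are paired, so this sketch assumes they have already been aligned one-to-one and are supplied as equal-length lists of `(surface, part_of_speech)` tuples; the tag string `"noun"` and the function name are likewise assumptions.

```python
def extract_noun_mismatches(first_tagged, second_tagged, noun_tag="noun"):
    """Return surfaces that are nouns in the second result but not in the first.

    Both arguments are position-aligned lists of (surface, part_of_speech)
    tuples; the alignment step itself is assumed to be done elsewhere.
    """
    return [
        surface_2
        for (_, pos_1), (surface_2, pos_2) in zip(first_tagged, second_tagged)
        if pos_2 == noun_tag and pos_1 != noun_tag
    ]
```

A position where both results contain a noun is not extracted, even if the surface strings differ; that case is covered by the character-string comparison of the morpheme extraction unit 122.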

The term classification unit 124 is a functional unit that determines whether a morpheme extracted by the morpheme extraction unit 122 matches a morpheme extracted by the part-of-speech estimation unit 123, recognizes the morpheme as a registration target when they match, and classifies the word information of that morpheme according to the registration classification 112. Specifically, when the extracted morphemes match, the term classification unit 124 classifies the word information to be registered into one of the registered classifications, based on its appearance frequency and in accordance with the classification criteria set in the registration classification 112.
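The matching step can be sketched as an order-preserving intersection of the two extraction results. The function name and the list-of-strings representation are assumptions for illustration; classification by appearance frequency would then be applied to the returned terms.

```python
def select_registration_targets(by_string, by_part_of_speech):
    """Return morphemes found by both extraction steps, without duplicates.

    by_string: morphemes extracted by character-string comparison.
    by_part_of_speech: morphemes extracted by part-of-speech comparison.
    """
    pos_set = set(by_part_of_speech)
    seen = set()
    targets = []
    for morpheme in by_string:
        if morpheme in pos_set and morpheme not in seen:
            seen.add(morpheme)
            targets.append(morpheme)
    return targets
```

Requiring agreement between the two viewpoints filters out morphemes that merely differ in spelling but are not noun-like terms, and vice versa.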

The term registration unit 125 is a functional unit that registers the word information classified by the term classification unit 124 in the registered term list 111 for each classification. The term registration unit 125 also has a function of transmitting to the speech recognition server 200 an update instruction that causes the content of the first registered term list 211 to be updated based on the word information registered in the registered term list 111. The term registration unit 125 functions as word information registration means.
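The text does not define the format of the update instruction. Purely as an illustration, one could imagine a JSON message like the one built below; every field name here is hypothetical, and the audio data that is part of the word information is omitted for brevity.

```python
import json


def build_update_instruction(registered_terms):
    """Serialize classified terms into a hypothetical update-instruction message.

    registered_terms: dict mapping a classification name to a list of term texts.
    """
    entries = [
        {"classification": classification, "text": text}
        for classification, texts in registered_terms.items()
        for text in texts
    ]
    message = {"action": "add_to_first_dictionary", "entries": entries}
    # ensure_ascii=False keeps non-ASCII terms (e.g. Japanese) readable.
    return json.dumps(message, ensure_ascii=False)
```

The server side would parse such a message and append each entry to the first registered term list 211.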

These functional units cooperate to realize, in the information processing device 100, the function of registering the target word information in the speech recognition dictionary.

The input/output unit 130 is a device for inputting and outputting data, and includes a keyboard, a mouse, a camera, a microphone, a liquid crystal display, an organic EL (Electro-Luminescence) display, and the like.

The communication unit 140 is a device for communicating with other information processing devices 100 and the speech recognition server 200 via the network 510.

This concludes the configuration of the information processing device 100. Next, the configuration of the speech recognition server 200 in this embodiment is described with reference to FIG. 3. As shown in FIG. 3, the speech recognition server 200 includes a storage unit 210, a control unit 220, an input/output unit 230, a communication unit 240, and a system bus (not shown) that interconnects them.

The storage unit 210 includes a ROM, a RAM, and the like. The ROM stores the programs executed by the CPU of the control unit 220 and the data required in advance to execute them (not shown).

Specifically, the storage unit 210 in this embodiment stores a first registered term list 211 and a second registered term list 212 as speech recognition dictionaries. The first registered term list 211 is a list of word information, and its registered word information is updated each time the term registration process described later is executed. The initial first registered term list 211 need only be a list of word information centered on general-purpose, common vocabulary; for example, it may be generated by the user or obtained by downloading a list published on a network.

The second registered term list 212, on the other hand, is a list of word information generated by the user so that it contains more word information for special terms, such as "universally used in-house terms" and "organization-specific terms used within a particular organization", than the first registered term list 211 does. The second registered term list 212 may be generated by the user for each meeting or lecture, for example based on the materials for the scheduled meeting or lecture. In the information processing device 100 in this embodiment, a new second registered term list 212 is stored, for example, for each meeting (in other words, each time the first registered term list 211 is updated), and the term registration process described later is performed. In the term registration process, the target word information is registered through a comparison between the second registered term list 212 and the first registered term list 211. As a result, special terms such as "universally used in-house terms" can be suitably registered in the speech recognition dictionary, and the dictionary can be kept up to date by executing the process repeatedly.

The control unit 220 includes a CPU, an ASIC, and the like. The control unit 220 operates according to the programs stored in the storage unit 210 and executes the processing they specify. As the main functional units provided by the programs stored in the storage unit 210, the control unit 220 includes a speech recognition processing unit 221 and a confidence calculation unit 222.

The speech recognition processing unit 221 is a functional unit that converts, for example, audio data received from an information processing device 100 into both text data based on the first registered term list 211 and text data based on the second registered term list 212. The conversion from audio data to text data may be performed by conventional speech recognition technology using the first registered term list 211 and the second registered term list 212. The speech recognition processing unit 221 also has a function of transmitting each item of converted text data, together with the audio data, to the other information processing device 100.

The confidence calculation unit 222 is a functional unit that calculates the confidence level corresponding to the text data converted by the speech recognition processing unit 221. Specifically, the confidence calculation unit 222 calculates the confidence level A of the text data based on the first registered term list 211 and the confidence level B of the text data based on the second registered term list 212. The confidence level may be calculated (based on a predetermined calculation) from, for example, the similarity between the speech features (waveform, period, and so on) of the word information registered in the first registered term list 211 or the second registered term list 212 and the speech features of the received audio data. The confidence calculation unit 222 also has a function of transmitting each calculated confidence level to the other information processing device 100.
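The text specifies only that the confidence level is derived from the similarity of speech features; it does not name a similarity measure. The sketch below substitutes cosine similarity between fixed-length feature vectors purely for illustration, and takes the best match over a dictionary as the confidence. The vector representation, the function names, and the choice of cosine similarity are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def confidence(input_features, dictionary_features):
    """Best similarity between the input features and any dictionary entry.

    dictionary_features: dict mapping a term to its (hypothetical) feature vector.
    Returns 0.0 for an empty dictionary.
    """
    return max(
        (cosine_similarity(input_features, f) for f in dictionary_features.values()),
        default=0.0,
    )
```

Running the same `confidence` function once against each dictionary yields the pair of values A and B, which satisfies the requirement that both be calculated by the same predetermined calculation.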

These functional units cooperate to realize, in the speech recognition server 200, the function of converting the audio data received from an information processing device 100 into the respective text data (speech recognition) and transmitting it, together with the audio data, as speech recognition results to the other information processing device 100. They also realize the function of transmitting the confidence levels to the other information processing device 100.

入出力部２３０は、キーボード、マウス、カメラ、マイク、液晶ディスプレイ、有機ＥＬディスプレイ等から構成され、データの入出力を行うための装置である。 The input/output unit 230 is a device for inputting and outputting data, and is composed of a keyboard, a mouse, a camera, a microphone, a liquid crystal display, an organic EL display, and the like.

通信部２４０は、情報処理装置１００とネットワーク５１０を介して通信を行うためのデバイスである。 The communication unit 240 is a device for communicating with the information processing apparatus 100 via the network 510.

以上が、音声認識サーバ２００の構成である。続いて情報処理装置１００の動作などについて、図４～図７を参照して説明する。まず、情報処理システム１の動作として、全体的な処理の流れについて、図４を参照して説明する。なお、図示する例では、情報処理装置１００Ｂのユーザが情報処理装置１００Ａのユーザに対して例文１の内容の発言した場合を例に、以下説明する。 The above is the configuration of the voice recognition server 200. Next, the operation of the information processing device 100 will be explained with reference to FIGS. 4 to 7. First, as an operation of the information processing system 1, the overall processing flow will be described with reference to FIG. In the illustrated example, a case will be described below in which the user of the information processing device 100B makes a statement of the content of example sentence 1 to the user of the information processing device 100A.

図４に示すように、情報処理装置１００Ｂのユーザが入出力部１３０に例文１の音声を入力すると、制御部１２０の機能により音声データに変換され、当該音声データが音声認識サーバ２００へ送信される（図４の（１））。なお、図示する例では、理解を容易にするため、情報処理装置１００Ｂから音声認識サーバ２００へ当該音声データが送信される例を示しているが、例えば、情報処理装置１００Ｂから情報処理装置１００Ａへと音声データが送信され、当該情報処理装置１００Ａにて抽出した特定の音声データが音声認識サーバ２００へ送信されるようにしてもよい。 As shown in FIG. 4, when the user of the information processing device 100B inputs the speech of example sentence 1 into the input/output unit 130, the speech is converted into voice data by the control unit 120, and the voice data is transmitted to the speech recognition server 200 ((1) in FIG. 4). For ease of understanding, the illustrated example shows the voice data being transmitted from the information processing device 100B to the speech recognition server 200; however, the voice data may instead be transmitted from the information processing device 100B to the information processing device 100A, and specific voice data extracted by the information processing device 100A may then be transmitted to the speech recognition server 200.

音声認識サーバ２００は、情報処理装置１００Ｂから音声データを受信すると、音声認識処理部２２１の機能により、第１登録用語一覧２１１に基づいて音声認識を行い（テキストデータへ変換し）、音声データとテキストデータを、第１音声認識結果として情報処理装置１００Ａへ送信する（図４の（２））。また、音声認識サーバ２００は、確信度算出部２２２の機能により、第１登録用語一覧２１１に基づく音声認識の確信度Ａを算出し、情報処理装置１００Ａへ送信する（図４の（３））。 Upon receiving voice data from the information processing device 100B, the speech recognition server 200 uses the speech recognition processing unit 221 to perform speech recognition (conversion into text data) based on the first registered term list 211, and transmits the voice data and the text data to the information processing device 100A as the first speech recognition result ((2) in FIG. 4). The speech recognition server 200 also uses the confidence level calculation unit 222 to calculate confidence level A of the speech recognition based on the first registered term list 211, and transmits it to the information processing device 100A ((3) in FIG. 4).

また、音声認識サーバ２００は、音声認識処理部２２１の機能により、第２登録用語一覧２１２に基づいて音声認識を行い（テキストデータへ変換し）、音声データとテキストデータを、第２音声認識結果として情報処理装置１００Ａへ送信する（図４の（４））。また、音声認識サーバ２００は、確信度算出部２２２の機能により、第２登録用語一覧２１２に基づく音声認識の確信度Ｂを算出し、情報処理装置１００Ａへ送信する（図４の（５））。なお、図４の（２）～（５）は、まとめて行われてもよい。 The speech recognition server 200 also uses the speech recognition processing unit 221 to perform speech recognition (conversion into text data) based on the second registered term list 212, and transmits the voice data and the text data to the information processing device 100A as the second speech recognition result ((4) in FIG. 4). In addition, the speech recognition server 200 uses the confidence level calculation unit 222 to calculate confidence level B of the speech recognition based on the second registered term list 212, and transmits it to the information processing device 100A ((5) in FIG. 4). Note that (2) to (5) in FIG. 4 may be performed together.

情報処理装置１００Ａの側では、音声認識サーバ２００から受信した、第２登録用語一覧２１２に基づく音声データとテキストデータを、入出力部１３０から出力する（図６（Ｂ）に示す内容が出力される）。また、情報処理装置１００Ａは、音声認識サーバ２００から第１音声認識結果と第２音声認識結果（確信度Ａおよび確信度Ｂも含む）を受信すると（音声認識結果受信手段および確信度受信手段に相当）、登録対象となる特殊用語を当該音声認識用の辞書に登録するための用語登録処理を行う。すなわち、情報処理装置１００Ａは、情報処理装置１００Ｂのユーザの発言に含まれる特殊用語を音声認識用の辞書に登録するための処理を行う。なお、以下では、図６（Ａ）に示す内容の音声データおよびテキストデータを第１音声認識結果として受信し、図６（Ｂ）に示す内容の音声データおよびテキストデータを第２音声認識結果として受信し、当該第２音声認識結果の「ＮＴＴ」を、特殊用語として登録する場合について説明する（確信度についても図示する値であるとする）。 The information processing device 100A outputs, from the input/output unit 130, the voice data and text data based on the second registered term list 212 received from the speech recognition server 200 (the content shown in FIG. 6(B) is output). In addition, upon receiving the first speech recognition result and the second speech recognition result (including confidence level A and confidence level B) from the speech recognition server 200 (this corresponds to the speech recognition result receiving means and the confidence level receiving means), the information processing device 100A performs term registration processing for registering the special terms to be registered in the speech recognition dictionary. That is, the information processing device 100A performs processing for registering special terms contained in the utterances of the user of the information processing device 100B in the speech recognition dictionary. The following describes a case in which voice data and text data with the content shown in FIG. 6(A) are received as the first speech recognition result, voice data and text data with the content shown in FIG. 6(B) are received as the second speech recognition result, and "NTT" in the second speech recognition result is registered as a special term (the confidence levels are also assumed to be the values shown in the figures).

図５は、用語登録処理の一例を示すフローチャートである。用語登録処理において、情報処理装置１００Ａは、確信度比較部１２１の機能により、確信度Ｂから確信度Ａを減算した値が、予め定められた閾値以上であるか否か（予め定められた条件を満たすか否か）を判定する（ステップＳ１０１）。閾値未満である場合、情報処理装置１００Ａは、登録すべき対象が存在しないものとして、そのまま用語登録処理を終了する。具体的に、ステップＳ１０１の処理では、図６（Ｂ）に示す確信度０．８９から図６（Ａ）に示す確信度０．１６を減算し、閾値以上であるか否かを判定する。なお、この例における閾値は、０．５として予めユーザにより設定されているものとする。 FIG. 5 is a flowchart illustrating an example of term registration processing. In the term registration process, the information processing device 100A uses the function of the certainty comparison unit 121 to determine whether the value obtained by subtracting the certainty A from the certainty B is equal to or greater than a predetermined threshold (predetermined conditions (step S101). If it is less than the threshold, the information processing device 100A concludes that there is no target to be registered and ends the term registration process. Specifically, in the process of step S101, the certainty factor 0.16 shown in FIG. 6(A) is subtracted from the certainty factor 0.89 shown in FIG. 6(B), and it is determined whether the certainty factor is equal to or greater than a threshold value. Note that the threshold value in this example is assumed to be set in advance by the user as 0.5.
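
The check in step S101 can be sketched as follows, using the values given above (confidence level A = 0.16, confidence level B = 0.89, threshold 0.5). The `RecognitionResult` type and the function name are illustrative assumptions, not names from the patent.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    """Hypothetical container for one speech recognition result."""
    text: str
    confidence: float


def should_register(first, second, threshold=0.5):
    """Step S101: True when (confidence B - confidence A) is at or above the threshold."""
    return (second.confidence - first.confidence) >= threshold


first = RecognitionResult(text="Venditti", confidence=0.16)   # first registered term list
second = RecognitionResult(text="NTT", confidence=0.89)       # second registered term list
print(should_register(first, second))  # 0.89 - 0.16 = 0.73 >= 0.5, so True
```

When the function returns False, the term registration process would simply end, as described for the below-threshold case.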

閾値以上である場合（ステップＳ１０１；Ｙｅｓ）、情報処理装置１００Ａは、形態素抽出部１２２の機能により、音声認識サーバ２００から受信した第１音声認識結果と第２音声認識結果のそれぞれを形態素毎に分割し、異なる形態素を第２音声認識結果から抽出する（ステップＳ１０２）。なお、ステップＳ１０２では、第１音声認識結果のうちのテキストデータを形態素毎に分割し、異なる形態素を抽出した上で、当該形態素に対応する部分の音声データを抽出してもよい。また、第１音声認識結果のうちのテキストデータと音声データの両方を形態素毎に分割し、それぞれについて異なる形態素を抽出してもよい。具体的に、ステップＳ１０２では、図６（Ａ）および図７（Ａ）に示す「Ｖｅｎｄｉｔｔｉ」と図６（Ｂ）および図７（Ｂ）に示す「ＮＴＴ」の形態素が異なるため、図６（Ｂ）および図７（Ｂ）に示す「ＮＴＴ」の形態素を抽出する。なお、図６（Ａ）および図７（Ａ）に示す「Ｖｅｎｄｉｔｔｉ」はこの実施の形態にて理解を容易にするために用いた造語であり、品詞が形容詞であるものとする。また、以下では、当該「ＮＴＴ」の出現頻度が５回であり、今回の例文１にて６回の出現頻度となったものとする。 If the value is equal to or greater than the threshold (step S101; Yes), the information processing device 100A uses the morpheme extraction unit 122 to divide each of the first and second speech recognition results received from the speech recognition server 200 into morphemes, and extracts the differing morphemes from the second speech recognition result (step S102). In step S102, the text data of the first speech recognition result may be divided into morphemes and the differing morphemes extracted, after which the portion of the voice data corresponding to those morphemes is extracted. Alternatively, both the text data and the voice data of the first speech recognition result may be divided into morphemes, and the differing morphemes extracted from each. Specifically, in step S102, because the morpheme "Venditti" shown in FIGS. 6(A) and 7(A) differs from the morpheme "NTT" shown in FIGS. 6(B) and 7(B), the morpheme "NTT" shown in FIGS. 6(B) and 7(B) is extracted. Note that "Venditti" shown in FIGS. 6(A) and 7(A) is a coined word used in this embodiment to facilitate understanding, and its part of speech is assumed to be an adjective. In the following, it is also assumed that "NTT" had previously appeared five times and that example sentence 1 brings its appearance count to six.
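
Step S102 can be sketched as a sequence comparison over morphemes. The sketch assumes that morphological analysis has already produced a list of morphemes for each result; it uses Python's standard `difflib.SequenceMatcher` to align the two lists, which is one possible comparison method rather than the one used in the patent. The token lists are made up for illustration, since the full text of example sentence 1 is not reproduced here.

```python
from difflib import SequenceMatcher


def extract_differing_morphemes(first_morphemes, second_morphemes):
    """Step S102 sketch: return morphemes of the second result that differ from the first."""
    matcher = SequenceMatcher(a=first_morphemes, b=second_morphemes, autojunk=False)
    differing = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark spans present only in the second result.
        if tag in ("replace", "insert"):
            differing.extend(second_morphemes[j1:j2])
    return differing


# Hypothetical morpheme sequences (not the actual content of FIG. 6 or FIG. 7).
first = ["member", "of", "Venditti"]
second = ["member", "of", "NTT"]
print(extract_differing_morphemes(first, second))  # ['NTT']
```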

ステップＳ１０２の処理を実行した後、情報処理装置１００Ａは、品詞推定部１２３の機能により、第１音声認識結果の形態素と第２音声認識結果の形態素を比較し、第２音声認識結果の形態素の品詞が名詞であるものの、第１音声認識結果の形態素が名詞以外である形態素を抽出する（ステップＳ１０３）。なお、上述したように、ステップＳ１０３では、単に異なる品詞の形態素を入出力部１３０に出力し、ユーザにより抽出するか否かを選択させるようにしてもよい。具体的に、ステップＳ１０３の処理では、図７（Ａ）に示す「Ｖｅｎｄｉｔｔｉ」の品詞が「形容詞」であり、図７（Ｂ）に示す「ＮＴＴ」の品詞が「名詞」であることから、図７（Ｂ）に示す「ＮＴＴ」の形態素を抽出する。また、この実施の形態では、図７に示すように「ｏｆ」といった前置詞については、音声認識用の辞書への登録といった観点からすると不要な品詞であることから、比較対象外としている。 After executing the process of step S102, the information processing device 100A uses the function of the part-of-speech estimation unit 123 to compare the morphemes of the first speech recognition result and the morphemes of the second speech recognition result, and compares the morphemes of the second speech recognition result. A morpheme whose part of speech is a noun but whose morpheme in the first speech recognition result is other than a noun is extracted (step S103). Note that, as described above, in step S103, morphemes of different parts of speech may simply be output to the input/output unit 130, and the user may select whether or not to extract them. Specifically, in the process of step S103, since the part of speech of "Venditti" shown in FIG. 7(A) is "adjective" and the part of speech of "NTT" shown in FIG. 7(B) is "noun", The morpheme of "NTT" shown in FIG. 7(B) is extracted. Furthermore, in this embodiment, as shown in FIG. 7, a preposition such as "of" is not included in the comparison because it is an unnecessary part of speech from the perspective of registration in a speech recognition dictionary.
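
The part-of-speech comparison in step S103 might look like the following. The sketch assumes that the two results have already been aligned position by position and that each morpheme carries a part-of-speech label; the labels, the alignment, and the function name are assumptions made for illustration.

```python
def extract_new_nouns(first, second, ignored_pos=("preposition",)):
    """Step S103 sketch: morphemes that are nouns in the second result but not in the first.

    `first` and `second` are position-aligned lists of (surface, part_of_speech) pairs.
    Pairs involving an ignored part of speech (for example prepositions such as "of")
    are excluded from the comparison.
    """
    extracted = []
    for (_surface_a, pos_a), (surface_b, pos_b) in zip(first, second):
        if pos_a in ignored_pos or pos_b in ignored_pos:
            continue
        if pos_b == "noun" and pos_a != "noun":
            extracted.append(surface_b)
    return extracted


first = [("of", "preposition"), ("Venditti", "adjective")]
second = [("of", "preposition"), ("NTT", "noun")]
print(extract_new_nouns(first, second))  # ['NTT']
```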

ステップＳ１０３の処理を実行した後、情報処理装置１００Ａは、用語分類部１２４の機能により、ステップＳ１０２で抽出した形態素とステップＳ１０３で抽出した形態素とが一致するか否かを判定する（ステップＳ１０４）。一致していない場合（ステップＳ１０４；Ｎｏ）、用語登録処理を終了する。なお、一致していない場合、ステップＳ１０２で抽出した形態素とステップＳ１０３で抽出した形態素のそれぞれに対応する単語情報ついて、登録用語一覧１１１へ登録するか否かをユーザに選択させ、いずれも登録しない場合に当該用語登録処理を終了し、少なくともいずれかを登録する場合には、ステップＳ１０５の処理に移行すればよい。なお、この実施の形態では、ステップＳ１０２の処理およびステップＳ１０３の処理で抽出した形態素同士が一致するか否かを判定したが、ステップＳ１０２の処理のみ、またはステップＳ１０３の処理のみ行い、ステップＳ１０５の処理に移行してもよい。さらに、ステップＳ１０２～ステップＳ１０４の処理を実行せず、ステップＳ１０１にてＹｅｓと判定した場合には、ステップＳ１０５の処理へ移行してもよい。この場合、例えば、形態素毎の確信度が音声認識サーバ２００から送信されればよい。 After executing step S103, the information processing device 100A uses the term classification unit 124 to determine whether the morpheme extracted in step S102 matches the morpheme extracted in step S103 (step S104). If they do not match (step S104; No), the term registration process ends. Alternatively, when they do not match, the user may be asked to select whether to register in the registered term list 111 the word information corresponding to each of the morphemes extracted in step S102 and step S103; the term registration process ends if neither is registered, and proceeds to step S105 if at least one is registered. In this embodiment, it is determined whether the morphemes extracted in steps S102 and S103 match each other, but only step S102 or only step S103 may be performed before proceeding to step S105. Furthermore, steps S102 to S104 may be skipped, and the process may proceed directly to step S105 when the determination in step S101 is Yes. In this case, for example, the confidence level for each morpheme may be transmitted from the speech recognition server 200.
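
The agreement check of step S104 can be sketched as a comparison of the two extraction results. Treating "match" as equality of the two sets of morphemes is an interpretation chosen for this sketch, and the function name is hypothetical.

```python
def confirm_registration_targets(diff_morphemes, noun_morphemes):
    """Step S104 sketch: return the morphemes to register, or [] when the extractions disagree."""
    if diff_morphemes and set(diff_morphemes) == set(noun_morphemes):
        # Preserve first-seen order while dropping duplicates.
        return list(dict.fromkeys(diff_morphemes))
    return []


print(confirm_registration_targets(["NTT"], ["NTT"]))  # ['NTT']
```

An empty return value corresponds to the "No" branch, where the process either ends or, in the variation described above, asks the user what to register.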

一致していると判定した場合（ステップＳ１０４；Ｙｅｓ）、情報処理装置１００Ａは、用語分類部１２４の機能により、抽出した形態素に対応する単語情報を登録対象として認定し、認定した登録対象の形態素の単語情報を、登録分類１１２に基づく分類に基づいて分類する（ステップＳ１０５）。具体的に、ステップＳ１０５の処理では、「ＮＴＴ」の単語情報の出現頻度が６回であることから、当該「ＮＴＴ」は「普遍的に使用される社内用語」の分類に分類する。なお、「普遍的に使用される社内用語」には、例えば、複数のプロジェクトにおいて共通して使用される用語が含まれる。 If it is determined that they match (step S104; Yes), the information processing device 100A uses the function of the term classification unit 124 to certify the word information corresponding to the extracted morpheme as a registration target, and uses the function of the term classification unit 124 to certify the word information corresponding to the extracted morpheme as a registration target. The word information is classified based on the classification based on the registered classification 112 (step S105). Specifically, in the process of step S105, since the word information of "NTT" appears six times, "NTT" is classified into the "universally used in-house term" category. Note that "universally used in-house terms" includes, for example, terms that are commonly used in multiple projects.
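
A frequency-based classification such as step S105 can be sketched as below. The only data point given in this passage is that "NTT", with six appearances, falls under "universally used in-house term"; the threshold of 5 and the fallback label "other term" are assumptions made for illustration and are not taken from the registered classification 112.

```python
from collections import Counter

# Hypothetical rules: (minimum appearance count, classification label), highest first.
CLASSIFICATION_RULES = [
    (5, "universally used in-house term"),
    (0, "other term"),  # assumed fallback label
]


def classify_by_frequency(count):
    """Step S105 sketch: pick the first rule whose minimum count is satisfied."""
    for minimum, label in CLASSIFICATION_RULES:
        if count >= minimum:
            return label
    raise ValueError("appearance count must be non-negative")


appearances = Counter({"NTT": 5})   # five earlier appearances, as assumed in the description
appearances["NTT"] += 1             # example sentence 1 makes it six
print(classify_by_frequency(appearances["NTT"]))  # universally used in-house term
```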

ステップＳ１０５の処理を実行した後、情報処理装置１００Ａは、用語登録部１２５の機能により、ステップＳ１０４の処理にて分類された単語情報としての音声データおよびテキストデータを、当該分類に従い登録用語一覧１１１へ登録する（ステップＳ１０６）。具体的に、ステップＳ１０６の処理では、「普遍的に使用される社内用語」の分類に分類された「ＮＴＴ」の音声データおよびテキストデータを、それぞれ対応付けて、登録用語一覧１１１における「普遍的に使用される社内用語」の分類として登録する。 After executing step S105, the information processing device 100A uses the term registration unit 125 to register the voice data and text data, as the word information classified in the process of step S104, in the registered term list 111 according to that classification (step S106). Specifically, in step S106, the voice data and text data of "NTT", which was classified as a "universally used in-house term", are associated with each other and registered in the registered term list 111 under the "universally used in-house term" classification.

ステップＳ１０６の処理を実行した後、情報処理装置１００Ａは、用語登録部１２５の機能により、登録用語一覧１１１へ登録された単語情報の内容に基づいて、第１登録用語一覧２１１の内容を更新させる更新指示を音声認識サーバ２００へ送信し（ステップＳ１０７）、用語登録処理を終了する。具体的に、ステップＳ１０７の処理では、登録用語一覧１１１における「普遍的に使用される社内用語」の分類として登録した「ＮＴＴ」の音声データおよびテキストデータを、更新指示とともに音声認識サーバ２００へ送信し、音声認識サーバ２００に記憶されている第１登録用語一覧２１１に、当該「ＮＴＴ」の音声データおよびテキストデータを追加登録させる。これにより、第１登録用語一覧２１１の内容が更新されることとなる。 After executing the process of step S106, the information processing device 100A uses the function of the term registration unit 125 to update the contents of the first registered term list 211 based on the contents of the word information registered in the registered term list 111. An update instruction is sent to the speech recognition server 200 (step S107), and the term registration process is ended. Specifically, in the process of step S107, the voice data and text data of "NTT" registered as a category of "universally used in-house terms" in the registered term list 111 are sent to the voice recognition server 200 along with an update instruction. Then, the voice data and text data of "NTT" are additionally registered in the first registered term list 211 stored in the voice recognition server 200. As a result, the contents of the first registered term list 211 will be updated.
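
The update instruction of step S107 carries the registered voice data and text data to the speech recognition server 200. The patent does not specify a message format, so the JSON layout, field names, and base64 encoding of the audio in the sketch below are all assumptions.

```python
import base64
import json


def build_update_instruction(text, audio, classification):
    """Step S107 sketch: serialize one dictionary entry as a hypothetical JSON update message."""
    message = {
        "type": "update_first_registered_term_list",  # assumed message type
        "entries": [
            {
                "text": text,
                "audio_base64": base64.b64encode(audio).decode("ascii"),
                "classification": classification,
            }
        ],
    }
    return json.dumps(message, ensure_ascii=False)


# Placeholder bytes stand in for the actual voice data of "NTT".
payload = build_update_instruction("NTT", b"\x00\x01\x02", "universally used in-house term")
print(payload)
```

On receipt, the server side would decode each entry and append it to the first registered term list 211.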

図４に戻り、音声認識サーバ２００の側では、情報処理装置１００Ａから更新指示を受信したことに基づいて、第１登録用語一覧２１１の内容を更新する。なお、図示は省略しているが、この後に、情報処理装置１００Ａのユーザが情報処理装置１００Ｂのユーザに対して発言した場合には、情報処理装置１００Ａの制御部１２０の機能により音声データに変換され、当該音声データが音声認識サーバ２００へ送信される。そして情報処理装置１００Ｂの側において用語登録処理が行われ、音声認識サーバ２００における第１登録用語一覧２１１の内容が更新される。このような処理が、当該会議や講演会などの会話が終了するまで繰り返し実行されることとなる。このように、会話毎に用語登録処理が行われて第１登録用語一覧２１１の内容が更新されるため、リアルタイムで音声認識用の辞書が更新されることとなり、音声認識用の辞書を好適に生成することができる。なお、この実施の形態では、２者間での会話を例としたが、３者以上でも同様である。また、このようにして生成された辞書は、公知の日本語入力ソフトにおける辞書にも活用可能である。 Returning to FIG. 4, the voice recognition server 200 updates the contents of the first registered term list 211 based on receiving the update instruction from the information processing device 100A. Although not shown, if the user of the information processing device 100A speaks to the user of the information processing device 100B after this, the function of the control unit 120 of the information processing device 100A converts it into voice data. The voice data is sent to the voice recognition server 200. Then, term registration processing is performed on the information processing device 100B side, and the contents of the first registered term list 211 in the speech recognition server 200 are updated. Such processing will be repeatedly executed until the conversation at the conference, lecture, etc. ends. In this way, term registration processing is performed for each conversation and the contents of the first registered term list 211 are updated, so the dictionary for speech recognition is updated in real time, and the dictionary for speech recognition can be suitably used. can be generated. In this embodiment, a conversation between two parties is taken as an example, but the same applies to a conversation between three or more parties. Furthermore, the dictionary generated in this manner can also be used as a dictionary in known Japanese input software.

（変形例） (Modified example)
なお、この発明は、上記実施の形態に限定されず、様々な変形及び応用が可能である。例えば、情報処理装置１００では、上記実施の形態で示した全ての技術的特徴を備えるものでなくてもよく、従来技術における少なくとも１つの課題を解決できるように、上記実施の形態で説明した一部の構成を備えたものであってもよい。また、下記の変形例それぞれについて、少なくとも一部を組み合わせてもよい。 Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible. For example, the information processing device 100 need not have all the technical features shown in the above embodiment, and may have only part of the configuration described in the above embodiment so long as at least one problem in the conventional technology can be solved. Furthermore, at least some of the following modified examples may be combined.

上記実施の形態では、図５のステップＳ１０７の処理が用語登録処理の中で実行される例を示したが、例えば、会議の終了や講演会の終了などといった一連の会話が終了したタイミングで一度行われるようにしてもよい。例えば、会話が終了したタイミングでユーザによる入出力部１３０への操作が行われることで図５に示すステップＳ１０７の処理が実行されるようにしてもよい。また、例えば、「終了」など、予め定められた特定の音声（複数設定されていてよい）を受信した場合に、会話の終了と判定して図５のステップＳ１０７の処理を実行するようにしてもよい。また、これとは異なり、ユーザにより設定された数の単語情報が登録用語一覧１１１へ登録される毎に図５のステップＳ１０７の処理が実行されるようにしてもよい。これらによれば、第１登録用語一覧２１１の更新処理に対する負荷を軽減することができる。 In the above embodiment, step S107 of FIG. 5 is executed as part of the term registration process, but it may instead be executed once at the point when a series of conversations ends, such as the end of a conference or a lecture. For example, step S107 shown in FIG. 5 may be executed when the user operates the input/output unit 130 at the end of the conversation. Alternatively, when a predetermined specific utterance such as "end" (more than one may be set) is received, the conversation may be judged to have ended and step S107 of FIG. 5 executed. As yet another alternative, step S107 of FIG. 5 may be executed each time a user-set number of pieces of word information has been registered in the registered term list 111. These variations reduce the load of updating the first registered term list 211.
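
The deferred-update variations above can be sketched as a small batching helper. The class, its method names, and the way the variations are combined in one object are illustrative assumptions; in the description they are presented as separate alternatives.

```python
class UpdateBatcher:
    """Sketch: defer step S107 until a batch is full or the conversation ends."""

    def __init__(self, send, batch_size=3, end_phrases=("end",)):
        self._send = send                  # callable that performs step S107 for a list of entries
        self._batch_size = batch_size      # user-set number of registrations per update
        self._end_phrases = {p.lower() for p in end_phrases}
        self._pending = []

    def register(self, entry):
        """Called after step S106; flushes once the user-set count is reached."""
        self._pending.append(entry)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def on_utterance(self, text):
        """Treat a predetermined phrase such as "end" as the end of the conversation."""
        if text.strip().lower() in self._end_phrases:
            self.flush()

    def flush(self):
        """Send all pending entries in one update instruction, if there are any."""
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()
```

A user operation on the input/output unit 130 at the end of the conversation would map to calling `flush()` directly.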

また、例えば「ＰｏＣ」という単語について、「ピーオーシー」と読むユーザや「ポック」と読むユーザなど、一の単語について、ユーザ毎に読み方が異なるような場合がある。このような単語について、第２登録用語一覧２１２として、一のテキストデータに対応して複数の音声データを予め登録しておき、図５のステップＳ１０６では、一のテキストデータに対応して複数の音声データを登録用語一覧１１１へ登録すればよい。そして、ステップＳ１０７の処理では、当該内容にて第１登録用語一覧２１１を更新させる指示を行えばよい。これによれば、一の単語について、ユーザ毎に読み方が異なるような場合についても、音声認識用の辞書を好適に生成することができる。 Furthermore, for example, the word "PoC" may be pronounced differently depending on the user, such as one user who pronounces it as "P-OC" and another who pronounces it as "Pock." Regarding such words, a plurality of audio data are registered in advance as the second registered term list 212 in correspondence with one text data, and in step S106 of FIG. The audio data may be registered in the registered term list 111. Then, in the process of step S107, an instruction to update the first registered term list 211 with the content may be issued. According to this, it is possible to suitably generate a dictionary for speech recognition even when a word is pronounced differently depending on the user.
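
Registering several pronunciations for one piece of text data amounts to a one-to-many mapping from text to voice data. The sketch below assumes voice data is held as raw bytes; the class name and the placeholder byte strings are hypothetical.

```python
from collections import defaultdict


class PronunciationDictionary:
    """Sketch: one text entry associated with any number of voice data items."""

    def __init__(self):
        self._entries = defaultdict(list)

    def register(self, text, audio):
        """Add a pronunciation for `text`, ignoring exact duplicates."""
        if audio not in self._entries[text]:
            self._entries[text].append(audio)

    def pronunciations(self, text):
        """Return a copy of the voice data registered for `text` (empty if none)."""
        return list(self._entries.get(text, []))


dictionary = PronunciationDictionary()
dictionary.register("PoC", b"<audio: spelled out letter by letter>")   # placeholder bytes
dictionary.register("PoC", b"<audio: read as a single word>")          # placeholder bytes
print(len(dictionary.pronunciations("PoC")))  # 2
```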

また、上記実施の形態における音声認識サーバ２００の構成を、情報処理装置１００が備えていてもよい。この場合、図５のステップＳ１０７において、自身の記憶部１１０に記憶された第１登録用語一覧２１１を更新し、他の情報処理装置１００に記憶された第１登録用語一覧２１１と同期をとるようにすればよい。 The information processing device 100 may itself include the configuration of the speech recognition server 200 of the above embodiment. In this case, in step S107 of FIG. 5, the device updates the first registered term list 211 stored in its own storage unit 110 and synchronizes it with the first registered term list 211 stored in the other information processing device 100.

なお、上述の機能を、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）とアプリケーションとの分担、またはＯＳとアプリケーションとの協同により実現する場合等には、ＯＳ以外の部分のみを媒体に格納してもよい。 In addition, when the above-mentioned functions are realized by sharing between an OS (Operating System) and an application, or by cooperation between the OS and an application, only the parts other than the OS may be stored on the medium.

また、搬送波にプログラムを重畳し、通信ネットワークを介して配信することも可能である。例えば、通信ネットワーク上の掲示板（ＢＢＳ、ＢｕｌｌｅｔｉｎＢｏａｒｄＳｙｓｔｅｍ）に当該プログラムを掲示し、ネットワークを介して当該プログラムを配信してもよい。そして、これらのプログラムを起動し、オペレーティングシステムの制御下で、他のアプリケーションプログラムと同様に実行することにより、上述の処理を実行できるように構成してもよい。 It is also possible to superimpose a program on a carrier wave and distribute it via a communication network. For example, the program may be posted on a bulletin board (BBS, Bulletin Board System) on a communication network and distributed via the network. The above-described process may be executed by starting these programs and executing them under the control of the operating system in the same way as other application programs.

１情報処理システム、１００、１００Ａ、１００Ｂ情報処理装置、１１０、２１０記憶部、１１１登録用語一覧、１１２登録分類、１２０、２２０制御部、１２１確信度比較部、１２２形態素抽出部、１２３品詞推定部、１２４用語分類部、１２５用語登録部、１３０、２３０入出力部、１４０、２４０通信部、２００音声認識サーバ、２１１第１登録用語一覧、２１２第２登録用語一覧、２２１音声認識処理部、２２２確信度算出部、５１０ネットワーク 1 Information processing system, 100, 100A, 100B Information processing device, 110, 210 Storage unit, 111 List of registered terms, 112 Registered classification, 120, 220 Control unit, 121 Confidence comparison unit, 122 Morphological extraction unit, 123 Part of speech estimation unit , 124 term classification unit, 125 term registration unit, 130, 230 input/output unit, 140, 240 communication unit, 200 speech recognition server, 211 first registered term list, 212 second registered term list, 221 speech recognition processing unit, 222 Confidence calculation unit, 510 network

Claims

An information processing device comprising:
speech recognition result receiving means for receiving a first speech recognition result based on a first dictionary and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
certainty factor receiving means for receiving a first certainty factor for the first speech recognition result, calculated by a predetermined calculation, and a second certainty factor for the second speech recognition result, calculated by the same calculation;
word information storage means for storing, when a comparison of the first certainty factor and the second certainty factor shows that the difference between them is larger than a predetermined value, word information included in the second speech recognition result as an update list for the first dictionary; and
transmitting means for transmitting an update instruction to add the word information included in the update list stored in the word information storage means to the first dictionary.

The information processing device according to claim 1, further comprising extraction means for extracting, in accordance with a predetermined criterion, word information to be stored from the second speech recognition result when the difference between the certainty factors is large,
wherein the word information storage means stores the word information extracted by the extraction means.

The information processing device according to claim 2, further comprising classification means for classifying the word information extracted by the extraction means into one of a plurality of classifications predetermined according to appearance frequency,
wherein the word information storage means stores the word information classified by the classification means for each classification.

The information processing device according to any one of claims 1 to 3, wherein
the word information includes audio information and text information,
the information processing device further comprises first dictionary updating means for updating the first dictionary by adding to the first dictionary the word information stored as the update list by the word information storage means, and
the second dictionary is newly stored by a user operation each time the first dictionary is updated.

An information processing method comprising:
a speech recognition result receiving step of receiving a first speech recognition result based on a first dictionary and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
a certainty factor receiving step of receiving a first certainty factor for the first speech recognition result, calculated by a predetermined calculation, and a second certainty factor for the second speech recognition result, calculated by the same calculation;
a word information storage step of storing, when a comparison of the first certainty factor and the second certainty factor shows that the difference between them is larger than a predetermined value, word information included in the second speech recognition result as an update list for the first dictionary; and
a transmitting step of transmitting an update instruction to add the word information included in the update list stored in the word information storage step to the first dictionary.

A program for causing a computer to function as:
speech recognition result receiving means for receiving a first speech recognition result based on a first dictionary and a second speech recognition result based on a second dictionary that differs from the first dictionary and includes word information generated by a user;
certainty factor receiving means for receiving a first certainty factor for the first speech recognition result, calculated by a predetermined calculation, and a second certainty factor for the second speech recognition result, calculated by the same calculation;
word information storage means for storing, when a comparison of the first certainty factor and the second certainty factor shows that the difference between them is larger than a predetermined value, word information included in the second speech recognition result as an update list for the first dictionary; and
transmitting means for transmitting an update instruction to add the word information included in the update list stored in the word information storage means to the first dictionary.