JP4749438B2

JP4749438B2 - Phonetic character conversion device, phonetic character conversion method, and phonetic character conversion program

Info

Publication number: JP4749438B2
Application number: JP2008085112A
Authority: JP
Inventors: 信行小林; 浩桑原; 努森垣
Original assignee: Mitsubishi Electric Information Systems Corp
Current assignee: Mitsubishi Electric Information Systems Corp
Priority date: 2008-03-28
Filing date: 2008-03-28
Publication date: 2011-08-17
Anticipated expiration: 2028-03-28
Also published as: JP2009237387A

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of voice recognition.
SOLUTION: A voice conversion part 121 converts two pieces of voice information, input by two persons and representing the same information, into two pieces of character information. A character information comparing part 122 compares the two pieces of character information and extracts a mismatched part. A mismatch part deciding part 123 decides which character to adopt for the mismatched part according to which of the two characters was correctly converted from voice information to character information. A character information generating part 124 generates the character information corresponding to the two pieces of voice information by replacing the mismatched part of one of the two pieces of character information with the character decided to be adopted.
COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates, for example, to a technique for converting voice information into character information.

Conventionally, there are phonetic character conversion devices (voice recognition devices) that, when voice information is input, convert it into character information and output the result. There are also devices that identify a user by comparing the pattern of the input voice information with a feature pattern of the user's voice (see Patent Document 1).
There are also examples in which such a phonetic character conversion device or user-identifying device is applied to a call center system (see Patent Documents 2 and 3).
JP 2002-279245 A; JP 2006-126966 A; JP 2002-9955 A

Conventional speech recognition technology has low recognition accuracy. It is therefore difficult, for example when creating an electronic document such as a contract, to enter each item of information (personal information and the like) by voice and to create the document by converting the input voice information into character information.
An object of the present invention is, for example, to increase the accuracy of voice recognition. Another object is, for example, to create an electronic document such as a contract from voice information input by an operator and a user at a call center or the like.

The phonetic character conversion device according to the present invention includes, for example:
a first voice information input unit that receives first voice information input from a first terminal and stores it in a storage device;
a second voice information input unit that receives second voice information input from a second terminal and stores it in the storage device;
a voice conversion unit that converts, by a processing device, the first voice information received by the first voice information input unit into first character information and the second voice information received by the second voice information input unit into second character information;
a character information comparison unit that compares the first character information and the second character information produced by the voice conversion unit and extracts, by the processing device, a mismatched portion;
a mismatched portion determination unit that decides, by the processing device and by a predetermined method, whether the character information of the mismatched portion extracted by the character information comparison unit is to be taken from the first character information or from the second character information; and
a character information generation unit that generates character information, by the processing device, by replacing the mismatched portion of the first character information or the second character information with the character information decided by the mismatched portion determination unit.
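The comparison and generation steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how the two transcripts are aligned, so the use of Python's `difflib.SequenceMatcher` is an assumption, and the function names `extract_mismatches` and `generate` as well as the `decide` callback are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def extract_mismatches(first: str, second: str) -> List[Tuple[int, int, int, int]]:
    """Return (start1, end1, start2, end2) spans where the two transcripts differ."""
    # Assumption: a generic sequence alignment stands in for the comparison unit.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [(i1, i2, j1, j2)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def generate(first: str, second: str, decide: Callable[[str, str], str]) -> str:
    """Build one transcript, letting `decide` pick the text for each mismatched span."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both transcripts agree here, so either side can be copied.
            parts.append(first[i1:i2])
        else:
            # Mismatched portion: delegate the choice to the decision rule.
            parts.append(decide(first[i1:i2], second[j1:j2]))
    return "".join(parts)
```

Passing the decision rule in as a callback keeps the "predetermined method" open, which mirrors the way the text leaves that method unspecified at this point.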

The mismatched portion determination unit compares a first accuracy, which indicates how likely the conversion from the first voice information to the first character information is to be correct, with a second accuracy, which indicates how likely the conversion from the second voice information to the second character information is to be correct. It sets the character information of the mismatched portion to the first character information when the first accuracy is higher and to the second character information when the second accuracy is higher.

The mismatched portion determination unit decides, for each character contained in the first character information and the second character information, whether the character information of the mismatched portion is to be taken from the first character information or from the second character information.
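A per-character decision based on the two accuracies might look like the sketch below. It assumes that the recognizer reports a confidence value for every character and that the two transcripts are already aligned to the same length; neither assumption comes from the source, and the name `merge_by_confidence` and the tie-breaking rule (the first transcript wins a tie) are hypothetical.

```python
from typing import List, Tuple

# One recognized character together with its recognition confidence (0.0 to 1.0).
ScoredChar = Tuple[str, float]


def merge_by_confidence(first: List[ScoredChar], second: List[ScoredChar]) -> str:
    """Choose, character by character, the candidate with the higher confidence."""
    if len(first) != len(second):
        # Assumption: alignment of unequal transcripts is handled elsewhere.
        raise ValueError("transcripts must be aligned to the same length")
    merged = []
    for (char1, conf1), (char2, conf2) in zip(first, second):
        if char1 == char2 or conf1 >= conf2:
            # Matching characters, or the first conversion is at least as reliable.
            merged.append(char1)
        else:
            # The second conversion is more reliable for this character.
            merged.append(char2)
    return "".join(merged)


operator = [("1", 0.9), ("3", 0.4), ("0", 0.8)]
customer = [("1", 0.8), ("8", 0.7), ("0", 0.9)]
merged_text = merge_by_confidence(operator, customer)  # "180"
```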

The phonetic character conversion device further includes a language model storage unit that stores in the storage device, for each user of the first terminal and the second terminal, a language model for converting voice information into character information.
The voice conversion unit converts the first voice information into the first character information based on a first language model that the language model storage unit stores for a first user who uses the first terminal, and converts the second voice information into the second character information based on a second language model that the language model storage unit stores for a second user who uses the second terminal.

The phonetic character conversion device further includes a language model update unit that updates, by the processing device, the language model of the first user who uses the first terminal based on the character information generated by the character information generation unit and the first voice information, and updates the language model of the second user who uses the second terminal based on the character information generated by the character information generation unit and the second voice information.

The phonetic character conversion method according to the present invention includes, for example:
a first voice information input step in which a processing device receives first voice information input from a first terminal;
a second voice information input step in which the processing device receives second voice information input from a second terminal;
a voice conversion step in which the processing device converts the first voice information received in the first voice information input step into first character information and the second voice information received in the second voice information input step into second character information;
a character information comparison step in which the processing device compares the first character information and the second character information produced in the voice conversion step and extracts a mismatched portion;
a mismatched portion determination step in which the processing device decides, by a predetermined method, whether the character information of the mismatched portion extracted in the character information comparison step is to be taken from the first character information or from the second character information; and
a character information generation step in which the processing device generates character information by replacing the mismatched portion of the first character information or the second character information with the character information decided in the mismatched portion determination step.

The phonetic character conversion program according to the present invention causes a computer to execute, for example:
a first voice information input process that receives first voice information input from a first terminal;
a second voice information input process that receives second voice information input from a second terminal;
a voice conversion process that converts the first voice information received in the first voice information input process into first character information and the second voice information received in the second voice information input process into second character information;
a character information comparison process that compares the first character information and the second character information produced in the voice conversion process and extracts a mismatched portion;
a mismatched portion determination process that decides, by a predetermined method, whether the character information of the mismatched portion extracted in the character information comparison process is to be taken from the first character information or from the second character information; and
a character information generation process that generates character information by replacing the mismatched portion of the first character information or the second character information with the character information decided in the mismatched portion determination process.

The phonetic character conversion device according to the present invention achieves high conversion accuracy because it converts two pieces of voice information, the first voice information and the second voice information, into a single piece of character information.

Embodiment 1.
This embodiment describes a phonetic character conversion device 100 that has dictionary information for each input item.

FIG. 1 is a conceptual diagram outlining the functions of the phonetic character conversion device 100 according to this embodiment.
The user places the cursor on a given input item (input field) of the application 10 and enters information by voice through an input device such as a microphone. For example, the user places the cursor on the amount input field and says “100,000”. In this case, the phonetic character conversion device 100 acquires from the application 10 the attribute information of the input item where the cursor is located. Here, it acquires “numerical attribute” as the attribute information of the amount input field. The phonetic character conversion device 100 also acquires the voice information that the user entered through the input device. Here, it acquires the voice information “juuman”, which stands for “100,000”. The phonetic character conversion device 100 then selects the dictionary to use based on the attribute information acquired from the application 10, converts the acquired voice information into character information using the selected dictionary, and returns the result to the application 10 as the recognition result. Here, it uses the numerical value recognition dictionary corresponding to “numerical attribute” to convert the voice information “juuman” into the character information “100,000” and returns it to the application 10. The application 10 then sets the returned character information “100,000” in the amount input field.
In this way, the phonetic character conversion device 100 has dictionary information for each input item and converts voice information into character information using the dictionary information suited to the item being entered. In general, the fewer words registered in the dictionary information, the higher the hit rate (the probability that voice information is converted into the intended character information). Holding dictionary information per input item reduces the number of words registered in the dictionary in use and allows only appropriate words to be registered. The phonetic character conversion device 100 can therefore raise the hit rate, that is, the voice recognition accuracy.

FIG. 2 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to this embodiment.
The phonetic character conversion device 100 includes a voice information input unit 110, a voice recognition unit 120, an attribute information acquisition unit 130, a character information acquisition unit 140, and an attribute-specific dictionary information storage unit 150.
The voice information input unit 110 receives, by the processing device, voice information uttered by a user of the application 10 and stores it in the storage device.
The voice recognition unit 120 converts the voice information received by the voice information input unit 110 and generates character information by the processing device. Following a language model, the voice recognition unit 120 converts voice information into character information that represents its sound. For example, the numerical value “10” can be read as “juu” or as “ichizero”. When the user enters “10” by pronouncing it “juu”, the voice recognition unit 120 converts that voice information into the character information “juu”. When the user enters “10” by pronouncing it “ichizero”, the voice recognition unit 120 converts that voice information into the character information “ichizero”.
The attribute information acquisition unit 130 acquires from the application 10, by the processing device, attribute information indicating the attribute of the input item where the cursor is currently located, among the input items that the application 10 displays on a given terminal, and stores it in the storage device. That is, the attribute information acquisition unit 130 acquires the attribute information of the input item that is currently being entered. An attribute is information indicating the nature of the information entered in the input item, such as a numerical value, an address, or a person's name.
The character information acquisition unit 140 converts, by the processing device, the character information generated by the voice recognition unit 120 into other character information and returns it to the application 10 as the recognition result. The conversion is based on the dictionary information that corresponds to the attribute indicated by the attribute information acquired by the attribute information acquisition unit 130, among the dictionary information stored in the attribute-specific dictionary information storage unit 150 described later. In the example above, the character information acquisition unit 140 converts the character information “juu” or “ichizero” into the character information “10” and returns “10” to the application 10.
The attribute-specific dictionary information storage unit 150 stores in the storage device, for each attribute of an input item, dictionary information that associates first character information with second character information. In FIG. 2, for example, the attribute-specific dictionary information storage unit 150 stores a numerical value recognition dictionary for input items that take numerical values, an address recognition dictionary for input items that take addresses, and a personal name recognition dictionary for input items that take personal names. Here, the first character information is the character information after conversion, which the character information acquisition unit 140 returns to the application 10 as the recognition result. The second character information is the character information that is compared with the character information generated by the voice recognition unit 120, and it corresponds to the reading of the first character information. In the example above, the first character information is “10” and the second character information is “juu” and “ichizero”.

FIG. 3 is a diagram showing a configuration of the phonetic character conversion device 100 that differs from that in FIG. 2.
The phonetic character conversion device 100 shown in FIG. 3 lacks one of the functions of the device shown in FIG. 2, namely the function of converting voice information into character information representing its sound according to a language model. A voice recognition device 101 provides that function instead. In other words, the function is moved out of the phonetic character conversion device 100 and placed in the voice recognition device 101, which therefore contains the voice information input unit 110 and the voice recognition unit 120. An information acquisition unit 160 of the phonetic character conversion device 100 receives, by the processing device, the character information that the voice recognition unit 120 of the voice recognition device 101 generated by converting the voice information, and stores it in the storage device. In all other respects the configuration is the same as that of the phonetic character conversion device 100 shown in FIG. 2.

FIG. 4 is a diagram showing an example of the screen information displayed by the application 10 and of the dictionary information stored in the attribute-specific dictionary information storage unit 150.
The screen information has three input items: an amount input field, an address input field, and a name input field. The attribute-specific dictionary information storage unit 150 stores three sets of dictionary information, one for each of the three input items: a numerical value recognition dictionary, an address recognition dictionary, and a personal name recognition dictionary.
Each dictionary stores a plurality of pieces of first character information, and for each piece of first character information it stores one or more pieces of second character information. In the numerical value recognition dictionary, for example, the second character information “ichi” is stored for the first character information “1”, and the second character information “juu”, “ichizero”, and so on is stored for the first character information “10”.

For example, suppose the user selects the amount input field on the terminal, that is, places the cursor on the amount input field. In this case, the attribute information acquisition unit 130 acquires the numerical attribute as the attribute information indicating the attribute of the amount input field. Suppose that, with the amount input field selected, the user enters “10” by saying “juu”. The voice information input unit 110 receives the voice information “juu”, and the voice recognition unit 120 converts it into the character information “juu”. The character information acquisition unit 140 first searches the dictionary information stored in the attribute-specific dictionary information storage unit 150 and selects the dictionary information that corresponds to the attribute information acquired by the attribute information acquisition unit 130. Because a numerical value is being entered here, it selects the numerical value recognition dictionary. Next, the character information acquisition unit 140 searches the selected dictionary information for second character information that matches the character information generated by the voice recognition unit 120. That is, it searches the second character information of the numerical value recognition dictionary for “juu”. The character information acquisition unit 140 then acquires the first character information that corresponds to the second character information it found, namely the first character information “10” corresponding to the second character information “juu”. Finally, the character information acquisition unit 140 returns the acquired first character information, “10”, to the application 10 as the recognition result, and the application 10 enters the returned “10” in the amount input field.
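The lookup just described can be sketched as a two-level mapping: attribute to dictionary, then reading to the value returned to the application. The numeric entries follow the examples in the text, with the readings romanized as “ichi”, “juu”, and “ichizero”. The attribute keys, the address reading, and the function name `recognize` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# attribute -> {first character information: [second character information, ...]}
DICTIONARIES: Dict[str, Dict[str, List[str]]] = {
    "numeric": {
        "1": ["ichi"],
        "10": ["juu", "ichizero"],
    },
    "address": {
        # Hypothetical reading, not taken from the source.
        "Yokohama City": ["yokohamashi"],
    },
}


def recognize(attribute: str, reading: str) -> Optional[str]:
    """Map a recognized reading to the text to return for the selected input item."""
    # Step 1: select the dictionary that matches the input item's attribute.
    dictionary = DICTIONARIES.get(attribute, {})
    # Step 2: find the entry whose readings contain the recognized reading.
    for first_info, readings in dictionary.items():
        if reading in readings:
            # Step 3: return the associated first character information.
            return first_info
    return None  # no entry matched


amount = recognize("numeric", "juu")  # "10"
```

Because only the dictionary for the active attribute is searched, a reading that exists in another dictionary cannot be returned by mistake, which is the hit-rate argument the text makes.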

In FIG. 4, the pieces of first character information in the address recognition dictionary are associated with one another in a hierarchical structure. For example, “Tokyo” and “Shinagawa-ku” are associated in a hierarchy in which “Tokyo” is the parent and “Shinagawa-ku” is the child. Similarly, “Kanagawa Prefecture”, “Yokohama City”, and “Kamakura City” are associated in a hierarchy in which “Kanagawa Prefecture” is the parent and “Yokohama City” and “Kamakura City” are the children.
The reason is that an address such as “Yokohama City, Kanagawa Prefecture” may be entered in two ways: the user may say the prefecture name “Kanagawa Prefecture” and then the city name “Yokohama City”, or may say only the city name “Yokohama City” without the prefecture name. When the prefecture name “Kanagawa Prefecture” is entered followed by the city name “Yokohama City”, the device follows the input as given: it first acquires the character information “Kanagawa Prefecture” and returns it as recognition information, then acquires the character information “Yokohama City” and returns it as recognition information. When the city name “Yokohama City” is entered without the prefecture name, the device, on obtaining the character information “Yokohama City”, also acquires “Kanagawa Prefecture”, the character information of its parent level, and returns the character information “Yokohama City, Kanagawa Prefecture” as recognition information. In other words, when a child's first character information is acquired without the parent's first character information having been acquired, the parent's first character information is acquired together with the child's.
In the description above, a plurality of pieces of first character information are associated through a hierarchical structure, and when a child's first character information is acquired the parent's first character information is acquired with it. The association is not limited to a hierarchy, however. For example, two pieces of first character information may be associated as equals so that when one is acquired the other is acquired as well.
That is, the attribute-specific dictionary information storage unit 150 stores one piece of first character information (item one) in association with a different piece of first character information (item two). When the character information acquisition unit 140 acquires item two as first character information, it also acquires the associated item one as first character information and returns both item one and item two as recognition information.
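The parent-completion rule for the address dictionary can be sketched with a simple child-to-parent map. The place names come from the text; the `PARENT` table, the function `complete_address`, and the `already_entered` parameter are hypothetical, and the output order (child, then parent) simply follows the English rendering used in this description.

```python
from typing import Dict, Iterable

# child -> parent, following the hierarchy described for the address dictionary
PARENT: Dict[str, str] = {
    "Shinagawa-ku": "Tokyo",
    "Yokohama City": "Kanagawa Prefecture",
    "Kamakura City": "Kanagawa Prefecture",
}


def complete_address(recognized: str, already_entered: Iterable[str] = ()) -> str:
    """Return the recognized name, adding its parent if the parent was not entered."""
    parent = PARENT.get(recognized)
    if parent is not None and parent not in set(already_entered):
        # The child was spoken on its own, so supply the parent as well.
        return f"{recognized}, {parent}"
    # Either the entry has no parent or the parent was already entered.
    return recognized


city_only = complete_address("Yokohama City")
after_prefecture = complete_address("Yokohama City", ["Kanagawa Prefecture"])
```

Here `city_only` carries the prefecture along with the city, while `after_prefecture` is the city alone because the prefecture was already returned in the preceding step.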

図５は、この実施の形態に係る音声文字変換装置１００の実装例（音声文字変換プログラムの一例）の説明図である。
図５では、ＨＴＭＬを拡張したプログラムコードによる実装例と、そのプログラムコードにより表示されるシート（画面情報）とを示す。
FIG. 5 is an explanatory diagram of an implementation example (an example of a phonetic character conversion program) of the phonetic character conversion device 100 according to this embodiment.
FIG. 5 shows an implementation example that uses program code written in an extended form of HTML, together with the sheet (screen information) displayed by that program code.
In the implementation example using extended HTML, tags for displaying screen information (for example, SELECT and INPUT tags) are mixed with tags that convert character information (VOICE tags). That is, in the implementation example shown in FIG. 5, the phonetic character conversion device 100 is embedded in the application 10. Among the tags displayed as screen information, each tag used for entering information is linked to a conversion tag by its tag name (input tag 1, input tag 2, input tag 3). For example, a SELECT tag named "input tag 1" is associated with the VOICE tag that has the same name.
That is, when input tag 1 is selected on the sheet, the VOICE tag named input tag 1 is executed in the program code. The VOICE tag describes the dictionary information for that input item and converts the character information generated from the input voice information (for example, "juu") into recognition information (for example, "10"). In the program code, the character information generated from voice information is written in the Latin alphabet, but it may be written in katakana as in the description above.
For example, the VOICE tag for input tag 1 registers "10,000", "100,000", and "200,000" as first character information, and second character information is registered for each of them. For the first character information "100,000", the second character information is "zyuumann", "ichizero", and "tou". In addition, the general-purpose dictionary "suu" is registered as second character information. The general-purpose dictionary "suu" collects common readings of numeric values and is an external dictionary stored in a separate file such as an XML file. If display tags and all dictionary information were mixed in the program code, the code could become cluttered. Therefore, the second character information that the program code specifically needs is written directly in the code, and other general second character information is read from the external dictionary. This keeps the program code from becoming cluttered, and because the specifically needed second character information is written directly in the code, processing is also fast.
The VOICE tag corresponding to input tag 2 has a hierarchical relationship, as described with reference to FIG. 4. That is, if a city name (for example, "Yokohama City") is entered without a prefecture name, "Yokohama City, Kanagawa Prefecture" is returned as the recognition information.
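The lookup described for the VOICE tags can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the data layout, the function names `lookup` and `complete_with_parent`, and the readings marked "assumed" are hypothetical. Only the readings for "100,000", the "juu" to "10" conversion, and the Yokohama example come from the text.

```python
# Illustrative layout only; the patent does not specify a data structure.
# Inline dictionary for "input tag 1": first character information
# (recognition result) -> second character information (readings).
INLINE_DICT = {
    "100,000": ["zyuumann", "ichizero", "tou"],  # readings given in the text
    "10,000": ["ichimann"],                      # assumed reading
    "200,000": ["nizyuumann"],                   # assumed reading
}

# Stand-in for the external general-purpose dictionary "suu", which the
# text says is stored in a separate file such as an XML file.
EXTERNAL_SUU = {"zyuu": "10"}

# Assumed parent links for hierarchical entries (city -> prefecture).
PARENT = {"Yokohama City": "Kanagawa Prefecture"}


def lookup(reading, inline=INLINE_DICT, external=EXTERNAL_SUU):
    """Return the recognition result for a reading, or None if unregistered."""
    # Entries written directly in the program code are checked first.
    for result, readings in inline.items():
        if reading in readings:
            return result
    # Fall back to the general readings held in the external dictionary.
    return external.get(reading)


def complete_with_parent(name, parent=PARENT):
    """Add the parent entry when the speaker omitted it."""
    upper = parent.get(name)
    return f"{name}, {upper}" if upper else name


print(lookup("ichizero"))
print(complete_with_parent("Yokohama City"))
```

Checking the inline entries before the external dictionary mirrors the stated design goal: frequently needed readings stay in the code, and everything else is read from outside.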

FIG. 6 is a diagram showing an implementation example using JavaScript (registered trademark) or AJAX.
As in the implementation example using extended HTML, the implementation example using JavaScript (registered trademark) or AJAX mixes code for displaying the screen information shown in FIG. 4 (input type ...) with functions that convert character information (function inputtag1 and so on). That is, when an input item is selected in the screen information, the function corresponding to the selected input item is called, and that function performs the same processing as the VOICE tag described above.
For example, when input tag 1 is selected, the inputtag1 function is executed. The dictionary information is written directly in the inputtag1 function (the dictionary array is shown in abbreviated form). When input tag 2 is selected, the inputtag2 function is executed; it calls an external dictionary described in an external file such as an XML file. When input tag 3 is selected, the inputtag3 function is executed; it calls another program to perform the character information conversion process of the character information acquisition unit 140 itself.
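The three variants described for FIG. 6 (inline dictionary, external dictionary, delegation to another program) can be illustrated with a small dispatch table. The sketch below is written in Python rather than JavaScript; only the function names inputtag1 to inputtag3 follow the text. The dictionary contents, `load_external_dictionary`, and `external_converter` are hypothetical stand-ins.

```python
def inputtag1(reading):
    """Variant 1: the dictionary is written directly in the function."""
    inline = {"zyuumann": "100,000"}
    return inline.get(reading)


def load_external_dictionary():
    """Stand-in for reading a dictionary from an external file such as XML."""
    return {"yokohamashi": "Yokohama City"}


def inputtag2(reading):
    """Variant 2: the function consults an external dictionary."""
    return load_external_dictionary().get(reading)


def external_converter(reading):
    """Stand-in for a separate program that performs the conversion."""
    return f"converted:{reading}"


def inputtag3(reading):
    """Variant 3: the conversion itself is delegated to another program."""
    return external_converter(reading)


# Selecting an input item calls the function registered under its name.
HANDLERS = {
    "input tag 1": inputtag1,
    "input tag 2": inputtag2,
    "input tag 3": inputtag3,
}


def on_item_selected(item_name, reading):
    return HANDLERS[item_name](reading)
```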

Also, for example, by associating HTML or XML tag names with the VOICE tags shown in FIG. 5 in advance, program code that incorporates the character information conversion tags can be generated simply by creating the screen information in the usual way.
For example, the tag name "JUUSHO" is set in advance to be associated with a VOICE tag that has an "address recognition dictionary". Then, when an address input field is created, simply giving it the tag name "JUUSHO" automatically generates program code associated with the VOICE tag that has the "address recognition dictionary".
For a selection-type tag such as a SELECT tag, a dictionary may be generated whose recognition targets (first character information) are the words registered as selection options. The second character information for each first character information may then be obtained, for example, by searching a general dictionary with the first character information as the key. If the recognition target cannot be determined, a general-purpose dictionary may be set instead.
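As a rough sketch of this automatic association, the following assumes a preset table from tag names to dictionary names and a general dictionary keyed by first character information. The names `PRESET_BY_TAG_NAME`, `dictionary_for_tag`, and `dictionary_from_options`, and the sample readings, are illustrative assumptions. Only the "JUUSHO" to "address recognition dictionary" pairing is taken from the text.

```python
# Preset association between tag names and dictionaries (from the text:
# "JUUSHO" -> "address recognition dictionary").
PRESET_BY_TAG_NAME = {"JUUSHO": "address recognition dictionary"}
GENERAL_DICTIONARY_NAME = "general-purpose dictionary"

# Assumed general dictionary: first character information -> readings.
GENERAL_READINGS = {
    "100,000": ["zyuumann", "ichizero", "tou"],
    "10,000": ["ichimann"],
}


def dictionary_for_tag(tag_name):
    """Pick the dictionary for a tag name; fall back to the general one."""
    return PRESET_BY_TAG_NAME.get(tag_name, GENERAL_DICTIONARY_NAME)


def dictionary_from_options(options, general=GENERAL_READINGS):
    """Build a dictionary whose recognition targets are the SELECT options.

    Readings are looked up in the general dictionary with the option as
    the key; options without a known reading get an empty list.
    """
    return {option: list(general.get(option, [])) for option in options}
```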

As described above, the phonetic character conversion device 100 according to this embodiment holds dictionary information for each input item, which raises voice recognition accuracy.
In addition, because the words (first character information) registered in the dictionary information are associated with one another, necessary input information can be filled in even when part of the input is omitted.
Furthermore, embedding dictionary information in the document display program speeds up the conversion process, while placing words (second character information) with a low probability of use in an external dictionary keeps the program code from becoming complicated.

Embodiment 2.
This embodiment describes a method of creating an electronic document by applying the phonetic character conversion device 100 according to Embodiment 1 and acquiring the conversation between a user and an operator as voice information.

FIG. 7 is a conceptual diagram outlining the functions of the phonetic character conversion device 100 according to this embodiment.
The user and the operator converse, for example, by telephone. The phonetic character conversion device 100 acquires the conversation between the user and the operator as voice information and creates an electronic document. When the operator says "Please tell me ~", the phonetic character conversion device 100 moves the cursor to the input item corresponding to "~". For example, when the operator says "Please tell me your address", it moves the cursor to the "address" input field. Then, as described in Embodiment 1, the phonetic character conversion device 100 acquires the attribute information of the input item on which the cursor is placed and switches the dictionary information to be used. When the user answers "It is XX" in response to the operator's "Please tell me ~", the phonetic character conversion device 100 recognizes "XX" and enters it in the input field on which the cursor is placed. Repeating this for all items creates the electronic document.
In this way, the phonetic character conversion device 100 recognizes voice information input by the operator to switch input items, and recognizes voice information input by the user to fill in the input item. The user and the operator can therefore create an electronic document simply by conversing, without terminal operations such as moving the cursor.

FIG. 8 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to this embodiment.
In addition to the functions of the phonetic character conversion device 100 according to Embodiment 1, the phonetic character conversion device 100 according to this embodiment includes an item information acquisition unit 170 and an item identification dictionary information storage unit 180. The voice information input unit 110 includes a first voice information input unit 111 and a second voice information input unit 112.
The first voice information input unit 111 receives, by means of the processing device, the voice information output by the operator (first voice information) via the operator terminal 11 and stores it in the storage device.
The second voice information input unit 112 receives, by means of the processing device, the voice information output by the user (second voice information) via the user terminal 12 and stores it in the storage device.
Based on the item identification dictionary information stored in the item identification dictionary information storage unit 180 described below, the item information acquisition unit 170 searches for item character information that matches the character information generated by the voice recognition unit 120 from the voice information input by the first voice information input unit 111, and acquires, by means of the processing device, the item identification information corresponding to the item character information found.
The item identification dictionary information storage unit 180 stores in the storage device, for each of a plurality of input items, item identification dictionary information that associates item identification information indicating the input item with item character information, which is predetermined character information. For example, for an input item in which an amount is entered, the item identification dictionary information storage unit 180 stores the item identification information "amount input field" in association with character information such as "amount", "money", and "price".
The attribute information acquisition unit 130 acquires the attribute information of the input item indicated by the item identification information acquired by the item information acquisition unit 170. The character information acquisition unit 140 searches the dictionary information stored in the attribute-specific dictionary information storage unit 150 and selects the dictionary information corresponding to the attribute information acquired by the attribute information acquisition unit 130. Based on the selected dictionary information, the character information acquisition unit 140 acquires the first character information corresponding to the second character information that matches the character information generated by the voice recognition unit 120 from the voice information input by the second voice information input unit 112. The character information acquisition unit 140 then writes (stores) the acquired first character information in the input item.
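The flow through units 170, 130, and 140 can be sketched as below. This is a simplified illustration under several assumptions: utterances arrive already converted to text, matching is done by substring search, and the dictionaries contain only a few sample entries. The "amount input field" entry with "amount", "money", and "price" follows the text; the address entry and all function and variable names are hypothetical.

```python
# Item identification dictionary: item identification information ->
# item character information (words that select that item).
ITEM_DICT = {
    "amount input field": ["amount", "money", "price"],  # from the text
    "address input field": ["address"],                  # assumed entry
}

# Assumed per-item dictionaries: reading -> first character information.
ATTRIBUTE_DICTS = {
    "amount input field": {"zyuumann": "100,000"},
    "address input field": {"toukyouto": "Tokyo"},
}


def find_item(operator_text):
    """Return the input item whose item character information appears
    in the operator's utterance, or None (substring match is assumed)."""
    for item, words in ITEM_DICT.items():
        if any(word in operator_text for word in words):
            return item
    return None


def run_dialogue(turns):
    """Fill a form from (speaker, text) turns.

    Operator turns switch the current input item; user turns are looked
    up in the dictionary for that item and written into the form.
    """
    form = {}
    current = None
    for speaker, text in turns:
        if speaker == "operator":
            item = find_item(text)
            if item is not None:
                current = item
        elif speaker == "user" and current is not None:
            result = ATTRIBUTE_DICTS[current].get(text)
            if result is not None:
                form[current] = result
    return form
```

For instance, an operator turn containing "address" followed by a user turn "toukyouto" fills the address field with "Tokyo" without any cursor operation.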

FIG. 9 is a diagram showing a configuration of the phonetic character conversion device 100 that differs from FIG. 8.
Like the phonetic character conversion device 100 shown in FIG. 3, the phonetic character conversion device 100 shown in FIG. 9 moves one of the functions of the device shown in FIG. 8, namely converting voice information into character information representing its sound according to a language model, out to a separate voice recognition device 101.
Here, the information acquisition unit 160 includes a first information acquisition unit 161 and a second information acquisition unit 162. The first information acquisition unit 161 receives, by means of the processing device, the character information generated by the voice recognition unit 120 from the voice information input by the first voice information input unit 111 and stores it in the storage device. The second information acquisition unit 162 receives, by means of the processing device, the character information generated by the voice recognition unit 120 from the voice information input by the second voice information input unit 112 and stores it in the storage device.
Otherwise, the configuration is the same as that of the phonetic character conversion device 100 shown in FIG. 8.

In the above description, input items are switched by voice information input by the operator. However, input items may also be switched by voice information input by the user, not only by the operator.
The item identification dictionary information storage unit 180 may also store the order of the input items, so that when voice information such as "next" is input, the item information acquisition unit 170 acquires the item identification information of the input item that follows the one on which the cursor is currently placed.
The operator may also change the input item with a button operation or the like on the operator terminal 11, and may correct the information that has been entered (first character information).
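The "next" behavior can be sketched as a lookup in a stored item order. The item names, the function name `next_item`, and the choice to stay on the last item when there is no following one are assumptions made for illustration; the text does not state what happens at the end of the list.

```python
# Assumed order of input items as stored by the item identification
# dictionary information storage unit.
ITEM_ORDER = ["name input field", "address input field", "amount input field"]


def next_item(current, order=ITEM_ORDER):
    """Return the item following `current`; stay put on the last item."""
    index = order.index(current)
    if index + 1 < len(order):
        return order[index + 1]
    return current
```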

As described above, with the phonetic character conversion device 100 according to this embodiment, an electronic document can be created simply by having the user and the operator converse.

Embodiment 3.
This embodiment describes a method of raising the recognition accuracy of voice information by using two pieces of voice information: the voice information input by the user and the voice information input by the operator.

FIG. 10 is a conceptual diagram outlining the functions of the phonetic character conversion device 100 according to this embodiment.
FIG. 10 shows, as an example, the case where the operator asks the user for an address. Suppose the user answers "It is Tokyo" (東京都). The phonetic character conversion device 100 receives this voice information and converts it into character information representing its sound. Suppose it is converted to "touhyouto" (とうひょうと). The operator, to confirm the user's answer, repeats "Tokyo, is that right?". The phonetic character conversion device 100 receives the voice information of the operator's repetition and converts it into character information representing its sound. Suppose it is converted to "tookyouto" (とおきょうと). The phonetic character conversion device 100 then compares the two pieces of character information, "touhyouto" and "tookyouto", generated from the two pieces of voice information. The comparison shows that two positions do not match: "u" (う) versus "o" (お), and "hi" (ひ) versus "ki" (き). For each mismatched position, the device judges whether the character converted from the user's voice information or the character converted from the operator's voice information is more likely correct, and adopts the more likely one. Suppose that, for "u" and "o", the character "u" converted from the user's voice information is judged more likely, and for "hi" and "ki", the character "ki" converted from the operator's voice information is judged more likely. That is, "u" and "ki" are adopted, and the character information "toukyouto" (とうきょうと) is generated.
The phonetic character conversion device 100 also modifies, based on the recognition result, the language model, which defines for each speaker the rules for converting voice information into character information, and thereby raises the conversion accuracy (recognition rate). In the above example, the device learns that the user's pronunciation (for example, intonation and sound frequency) that was converted to "touhyouto" must, for this user, be converted to "toukyouto". Based on this result, the phonetic character conversion device 100 modifies the language model.
In this way, the phonetic character conversion device 100 raises the recognition rate by using two pieces of voice information, and raises it further by improving the language model based on the recognition results. The recognition rate therefore increases each time the user and the operator converse.

FIG. 11 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to this embodiment.
The voice recognition unit 120 of the phonetic character conversion device 100 according to this embodiment includes a voice conversion unit 121, a character information comparison unit 122, a mismatched portion determination unit 123, a character information generation unit 124, a language model storage unit 125, and a language model update unit 126. Otherwise, the device is the same as the phonetic character conversion device 100 according to Embodiment 2.
The voice conversion unit 121 converts, by means of the processing device, the first voice information input by the operator (through the first voice information input unit 111) into character information based on the language model for that operator. Similarly, the voice conversion unit 121 converts, by means of the processing device, the second voice information input by the user (through the second voice information input unit 112) into character information based on the language model for that user. When converting voice information into character information, the voice conversion unit 121 also calculates, by means of the processing device, the accuracy (the likelihood that the conversion is correct) for each character of the converted character information.
The character information comparison unit 122 compares the two pieces of character information produced by the voice conversion unit 121 and extracts the mismatched portions by means of the processing device.
The mismatched portion determination unit 123 determines, by means of the processing device, the character information for each mismatched portion extracted by the character information comparison unit 122 to be the character information of that portion in one of the two pieces of character information. Based on the accuracy calculated by the voice conversion unit 121, the mismatched portion determination unit 123 decides, character by character, which character information to use for the mismatched portion.
The character information generation unit 124 replaces the mismatched portions of one of the two pieces of character information with the character information determined by the mismatched portion determination unit 123, and generates character information by means of the processing device.
The language model storage unit 125 stores in the storage device, for each person who inputs voice information (that is, for each user and each operator), a language model for converting voice information into character information.
The language model update unit 126 updates, by means of the processing device, the operator's language model based on the character information generated by the character information generation unit 124 and the first voice information, and updates the user's language model based on the character information generated by the character information generation unit 124 and the second voice information.

FIG. 12 is a diagram that supplements the description, given with reference to FIG. 10, of the process of converting voice information into character information.
As described above, the language model storage unit 125 stores a language model for each person who inputs voice information, and the voice conversion unit 121 converts voice information into character information based on the language model of the person who input it. That is, so-called speaker-dependent voice recognition is performed. In this case, the recognition accuracy depends on how much information the language model holds about the pronunciation of that person (the person who input the voice information), that is, on how much it has learned. The more the model has learned about the person's pronunciation, the higher the recognition accuracy. The recognition accuracy also varies with factors such as the clarity of pronunciation. The voice conversion unit 121 calculates this recognition accuracy for each character while converting voice information into character information.
For example, in the example shown in FIG. 12, suppose the user is a new customer whose language model is untrained, while the operator is highly experienced and the operator's language model is well trained. The character information "touhyouto" (とうひょうと) converted from the voice information input by the user therefore has low recognition accuracy overall, whereas the character information "tookyouto" (とおきょうと) converted from the voice information input by the operator has high recognition accuracy overall. However, the "o" (お) in the operator's "tookyouto" was pronounced unclearly, so its recognition accuracy is low. Suppose the recognition accuracies of the mismatched characters extracted by the character information comparison unit 122 are as follows: in the user's "touhyouto", "u" (う) is 60% and "hi" (ひ) is 30%; in the operator's "tookyouto", "o" (お) is 50% and "ki" (き) is 90%.
The mismatched portion determination unit 123 compares "u" (60%) with "o" (50%) and adopts "u", which has the higher recognition accuracy, and compares "hi" (30%) with "ki" (90%) and adopts "ki", which has the higher recognition accuracy.
The character information generation unit 124, for example, replaces the mismatched characters "u" and "hi" in "touhyouto" with "u" and "ki", respectively, and generates the character information "toukyouto" (とうきょうと).
The language model update unit 126 updates the user's language model based on the character information "toukyouto" generated by the character information generation unit 124 and the user's pronunciation ("touhyouto"). It likewise updates the operator's language model based on the character information "toukyouto" generated by the character information generation unit 124 and the operator's pronunciation ("tookyouto").
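The comparison and selection performed by units 122 to 124 can be sketched with the numbers from this example. The sketch assumes that both recognition results have the same length and are aligned position by position, and that ties go to the first sequence; the text specifies neither. Accuracies not stated in the text are left as `None`, since they are never compared. The function name `merge_by_accuracy` is hypothetical.

```python
def merge_by_accuracy(first, second):
    """Merge two recognized sequences of (character, accuracy) pairs.

    Matching characters are kept as they are. At a mismatched position
    the character with the higher accuracy is adopted.
    Assumes equal length and position-wise alignment.
    """
    if len(first) != len(second):
        raise ValueError("this sketch only handles equal-length sequences")
    merged = []
    for (char1, acc1), (char2, acc2) in zip(first, second):
        if char1 == char2:
            merged.append(char1)
        else:
            merged.append(char1 if acc1 >= acc2 else char2)
    return "".join(merged)


# User's result "touhyouto"; only the mismatched accuracies are given.
user = [("と", None), ("う", 0.60), ("ひ", 0.30),
        ("ょ", None), ("う", None), ("と", None)]
# Operator's result "tookyouto".
operator = [("と", None), ("お", 0.50), ("き", 0.90),
            ("ょ", None), ("う", None), ("と", None)]

print(merge_by_accuracy(user, operator))  # とうきょうと ("toukyouto")
```

The merged string and each speaker's original utterance would then serve as the pair from which that speaker's language model is updated.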

Any method may be used to extract the specific voice information from the voice information input by the user or the operator (for example, extracting "toukyouto" from the voice information "toukyouto desu"). For example, "desu" and "desune", which are commonly appended to the end of an utterance, and "sorewa", which is placed at the beginning (as in "sorewa ... desu"), may be dropped. The information to be dropped may also be learned in the same way as the language model.
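One simple way to drop such filler, shown here only as an example since the text leaves the method open, is to strip a fixed list of leading and trailing expressions. The lists below contain just the three expressions named in the text, written in kana; the function name `extract_target` is hypothetical.

```python
# Trailing expressions, longest first so that "ですね" is tried before "です".
SUFFIXES = ["ですね", "です"]   # "desune", "desu"
# Leading expressions.
PREFIXES = ["それは"]           # "sorewa"


def extract_target(text):
    """Strip one known leading and one known trailing filler expression."""
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    for suffix in SUFFIXES:
        if text.endswith(suffix):
            text = text[:-len(suffix)]
            break
    return text
```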

As described above, the phonetic character conversion device 100 according to this embodiment raises recognition accuracy by using the two pieces of voice information from the user and the operator to produce a single piece of character information.
In addition, because the language model is updated using the character information generated from the two pieces of voice information together with the input voice information, recognition accuracy increases the more the user and the operator converse.
Combining this with the character information conversion process described in Embodiment 1 can further raise the accuracy of deriving a recognition result from voice information.

Embodiment 4.
This embodiment describes an example in which the phonetic character conversion device 100 described in the above embodiments is applied to a call center system.

FIG. 13 is a conceptual diagram outlining the functions of the phonetic character conversion device 100 according to this embodiment.
For example, when a product is sold in the financial industry or a similar field, the seller may be required to give the user a prescribed explanation, such as the risks of the product, in advance. In a call center, when a user offers to purchase a product, the operator gives the required explanation. However, because the required explanation is fixed for each product, the operator is merely reading it aloud.
The phonetic character conversion device 100 therefore gives the explanation by voice in place of the operator. When the voice explanation ends, the phonetic character conversion device 100 asks the user to confirm whether he or she accepts the content of the explanation. The user indicates by voice whether he or she accepts it, just as when talking with the operator. The phonetic character conversion device 100 then recognizes the input voice information using a confirmation recognition dictionary and sends the recognition result to the operator. From the recognition result, the operator can tell whether the user has accepted the explanation. If the user has accepted it, the operator talks with the user and proceeds to the process for purchasing the product.
The phonetic character conversion device 100 also accepts voice input from the user during the voice explanation. For example, a user who wants the explanation paused says so by voice. The phonetic character conversion device 100 then recognizes the input voice using an interrupt recognition dictionary and pauses the explanation. Similarly, a user who has a question about the explanation says so by voice. The phonetic character conversion device 100 then recognizes the input voice using the interrupt recognition dictionary and sends the recognition result to the operator. From the recognition result, the operator can tell that the user has a question about the explanation, and can then talk with the user and answer it.
In this way, the phonetic character conversion device 100 gives the explanation by voice in place of the operator and accepts the user's responses by voice. The operator can therefore do other work while the phonetic character conversion device 100 is giving the explanation, which improves work efficiency. The user receives the same explanation as would be given by the operator. Furthermore, a user with a question can say so by voice, just as when an operator is handling the call, rather than operating a machine, and so does not need to perform unfamiliar machine operations.

FIG. 14 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to this embodiment.
The phonetic character conversion device 100 according to this embodiment includes a voice output unit 190, a confirmation information request unit 200, and a transmission unit 210 in addition to the functions of the phonetic character conversion device 100 according to the first embodiment.
The voice output unit 190 outputs predetermined explanation information stored in the storage device as voice information. The voice output unit 190 may output explanation information that was stored in advance as voice information as it is, or may convert explanation information stored as character information or the like into voice information and output it.
When the voice output unit 190 finishes outputting the explanation information, the confirmation information request unit 200 uses the processing device to request input of predetermined confirmation information. That is, the confirmation information request unit 200 requests input indicating whether or not the user accepts the explained content. In response, the voice information input unit 110 receives the confirmation information entered by the user as voice information, and the voice recognition unit 120 converts it into character information. The character information acquisition unit 140 then uses the confirmation recognition dictionary, which is the dictionary information for confirmation information among the dictionary information stored in the attribute-specific dictionary information storage unit 150, to acquire the first character information corresponding to the second character information that matches the character information converted by the voice recognition unit 120. For example, “了解” (accept), “取消” (cancel), and the like are registered as first character information. “リョウカイ” (ryokai), “カクニン” (kakunin), and the like are registered as second character information for the first character information “了解”, and “トリケシ” (torikeshi), “キャンセル” (kyanseru), and the like are registered as second character information for the first character information “取消”.
The transmission unit 210 transmits the first character information acquired by the character information acquisition unit 140, as confirmation information, to the operator's terminal via the communication device.
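
The dictionary lookup described above can be illustrated with a minimal Python sketch. The dictionary entries are the examples given in the text; the function names, the in-memory dictionary layout, and the `outbox` list standing in for the operator's terminal are assumptions made only for illustration and do not come from the patent.

```python
from typing import Optional

# Confirmation recognition dictionary (layout is an assumption): each first
# character information maps to its registered second character information.
CONFIRMATION_DICTIONARY = {
    "了解": ["リョウカイ", "カクニン"],
    "取消": ["トリケシ", "キャンセル"],
}


def acquire_first_character_information(
    recognized: str, dictionary: dict[str, list[str]]
) -> Optional[str]:
    """Return the first character information whose second character
    information matches the recognized text, or None if nothing matches."""
    for first_info, second_infos in dictionary.items():
        if recognized in second_infos:
            return first_info
    return None


def send_to_operator(confirmation: str, outbox: list[str]) -> None:
    """Hypothetical stand-in for the transmission unit 210: the result is
    appended to a list instead of being sent over a communication device."""
    outbox.append(confirmation)


outbox: list[str] = []
result = acquire_first_character_information("カクニン", CONFIRMATION_DICTIONARY)
if result is not None:
    send_to_operator(result, outbox)
```

In this sketch an utterance with no registered reading yields `None`; how the actual device treats unmatched input is not stated in the text.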

In addition, while the voice output unit 190 is outputting the explanation information, the voice information input unit 110 receives interrupt information entered by the user as voice information. The voice recognition unit 120 converts the input voice information into character information. The character information acquisition unit 140 uses the interrupt recognition dictionary, which is the dictionary information for interrupt information among the dictionary information stored in the attribute-specific dictionary information storage unit 150, to acquire the first character information corresponding to the second character information that matches the character information converted by the voice recognition unit 120. The transmission unit 210 transmits the first character information acquired by the character information acquisition unit 140, as interrupt information, to the operator's terminal via the communication device.
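
As a rough illustration of interrupt handling during the explanation, the following Python sketch pauses on a pause request and forwards a question to the operator. The contents of the interrupt recognition dictionary are not given in the text, so the entries below, the `ExplanationSession` class, and the choice not to forward pause requests to the operator are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical interrupt recognition dictionary; the entries are illustrative.
INTERRUPT_DICTIONARY = {
    "一時停止": ["テイシ", "ストップ"],
    "質問": ["シツモン"],
}


@dataclass
class ExplanationSession:
    """Hypothetical state of one voice explanation in progress."""
    paused: bool = False
    operator_inbox: list = field(default_factory=list)

    def handle_interrupt(self, recognized: str) -> None:
        """Look up the recognized text and react to the matching interrupt."""
        for first_info, readings in INTERRUPT_DICTIONARY.items():
            if recognized in readings:
                if first_info == "一時停止":
                    # A pause request is handled by the device itself.
                    self.paused = True
                else:
                    # Other interrupts are passed to the operator's terminal.
                    self.operator_inbox.append(first_info)
                return
        # Unregistered input is ignored in this sketch.


session = ExplanationSession()
session.handle_interrupt("ストップ")
session.handle_interrupt("シツモン")
```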

The attribute information acquisition unit 130 acquires an interrupt attribute as attribute information from the voice output unit 190 while the voice output unit 190 is outputting voice information, and acquires a confirmation attribute as attribute information from the voice output unit 190 once the voice output unit 190 has finished outputting the voice information. As in the first embodiment, the character information acquisition unit 140 switches the dictionary information to be used according to the attribute information acquired by the attribute information acquisition unit 130.
Input from the user may also be accepted through button operations or the like in addition to voice information.
Likewise, the explanation need not be given by voice alone; video or the like may also be displayed on the user's terminal.
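
The attribute-based dictionary switching can be sketched in Python as follows. The `Attribute` enum, the boolean flag representing whether the voice output unit 190 is still outputting, and the interrupt dictionary entry are assumptions; only the confirmation entries come from the text.

```python
from enum import Enum
from typing import Optional


class Attribute(Enum):
    INTERRUPT = "interrupt"        # while the explanation is being output
    CONFIRMATION = "confirmation"  # after the explanation has finished


# One dictionary per attribute; the interrupt entry is illustrative only.
DICTIONARIES = {
    Attribute.INTERRUPT: {"質問": ["シツモン"]},
    Attribute.CONFIRMATION: {
        "了解": ["リョウカイ", "カクニン"],
        "取消": ["トリケシ", "キャンセル"],
    },
}


def current_attribute(explanation_in_progress: bool) -> Attribute:
    """Mimic the attribute reported by the voice output unit 190."""
    if explanation_in_progress:
        return Attribute.INTERRUPT
    return Attribute.CONFIRMATION


def lookup(recognized: str, explanation_in_progress: bool) -> Optional[str]:
    """Select the dictionary for the current attribute, then match the text."""
    dictionary = DICTIONARIES[current_attribute(explanation_in_progress)]
    for first_info, readings in dictionary.items():
        if recognized in readings:
            return first_info
    return None
```

With this arrangement the same utterance can be accepted in one phase and rejected in the other: "カクニン" matches only once the explanation has finished.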

FIG. 15 is an explanatory diagram of an implementation example (an example of a bidirectional explanation confirmation program) of the phonetic character conversion device 100 according to this embodiment.
In the bidirectional explanation confirmation program shown in FIG. 15, the explanatory text is displayed on the user's terminal and is also output as voice. Similarly, when confirmation is requested, confirm and cancel buttons are displayed on the user's terminal together with a voice prompt, and confirmation or cancellation is accepted by voice as well as by button.
A displayed tag (for example, an INPUT tag) and a voice output tag (a VOICE tag) are associated with each other by tag name.
Although FIG. 15 shows an implementation example using program code that extends HTML, the program may instead be implemented with JavaScript (registered trademark) or AJAX as described with reference to FIG. 6.
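
The association between displayed tags and voice output tags can be illustrated with a small Python sketch. The actual markup of FIG. 15 is not reproduced in this text, so the sample markup, the use of a `name` attribute as the linking key, and the helper names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical extended-HTML fragment; not the markup shown in FIG. 15.
SAMPLE_MARKUP = """
<input type="button" name="confirm" value="Confirm">
<voice name="confirm">Say OK to confirm.</voice>
<input type="button" name="cancel" value="Cancel">
<voice name="cancel">Say cancel to stop.</voice>
"""


class TagPairing(HTMLParser):
    """Collect the names used by INPUT tags and by VOICE tags."""

    def __init__(self) -> None:
        super().__init__()
        self.displayed: set = set()
        self.voiced: set = set()

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("name")
        if name is None:
            return
        # HTMLParser reports tag names in lower case.
        if tag == "input":
            self.displayed.add(name)
        elif tag == "voice":
            self.voiced.add(name)


def paired_names(markup: str) -> list:
    """Return the names that link a displayed tag to a voice output tag."""
    parser = TagPairing()
    parser.feed(markup)
    parser.close()
    return sorted(parser.displayed & parser.voiced)
```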

As described above, with the phonetic character conversion device 100 according to this embodiment, the device gives the explanation by voice on behalf of the operator and accepts the user's responses by voice. The operator can therefore do other work while the phonetic character conversion device 100 is giving the explanation, which raises work efficiency.
Because voice operation alone is sufficient, the user is not inconvenienced either.
Moreover, because the phonetic character conversion device 100 outputs the configured explanatory text without embellishment, it can convey the content more accurately than an explanation given by an operator.

Next, the hardware configuration of the phonetic character conversion device 100 in the above embodiments will be described.
FIG. 16 is a diagram showing an example of the hardware configuration of the phonetic character conversion device 100.
As shown in FIG. 16, the phonetic character conversion device 100 includes a CPU 911 (Central Processing Unit; also referred to as a central processing device, processing device, arithmetic device, microprocessor, microcomputer, or processor) that executes programs. The CPU 911 is connected via a bus 912 to a ROM 913, a RAM 914, an LCD 901 (Liquid Crystal Display), a keyboard 902, a communication board 915, and a magnetic disk device 920, and controls these hardware devices. A storage device such as an optical disk device or a memory card read/write device may be used instead of the magnetic disk device 920.

The ROM 913 and the magnetic disk device 920 are examples of nonvolatile memory. The RAM 914 is an example of volatile memory. The ROM 913, the RAM 914, and the magnetic disk device 920 are examples of storage devices. The communication board 915 and the keyboard 902 are examples of input devices. The communication board 915 is also an example of an output device and an example of a communication device. The LCD 901 is an example of a display device.

An operating system 921 (OS), a window system 922, a program group 923, and a file group 924 are stored in the magnetic disk device 920, the ROM 913, or the like. The programs in the program group 923 are executed by the CPU 911, the operating system 921, and the window system 922.

The program group 923 stores the programs that execute the processes of the phonetic character conversion device 100 in the above description, together with other programs. The programs are read and executed by the CPU 911. That is, the program group 923 stores programs that execute the functions described as the “voice information input unit 110”, “voice recognition unit 120”, “attribute information acquisition unit 130”, “character information acquisition unit 140”, “information acquisition unit 160”, and “item information acquisition unit 170”, together with other programs.
The file group 924 stores the information, data, signal values, variable values, and parameters handled by the phonetic character conversion device 100 in the above description as items of a “file” or a “database”. That is, the information stored by the “attribute-specific dictionary information storage unit 150” and the “item identification dictionary information storage unit 180” is stored as items of a “file” or a “database”. The “file” or “database” is stored in a recording medium such as a disk or a memory. The information, data, signal values, variable values, and parameters stored in a storage medium such as a disk or a memory are read out by the CPU 911 into a main memory or a cache memory via a read/write circuit, and are used for CPU 911 operations such as extraction, search, reference, comparison, arithmetic, calculation, processing, output, printing, and display. During these operations, the information, data, signal values, variable values, and parameters are temporarily stored in the main memory, the cache memory, or a buffer memory.
The arrows in the flowcharts in the above description mainly indicate the input and output of data and signals, and the data and signal values are recorded in the memory of the RAM 914 or in another recording medium such as an optical disk. Data and signals are transmitted online via the bus 912, signal lines, cables, or other transmission media.

In the above description, what is described as a “~ unit” may be a “~ circuit”, “~ device”, “~ equipment”, “~ means”, or “~ function”, and may also be a “~ step”, “~ procedure”, or “~ process”. Likewise, what is described as a “~ device” may be a “~ circuit”, “~ device”, “~ equipment”, “~ means”, or “~ function”, and may also be a “~ step”, “~ procedure”, or “~ process”. Furthermore, what is described as a “~ process” may be a “~ step”. That is, what is described as a “~ unit” may be realized by firmware stored in the ROM 913. Alternatively, it may be implemented by software only, by hardware only (elements, devices, boards, wiring, and the like), by a combination of software and hardware, or by a combination that further includes firmware. Firmware and software are stored as programs in a recording medium such as the ROM 913. A program is read and executed by the CPU 911. In other words, a program causes a computer or the like to function as a “~ unit” described above, or causes a computer or the like to execute the procedure or method of a “~ unit” described above.

FIG. 1 is a conceptual diagram showing an outline of the functions of the phonetic character conversion device 100 according to the first embodiment.
FIG. 2 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to the first embodiment.
FIG. 3 is a diagram showing a configuration of the phonetic character conversion device 100 according to the first embodiment that differs from FIG. 2.
FIG. 4 is a diagram showing an example of the screen information displayed by the application 10 and the dictionary information stored in the attribute-specific dictionary information storage unit 150.
FIG. 5 is an explanatory diagram of an implementation example (an example of a phonetic character conversion program) of the phonetic character conversion device 100 according to the first embodiment.
FIG. 6 is a diagram showing an implementation example using JavaScript (registered trademark) or AJAX.
FIG. 7 is a conceptual diagram showing an outline of the functions of the phonetic character conversion device 100 according to the second embodiment.
FIG. 8 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to the second embodiment.
FIG. 9 is a diagram showing a configuration of the phonetic character conversion device 100 according to the second embodiment that differs from FIG. 8.
FIG. 10 is a conceptual diagram showing an outline of the functions of the phonetic character conversion device 100 according to the third embodiment.
FIG. 11 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to the third embodiment.
FIG. 12 is a supplementary explanatory diagram of the processing for converting voice information into character information described with reference to FIG. 10.
FIG. 13 is a conceptual diagram showing an outline of the functions of the phonetic character conversion device 100 according to the fourth embodiment.
FIG. 14 is a functional block diagram showing the functions of the phonetic character conversion device 100 according to the fourth embodiment.
FIG. 15 is an explanatory diagram of an implementation example (an example of a bidirectional explanation confirmation program) of the phonetic character conversion device 100 according to the fourth embodiment.
FIG. 16 is a diagram showing an example of the hardware configuration of the phonetic character conversion device 100.

Explanation of symbols

10 application, 11 operator terminal, 12 user terminal, 100 phonetic character conversion device, 101 voice recognition device, 110 voice information input unit, 111 first voice information input unit, 112 second voice information input unit, 120 voice recognition unit, 121 voice conversion unit, 122 character information comparison unit, 123 mismatched portion determination unit, 124 character information generation unit, 125 language model storage unit, 126 language model update unit, 130 attribute information acquisition unit, 140 character information acquisition unit, 150 attribute-specific dictionary information storage unit, 160 information acquisition unit, 161 first information acquisition unit, 162 second information acquisition unit, 170 item information acquisition unit, 180 item identification dictionary information storage unit, 190 voice output unit, 200 confirmation information request unit, 210 transmission unit.

Claims

A phonetic character conversion device comprising:
a first voice information input unit that inputs first voice information entered at a first terminal and stores it in a storage device;
a second voice information input unit that inputs second voice information entered at a second terminal and stores it in a storage device;
a voice conversion unit that, by a processing device, converts the first voice information input by the first voice information input unit into first character information and calculates, for the converted first character information, a first accuracy indicating, character by character, the likelihood that the conversion from the first voice information is correct, and that, by the processing device, converts the second voice information input by the second voice information input unit into second character information and calculates, for the second character information, a second accuracy indicating, character by character, the likelihood that the conversion from the second voice information is correct;
a character information comparison unit that, by a processing device, compares the first character information and the second character information converted by the voice conversion unit and extracts a mismatched portion;
a mismatched portion determination unit that, by a processing device, compares the first accuracy and the second accuracy character by character for the character information of the mismatched portion extracted by the character information comparison unit, and determines, character by character, which character of the first character information or the second character information to adopt; and
a character information generation unit that, by a processing device, generates character information by replacing, character by character, the mismatched portion of the first character information or the second character information with the character information determined by the mismatched portion determination unit.

The phonetic character conversion device according to claim 1, wherein the mismatched portion determination unit determines, for the character information of the mismatched portion, the first character information when the first accuracy is higher and the second character information when the second accuracy is higher.

The phonetic character conversion device according to claim 1 or 2, further comprising:
a language model storage unit that stores in a storage device, for each user who uses the first terminal and the second terminal, a language model for converting voice information into character information,
wherein the voice conversion unit converts the first voice information into the first character information based on a first language model that the language model storage unit stores for a first user who uses the first terminal, and converts the second voice information into the second character information based on a second language model that the language model storage unit stores for a second user who uses the second terminal.

The phonetic character conversion device according to claim 3, further comprising:
a language model update unit that, by a processing device, updates the language model of the first user who uses the first terminal based on the character information generated by the character information generation unit and the first voice information, and updates the language model of the second user who uses the second terminal based on the character information generated by the character information generation unit and the second voice information.

A phonetic character conversion method comprising:
a first voice information input step in which a processing device inputs first voice information entered at a first terminal;
a second voice information input step in which the processing device inputs second voice information entered at a second terminal;
a voice conversion step in which the processing device converts the first voice information input in the first voice information input step into first character information and calculates, for the converted first character information, a first accuracy indicating, character by character, the likelihood that the conversion from the first voice information is correct, and converts the second voice information input in the second voice information input step into second character information and calculates, for the second character information, a second accuracy indicating, character by character, the likelihood that the conversion from the second voice information is correct;
a character information comparison step in which the processing device compares the first character information and the second character information converted in the voice conversion step and extracts a mismatched portion;
a mismatched portion determination step in which the processing device compares the first accuracy and the second accuracy character by character for the character information of the mismatched portion extracted in the character information comparison step, and determines, character by character, which character of the first character information or the second character information to adopt; and
a character information generation step in which the processing device generates character information by replacing, character by character, the mismatched portion of the first character information or the second character information with the character information determined in the mismatched portion determination step.

A phonetic character conversion program that causes a computer to execute:
a first voice information input process of inputting first voice information entered at a first terminal;
a second voice information input process of inputting second voice information entered at a second terminal;
a voice conversion process of converting the first voice information input in the first voice information input process into first character information and calculating, for the converted first character information, a first accuracy indicating, character by character, the likelihood that the conversion from the first voice information is correct, and of converting the second voice information input in the second voice information input process into second character information and calculating, for the second character information, a second accuracy indicating, character by character, the likelihood that the conversion from the second voice information is correct;
a character information comparison process of comparing the first character information and the second character information converted in the voice conversion process and extracting a mismatched portion;
a mismatched portion determination process of comparing the first accuracy and the second accuracy character by character for the character information of the mismatched portion extracted in the character information comparison process, and determining, character by character, which character of the first character information or the second character information to adopt; and
a character information generation process of generating character information by replacing, character by character, the mismatched portion of the first character information or the second character information with the character information determined in the mismatched portion determination process.
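
The character-by-character resolution of mismatched portions recited in the claims can be illustrated with a minimal Python sketch. It assumes that the two recognition results are already aligned and of equal length, that each character carries one accuracy score, and that ties go to the first result; none of these details is specified in the claims, and the `Recognized` type and `merge` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognized:
    """One recognition result with a per-character accuracy score."""
    text: str
    accuracy: list


def merge(first: Recognized, second: Recognized) -> str:
    """Keep matching characters; at each mismatch adopt the character
    whose accuracy is higher (ties favor the first result, by assumption)."""
    if len(first.text) != len(second.text):
        raise ValueError("this sketch assumes aligned, equal-length results")
    if len(first.accuracy) != len(first.text):
        raise ValueError("first result needs one accuracy per character")
    if len(second.accuracy) != len(second.text):
        raise ValueError("second result needs one accuracy per character")

    merged = []
    for c1, a1, c2, a2 in zip(first.text, first.accuracy,
                              second.text, second.accuracy):
        if c1 == c2:
            merged.append(c1)  # matched portion: nothing to decide
        else:
            merged.append(c1 if a1 >= a2 else c2)  # mismatched portion
    return "".join(merged)


first = Recognized("ABCD", [0.9, 0.4, 0.8, 0.9])
second = Recognized("AXCY", [0.9, 0.7, 0.8, 0.5])
merged_text = merge(first, second)
```

A real recognizer would typically produce results of different lengths, so an alignment step would be needed before this comparison; that step is outside the scope of the sketch.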