JP2017211430A

JP2017211430A - Information processing device and information processing method

Info

Publication number: JP2017211430A
Application number: JP2016102755A
Authority: JP
Inventors: 早紀横山; Saki Yokoyama
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2016-05-23
Filing date: 2016-05-23
Publication date: 2017-11-30
Also published as: EP3467820A1; WO2017203764A1; EP3467820A4; US20190189122A1

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device and an information processing method with which it is possible to correct texts via speech input. SOLUTION: The information processing device comprises a transmitting unit for transmitting speech information including a text correction command and a correction target, and a receiving unit for receiving processing results based on the correction command and the correction target. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an information processing apparatus and an information processing method.

In recent years, technology for command input by voice has been developing. In voice command input, for example, a speech recognition system recognizes a user utterance as text, parses the recognized text, and executes a command according to the analysis result. With regard to such speech recognition systems, for example, Patent Document 1 below describes a speech recognition correction method that corrects a speech recognition result by using context information. The context information includes a history of user input and a conversation history.

Patent Document 1: JP2015-018265A

However, when characters are input by voice, deleting or correcting characters, switching the type of characters to be input, and similar operations require the use of a physical character input interface. If the user instead speaks the deletion or correction, the spoken words are themselves entered as text as the speech recognition result.

Therefore, the present disclosure proposes an information processing apparatus and an information processing method capable of realizing sentence proofreading by voice input.

According to the present disclosure, there is proposed an information processing apparatus including: a transmission unit that transmits audio information including a proofreading command and a proofreading target for a sentence; and a reception unit that receives a processing result based on the proofreading command and the proofreading target.

According to the present disclosure, there is proposed an information processing apparatus including: a reception unit that receives audio information including a proofreading command and a proofreading target for a sentence; and a transmission unit that transmits a processing result based on the proofreading command and the proofreading target.

According to the present disclosure, there is proposed an information processing method including, by a processor: transmitting audio information including a proofreading command and a proofreading target for a sentence; and receiving an analysis result based on the proofreading command and the proofreading target.

According to the present disclosure, there is proposed an information processing method including, by a processor: receiving audio information including a proofreading command and a proofreading target for a sentence; and transmitting an analysis result based on the proofreading command and the proofreading target.

As described above, according to the present disclosure, sentence proofreading by voice input can be realized.

Note that the above effect is not necessarily limiting. Together with or in place of the above effect, any of the effects described in this specification, or other effects that can be understood from this specification, may be achieved.

FIG. 1 is a diagram illustrating an overview of the information processing system according to the present embodiment.
FIG. 2 is a block diagram illustrating an example of the configuration of the client terminal according to the present embodiment.
FIG. 3 is a block diagram illustrating an example of the configuration of the server according to the present embodiment.
FIG. 4 is a diagram illustrating a specific example in which the type of characters to be input is designated by voice according to the present embodiment.
FIG. 5 is a diagram illustrating a specific example in which kanji conversion of characters to be input is designated by voice according to the present embodiment.
FIG. 6 is a diagram illustrating an example of a user utterance and the analysis result of proofreading information according to the present embodiment.
FIG. 7 is a diagram illustrating an example of the final output result for the user utterance shown in FIG. 6.
FIG. 8 is a diagram illustrating an example of a user utterance and the analysis result of proofreading information that takes context information into account according to the present embodiment.
FIG. 9 is a diagram illustrating an example of the final output result for the user utterance shown in FIG. 8.
FIG. 10 is a flowchart illustrating the operation processing of the information processing system according to the present embodiment.
FIG. 11 is a diagram illustrating another system configuration according to the present embodiment.
FIG. 12 is a block diagram illustrating an example of the configuration of the edge server according to the present embodiment.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

The description will be given in the following order.
1. Overview of information processing system according to an embodiment of the present disclosure
2. Configuration
2-1. Configuration of client terminal
2-2. Configuration of server
3. Operation processing
4. Other system configurations
5. Summary

<< 1. Overview of Information Processing System According to One Embodiment of Present Disclosure >>
First, an overview of an information processing system according to an embodiment of the present disclosure will be described. FIG. 1 is a diagram illustrating an overview of the information processing system according to the present embodiment. As shown in FIG. 1, the information processing system according to the present embodiment includes a client terminal 1 and a server 2. The client terminal 1 and the server 2 are connected via, for example, a network 3, and transmit and receive data.

The information processing system according to the present embodiment is a speech recognition system that realizes character input by voice. It performs speech recognition and text analysis on a user utterance collected by the client terminal 1, and outputs text to the client terminal 1 as the analysis result.

The client terminal 1 may be, for example, a smartphone, a tablet terminal, a mobile phone terminal, a wearable terminal, a personal computer, a game machine, a music player, or the like.

In existing speech recognition systems, it is difficult to switch the character type (uppercase, lowercase, Roman letters, numerals, hiragana, katakana, and so on) by voice, and operation through a physical character input interface has been necessary. In addition, when proofreading an input sentence, if the user speaks a deletion, insertion, correction, or the like, the spoken words are themselves entered as text as the speech recognition result, which has made proofreading by voice difficult.

Furthermore, because kanji have homophones, the intended kanji may not appear after a single conversion, or the user has had to switch to a physical character input interface because the desired kanji could not be produced.

Therefore, the information processing system according to the present embodiment realizes sentence proofreading by voice input and eliminates cumbersome operations such as switching to a physical character input interface when proofreading. Specifically, in the text analysis of a user utterance, the information processing system according to the present embodiment determines whether the utterance is a proofreading utterance or a normal utterance, and analyzes the proofreading information when it is a proofreading utterance.

The overview of the information processing system according to the present embodiment has been described above. Next, the configuration of each device included in the information processing system according to the present embodiment will be described with reference to FIGS. 2 and 3.

<< 2. Configuration >>
<2-1. Configuration of client terminal>
FIG. 2 is a block diagram illustrating an example of the configuration of the client terminal 1 according to the present embodiment. As shown in FIG. 2, the client terminal 1 (information processing apparatus) includes a control unit 10, a voice input unit 11, an imaging unit 12, a sensor 13, a communication unit 14, a display unit 15, and a storage unit 16.

The control unit 10 functions as an arithmetic processing device and a control device, and controls overall operation within the client terminal 1 according to various programs. The control unit 10 is realized by an electronic circuit such as a CPU (Central Processing Unit) or a microprocessor. The control unit 10 may also include a ROM (Read Only Memory) that stores programs to be used, calculation parameters, and the like, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.

The control unit 10 transmits the voice of the user utterance input from the voice input unit 11 to the server 2 from the communication unit 14 via the network 3. The audio information to be transmitted may take the form of the collected audio data (raw data), feature data extracted from the collected audio data (data processed to some extent, such as a phoneme string), or a text analysis result of the collected audio data. The text analysis result of the audio data is, for example, the result of analyzing the proofreading command portion and the proofreading target portion included in the voice of the user utterance. Such analysis can be performed by the local text analysis unit 102 described below. In this specification, a “proofreading command” indicates what kind of proofreading is to be applied to a proofreading target. Examples include modification of an input character string, such as deletion, replacement, or addition; designation of the character type to be input (alphabet, uppercase, lowercase, hiragana, katakana, and so on); and designation of how the input characters are to be expressed (kanji, spelling, and so on). A “proofreading target” indicates the object to which a proofreading command applies.
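The three forms of audio information and the notions of proofreading command and proofreading target described above can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, and enum values (`AudioForm`, `CommandKind`, `ProofreadingInfo`, `AudioMessage`) are assumptions made for this example and do not appear in the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AudioForm(Enum):
    """Hypothetical labels for the three forms of transmitted audio information."""
    RAW = "raw"                  # collected audio data as-is
    FEATURES = "features"        # e.g. a phoneme string extracted from the audio
    TEXT_ANALYSIS = "analysis"   # command/target already analyzed on the client


class CommandKind(Enum):
    """Hypothetical grouping of the proofreading commands listed in the text."""
    DELETE = "delete"
    REPLACE = "replace"
    ADD = "add"
    CHARACTER_TYPE = "character_type"  # e.g. katakana, uppercase
    EXPRESSION = "expression"          # e.g. a particular kanji or spelling


@dataclass
class ProofreadingInfo:
    command: CommandKind
    target: str                     # the proofreading target
    argument: Optional[str] = None  # e.g. "katakana" for CHARACTER_TYPE


@dataclass
class AudioMessage:
    form: AudioForm
    payload: bytes = b""
    analysis: Optional[ProofreadingInfo] = None

    def __post_init__(self) -> None:
        # A text-analysis message is only meaningful if it carries the analysis.
        if self.form is AudioForm.TEXT_ANALYSIS and self.analysis is None:
            raise ValueError("TEXT_ANALYSIS messages must carry an analysis result")
```

Keeping the form explicit in the message would let a receiver decide which processing stages it still has to run, although the disclosure does not prescribe any particular message layout.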

In addition, the control unit 10 transmits, as context information, a captured image of the user's motion captured by the imaging unit 12 at the time of the user utterance and sensor information detected by the sensor 13 (such as information on touches on the screen) to the server 2 from the communication unit 14 via the network 3. The context information to be transmitted may take the form of the acquired captured image or sensor information (raw data), feature data extracted from the acquired captured image or sensor information (data processed to some extent, such as by vectorization), or an analysis result (recognition result) of the acquired captured image or sensor information. The analysis result of the captured image or sensor information is, for example, the result of recognizing a motion or operation of the user.

As shown in FIG. 2, the control unit 10 can also function as a local speech recognition unit 101, a local text analysis unit 102, and a local final output determination unit 103.

The local speech recognition unit 101 performs speech recognition on the audio signal of the user utterance input from the voice input unit 11, and converts the user utterance into text. The local speech recognition unit 101 according to the present embodiment is a subset of the speech recognition unit 201 of the server 2 described later, and has a simplified speech recognition function.

The local text analysis unit 102 analyzes the character string converted into text by speech recognition. Specifically, the local text analysis unit 102 refers to proofreading utterance data stored in advance in the storage unit 16, and analyzes whether the character string is an utterance for plain character input (a normal utterance) or a proofreading utterance. The local text analysis unit 102 outputs the likelihood that the utterance is a proofreading utterance and, if it is one, the proofreading target and the proofreading command. The likelihood is calculated as a score indicating confidence. The local text analysis unit 102 may also output a plurality of candidates together with their scores. Furthermore, the local text analysis unit 102 may take into account the image captured by the imaging unit 12 at the time of the user utterance and other sensor information detected by the sensor 13 (acceleration sensor information, touch sensor information, and so on). The local text analysis unit 102 according to the present embodiment is a subset of the text analysis unit 202 of the server 2 described later, and has a simplified analysis function. Specifically, because the amount of proofreading utterance data used by the local text analysis unit 102 is smaller than the amount held by the server 2, it can understand the proofreading term “delete” (削除), for example, but cannot understand expressions such as “I want to erase it” (消したい) or “I would like it erased” (消して欲しいな) as proofreading terms.
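The difference between the local subset and the server-side analysis can be sketched as the same lookup run against vocabularies of different sizes. This is a minimal illustration, not the method of the disclosure: the English phrases stand in for the Japanese examples, the vocabularies, the `Candidate` type, and the fixed scores (0.9 when the phrase opens the utterance, 0.5 otherwise) are all placeholder assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical proofreading utterance data: phrase -> proofreading command.
# The client-side subset knows fewer phrasings than the server.
LOCAL_VOCAB: Dict[str, str] = {"delete": "delete"}
SERVER_VOCAB: Dict[str, str] = {
    "delete": "delete",
    "want to erase": "delete",
    "i would like it erased": "delete",
}


@dataclass
class Candidate:
    command: str   # proofreading command
    target: str    # proofreading target
    score: float   # confidence that this is a proofreading utterance


def analyze(text: str, vocab: Dict[str, str]) -> List[Candidate]:
    """Return proofreading candidates, best first; an empty list means the
    utterance looks like a normal utterance to this vocabulary."""
    lowered = text.lower()
    candidates = []
    for phrase, command in vocab.items():
        index = lowered.find(phrase)
        if index < 0:
            continue
        # Whatever remains after removing the command phrase is the target.
        target = (text[:index] + text[index + len(phrase):]).strip()
        # Placeholder scoring: a leading command phrase is treated as more certain.
        score = 0.9 if index == 0 else 0.5
        candidates.append(Candidate(command, target, score))
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```

With these toy vocabularies, `analyze("want to erase the last sentence", LOCAL_VOCAB)` returns no candidates, while the same call with `SERVER_VOCAB` yields a delete command, mirroring the limitation described above.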

The local final output determination unit 103 has a function of determining what is finally output. For example, the local final output determination unit 103 determines whether the user utterance is a normal utterance or a proofreading utterance on the basis of a specific keyword extracted by speech recognition (for example, “proofreading mode” or “switch”) or the text analysis result. When the utterance is determined to be a normal utterance, the local final output determination unit 103 outputs the recognized character string as it is on the screen of the display unit 15. When the utterance is determined to be a proofreading utterance, the local final output determination unit 103 proofreads the input sentence on the basis of the proofreading target and the proofreading command analyzed by the local text analysis unit 102, and outputs the proofreading result on the screen of the display unit 15. When there are a plurality of analysis results, the local final output determination unit 103 may decide which analysis result to use by referring to the score indicating the confidence of each candidate.
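The decision described above, choosing between echoing the recognized string and applying a proofreading command, can be sketched as follows. The tuple layout, the `threshold` parameter, and the restriction to a single "delete" command are assumptions made for this example; the disclosure states only that confidence scores may be consulted.

```python
from typing import List, Tuple

# Hypothetical candidate layout: (command, target, confidence score).
Candidate = Tuple[str, str, float]


def decide_output(screen_text: str, recognized: str,
                  candidates: List[Candidate], threshold: float = 0.7) -> str:
    """Return the text to display after one utterance."""
    # Pick the candidate with the highest confidence, if there is one.
    best = max(candidates, key=lambda c: c[2], default=None)
    if best is None or best[2] < threshold:
        # Normal utterance: output the recognized string as it is.
        return screen_text + recognized
    command, target, _ = best
    if command == "delete":
        # Proofreading utterance: remove the first occurrence of the target.
        return screen_text.replace(target, "", 1)
    # Commands not covered by this sketch leave the text unchanged.
    return screen_text
```

A fixed threshold is the simplest possible policy; an implementation could equally compare the local score with a server-side score before committing to either interpretation.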

Note that the local final output determination unit 103 according to the present embodiment is a subset of the final output determination unit 203 of the server 2 described later, and has a simplified determination function.

The functional configuration of the control unit 10 has been described above. The control unit 10 can increase processing speed by performing processing with the local subsets, namely the local speech recognition unit 101, the local text analysis unit 102, and the local final output determination unit 103, but the present embodiment is not limited to this. For example, when the subsets cannot perform sufficient processing or when an error occurs, the control unit 10 may transmit data to the server 2 to request processing, and receive and use the processing result from the server 2. Alternatively, the control unit 10 may transmit data to the server 2 to request processing while also performing processing with the subsets, and select the data to use by waiting a predetermined time for the processing result from the server 2 or by referring to a score indicating the confidence of each processing result.
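The second strategy above, requesting server processing while also processing locally, waiting a limited time, and then choosing by confidence score, might look like the sketch below. The function name `process`, the `(text, score)` result shape, and the default wait of 0.5 seconds are assumptions for illustration; `local` and `remote` are caller-supplied stand-ins for the local subsets and the server request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Tuple

Result = Tuple[str, float]  # (output text, confidence score)


def process(utterance: str,
            local: Callable[[str], Optional[Result]],
            remote: Callable[[str], Result],
            wait_seconds: float = 0.5) -> Result:
    """Run the server request and the local subset side by side, then
    keep whichever available result has the higher confidence score."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        pending = pool.submit(remote, utterance)   # ask the server right away
        local_result = local(utterance)            # process locally meanwhile
        try:
            remote_result = pending.result(timeout=wait_seconds)
        except Exception:
            # The server timed out or failed; fall back to the local result.
            remote_result = None
    finally:
        pool.shutdown(wait=False)
    results = [r for r in (local_result, remote_result) if r is not None]
    if not results:
        raise RuntimeError("neither the local subset nor the server produced a result")
    return max(results, key=lambda r: r[1])
```

Returning `None` from `local` models the case in which the subset could not process the utterance sufficiently, so that only the server result is considered.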

The voice input unit 11 collects the user's voice and surrounding environmental sounds, and outputs an audio signal to the control unit 10. Specifically, the voice input unit 11 is realized by a microphone, an amplifier, and the like. The voice input unit 11 may also be realized by a microphone array including a plurality of microphones.

The imaging unit 12 captures the area around the user's face and the user's motion, and outputs the captured image to the control unit 10. The imaging unit 12 includes a lens system made up of an imaging lens, a diaphragm, a zoom lens, a focus lens, and the like; a drive system that causes the lens system to perform focus and zoom operations; and a solid-state imaging element array that photoelectrically converts the imaging light obtained by the lens system to generate an imaging signal. The solid-state imaging element array may be realized by, for example, a CCD (Charge Coupled Device) sensor array or a CMOS (Complementary Metal Oxide Semiconductor) sensor array.

The sensor 13 is a generic term for various sensors other than the imaging unit 12 (imaging sensor); examples include an acceleration sensor, a gyro sensor, and a touch sensor provided on the screen of the display unit 15. The sensor 13 outputs the detected sensor information to the control unit 10.

The communication unit 14 is a communication module that transmits and receives data to and from other devices by wire or wirelessly. The communication unit 14 communicates with external devices directly or via a network access point by a method such as wired LAN (Local Area Network), wireless LAN, Wi-Fi (Wireless Fidelity, registered trademark), infrared communication, Bluetooth (registered trademark), or short-range/contactless communication.

The display unit 15 is realized by, for example, a liquid crystal display (LCD) device or an OLED (Organic Light Emitting Diode) device. The display unit 15 displays information on the display screen under the control of the control unit 10.

The storage unit 16 stores programs and the like with which the control unit 10 executes various processes. The storage unit 16 is configured as a storage device including a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like.

The configuration of the client terminal 1 according to the present embodiment has been described in detail above. Note that the configuration of the client terminal 1 according to the present embodiment is not limited to the example shown in FIG. 2. For example, the client terminal 1 may be configured without all or some of the local speech recognition unit 101, the local text analysis unit 102, and the local final output determination unit 103.

Although the present embodiment is described as an information processing system including the client terminal 1 and the server 2, it may also be realized by a single information processing apparatus that has the components described with reference to FIGS. 2 and 3 as a client module and a server module. Alternatively, the client terminal 1 may be configured to have functions equivalent to those of the components of the control unit 20 of the server 2 described with reference to FIG. 3 (the speech recognition unit 201, the text analysis unit 202, and the final output determination unit 203).

<2-2. Configuration of server>
FIG. 3 is a block diagram illustrating an example of the configuration of the server 2 according to the present embodiment. As shown in FIG. 3, the server 2 (information processing apparatus) includes a control unit 20, a communication unit 21, and a proofreading utterance DB (database) 22.

The control unit 20 functions as an arithmetic processing device and a control device, and controls overall operation within the server 2 according to various programs. The control unit 20 is realized by an electronic circuit such as a CPU (Central Processing Unit) or a microprocessor. The control unit 20 may also include a ROM (Read Only Memory) that stores programs to be used, calculation parameters, and the like, and a RAM (Random Access Memory) that temporarily stores parameters that change as appropriate.

The control unit 20 performs speech recognition processing, text analysis processing, and final output determination processing on the basis of the voice of the user utterance received from the client terminal 1, and performs control so that the processing result (a speech recognition result, a text analysis result, or proofreading information such as a proofreading result) is transmitted to the client terminal 1.

As shown in FIG. 3, the control unit 20 can also function as a speech recognition unit 201, a text analysis unit 202, and a final output determination unit 203.

The speech recognition unit 201 performs speech recognition on the audio signal of the user utterance transmitted from the client terminal 1, and converts the user utterance into text.

The text analysis unit 202 analyzes the character string converted into text by speech recognition. Specifically, the text analysis unit 202 refers to proofreading utterance data stored in advance in the proofreading utterance DB 22, and analyzes whether the character string is an utterance for plain character input (a normal utterance) or a proofreading utterance. The text analysis unit 202 outputs the likelihood that the utterance is a proofreading utterance and, if it is one, the proofreading target and the proofreading command. The likelihood is calculated as a score indicating confidence. The text analysis unit 202 may also output a plurality of candidates together with their scores. Furthermore, the text analysis unit 202 may take into account the context information (captured image and sensor information) at the time of the user utterance, transmitted from the client terminal 1.

Note that the analysis of proofreading information is not limited to the method using the proofreading utterance DB 22 generated in advance; for example, machine learning may be used to improve the accuracy of the analysis.

The final output determination unit 203 has a function of determining what is finally output. For example, the final output determination unit 203 determines whether the user utterance is a normal utterance or a proofreading utterance on the basis of a specific keyword extracted by speech recognition (for example, “proofreading mode” or “switch”) or the text analysis result. When there are a plurality of analysis results, the final output determination unit 203 may decide which analysis result to use by referring to the score indicating the confidence of each candidate.

When the utterance is determined to be a normal utterance, the final output determination unit 203 transmits the recognized character string from the communication unit 21 to the client terminal 1. When the utterance is determined to be a proofreading utterance, the final output determination unit 203 processes the proofreading target on the basis of the proofreading command that was analyzed by the text analysis unit 202 and finally selected, and transmits the proofreading result, as proofreading information, from the communication unit 21 to the client terminal 1.

In addition, the final output determination unit 203 may analyze the image of the user's motion captured by the imaging unit 12 and transmitted from the client terminal 1 as context information, detect a body motion registered in advance, and switch between the normal input mode and the sentence proofreading mode. Alternatively, the final output determination unit 203 may analyze the sensor information detected by the sensor 13 and transmitted from the client terminal 1 as context information, detect a motion registered in advance (for example, shaking the screen or touching the screen), and switch between the normal input mode and the sentence proofreading mode.

The final output determination unit 203 can also determine whether an utterance is a proofreading utterance by combining the text analysis result of the user utterance with the captured image or sensor information. For example, when the user says “delete everything from here on” while pointing at characters displayed on the screen, the final output determination unit 203 determines, from the analysis result of the utterance content and the motion of pointing at the characters on the screen, that the sentence proofreading mode applies.
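The "delete everything from here on" example combines two signals: the analyzed utterance and a pointing motion taken from context information. A minimal sketch of that combination follows. The `Context` class, its `pointed_index` field, and the exact-match phrase test are assumptions for illustration; the disclosure does not specify how the pointed position is represented or how the phrase is matched.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrase treated as a "delete from the pointed position" command.
DELETE_FROM_HERE = "delete everything from here on"


@dataclass
class Context:
    # Character position the user is pointing at or touching, if any
    # (assumed to have been recognized from the captured image or sensor data).
    pointed_index: Optional[int] = None


def handle(screen_text: str, utterance: str, context: Context) -> str:
    """Treat the utterance as proofreading only when both signals agree."""
    says_delete = utterance.strip().lower() == DELETE_FROM_HERE
    if says_delete and context.pointed_index is not None:
        # Sentence proofreading mode: drop everything from the pointed position.
        return screen_text[:context.pointed_index]
    # Otherwise treat it as a normal utterance and append the recognized text.
    return screen_text + utterance
```

Requiring both signals is one conservative policy; a scored combination of the text analysis result and the recognized motion would be another way to realize the same idea.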

ここで、本実施形態によるユーザ発話例と各発話の最終出力例について、図４〜図９を参照して具体的に説明する。 Here, a user utterance example and a final output example of each utterance according to the present embodiment will be specifically described with reference to FIGS.

(A) Designation of character type
FIG. 4 is a diagram showing specific examples in which the type of character to be input is designated by voice. For example, as shown in the first row of FIG. 4, when the user utterance is 「かたかなのとうきょうたわー」 ("Tokyo Tower in katakana"), the speech recognition unit 201 outputs a character string such as 「カタカナの東京タワー」 by speech recognition. An existing speech recognition system may output the recognized character string 「カタカナの東京タワー」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data. From the speech recognition result, 「カタカナの」 is analyzed as a proofreading designation of the character type "katakana", and 「東京タワー」 is analyzed as the proofreading target. As a result, as shown in the first row of FIG. 4, the final output is 「トウキョウタワー」, written in katakana.

As shown in the second row of FIG. 4, when the user utterance is 「えむだけおおもじのまいける」 ("Michael with only the M in uppercase"), the speech recognition unit 201 outputs a character string such as 「エムだけ大文字のマイケル」 by speech recognition. An existing speech recognition system may output the recognized character string 「エムだけ大文字のマイケル」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data. From the speech recognition result, 「エムだけ大文字の」 is analyzed as a proofreading designation of the character type "uppercase alphabet", and 「マイケル」 is analyzed as the proofreading target. As a result, as shown in the second row of FIG. 4, the final output is "Michael".
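The character-type designation in the first row of FIG. 4 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the analysis works on the kana reading of the utterance rather than on the kanji-containing recognition result (recovering a reading from 「東京タワー」 would need a dictionary), and the names `to_katakana`, `CHAR_TYPE_COMMANDS`, and `apply_char_type` are hypothetical.

```python
# Illustrative sketch only; the command table and function names are assumptions.
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KATAKANA_OFFSET = 0x60  # U+3041 (small hiragana a) + 0x60 = U+30A1 (small katakana a)


def to_katakana(reading: str) -> str:
    """Convert hiragana to katakana; other characters (such as the long-vowel mark) pass through."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in reading
    )


# Hypothetical table: spoken designation prefix -> conversion applied to the target.
CHAR_TYPE_COMMANDS = {"かたかなの": to_katakana}


def apply_char_type(reading: str) -> str:
    """Split the reading into designation and target, then convert the target."""
    for prefix, convert in CHAR_TYPE_COMMANDS.items():
        if reading.startswith(prefix):
            return convert(reading[len(prefix):])
    return reading  # no designation found: treat as a normal utterance


print(apply_char_type("かたかなのとうきょうたわー"))
```

A prefix table keeps the designation ("katakana") separate from the target, which mirrors the split into proofreading designation and proofreading target described above.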

(B) Use of sounds and transcriptions
FIG. 5 is a diagram showing specific examples in which the kanji conversion of input characters is designated by voice. For example, as shown in the first row of FIG. 5, when the user utterance is 「ゆうきゅうきゅうかのゆうにこどものこ」 ("the 有 of 有給休暇 and the 子 of 子供"), the speech recognition unit 201 outputs a character string such as 「有給休暇の有に子供の子」 by speech recognition. An existing speech recognition system may output the recognized character string 「有給休暇の有に子供の子」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data. From the speech recognition result, 「有給休暇の有」 is analyzed as a proofreading designation of kanji and 「有」 as the proofreading target; likewise, 「子供の子」 is analyzed as a proofreading designation of kanji and 「子」 as the proofreading target. As a result, as shown in the first row of FIG. 5, the final output is 「有子」, written in the kanji the user wants. Even when other kanji candidates correspond to the reading "Yuko", the name can be input in the kanji the user wants.

As shown in the second row of FIG. 5, when the user utterance is 「しらとりのとりはとっとりのとり」 ("the tori of Shiratori is the tori of Tottori"), the speech recognition unit 201 outputs a character string such as 「白鳥の鳥は鳥取の取」 by speech recognition. An existing speech recognition system may output the recognized character string 「白鳥の鳥は鳥取の取」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data. From the speech recognition result, 「白鳥の鳥は鳥取の取」 is analyzed as a proofreading designation of kanji and 「白鳥」 as the proofreading target. As a result, as shown in the second row of FIG. 5, the final output is 「白取」, written in the kanji the user wants. Even when other kanji candidates correspond to the reading "Shiratori", the name can be input in the kanji the user wants.
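The two kanji designations in FIG. 5 can be illustrated with a small pattern-matching sketch. The regular expressions, the set of particle characters excluded from words, and the function name `resolve_kanji` are assumptions made for this example; the disclosure itself relies on proofreading utterance data rather than fixed patterns.

```python
import re

# "<word>の<char>" : pick one character out of a well-known word (assumed pattern).
PICK = re.compile(r"(?P<word>[^のには]+)の(?P<char>[^のには])")
# "<w1>の<c1>は<w2>の<c2>" : in w1, replace c1 with c2 taken from w2 (assumed pattern).
SUBSTITUTE = re.compile(r"(?P<w1>.+?)の(?P<c1>.)は(?P<w2>.+?)の(?P<c2>.)")


def resolve_kanji(recognized: str) -> str:
    """Turn a spoken kanji designation into the intended characters."""
    m = SUBSTITUTE.fullmatch(recognized)
    if m and m["c1"] in m["w1"] and m["c2"] in m["w2"]:
        return m["w1"].replace(m["c1"], m["c2"], 1)
    picked = [p["char"] for p in PICK.finditer(recognized) if p["char"] in p["word"]]
    # No designation recognized: fall back to the recognized string unchanged.
    return "".join(picked) if picked else recognized


print(resolve_kanji("有給休暇の有に子供の子"))
print(resolve_kanji("白鳥の鳥は鳥取の取"))
```

Checking that the named character actually occurs in the reference word is a cheap guard against treating an ordinary sentence containing 「の」 as a designation.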

(C) Proofreading location and operation command
The range of the proofreading target and the content of the proofreading can also be commanded by voice. For example, user utterances and analysis results of proofreading information such as the following can be given.

An example will now be described with reference to FIGS. 6 and 7. FIG. 6 is a diagram showing an example of a user utterance and the analysis result of the proofreading information according to the present embodiment. As shown in FIG. 6, when the user utterance is 「かきあんけんってところからしたをぜんぶけしてけいぞくけんとうっていれて」 ("delete everything from 下記案件 down and put in 継続検討"), the speech recognition unit 201 outputs a character string such as 「下記案件って所から下を全部消して継続検討っていれて」 by speech recognition. An existing speech recognition system may output the recognized character string 「下記案件って所から下を全部消して継続検討っていれて」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data, and the speech recognition result is analyzed as "proofreading designation: correct to 「継続検討」" and "proofreading target: from 「下記案件」 onward".

FIG. 7 is a diagram showing an example of the final output result for the user utterance shown in FIG. 6. As shown in FIG. 7, a screen 31, in which the portion of the input text displayed on the screen 30 from 「下記案件」 onward has been deleted and replaced with 「継続検討」, is output as the final output result.
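Once the analysis has produced "correct to 「継続検討」" and "target: from 「下記案件」 onward", applying it to the displayed text is a simple string operation. The sketch below assumes the target is located by its first occurrence; the function name `replace_from_anchor` and the sample text are made up for illustration and are not the content of FIG. 7.

```python
def replace_from_anchor(text: str, anchor: str, replacement: str) -> str:
    """Delete everything from the first occurrence of `anchor` onward and put `replacement` there."""
    start = text.find(anchor)
    if start < 0:
        return text  # anchor not on screen: leave the text untouched
    return text[:start] + replacement


# Hypothetical screen content before the proofreading utterance.
before = "会議メモ\n下記案件\n・案件A\n・案件B"
print(replace_from_anchor(before, "下記案件", "継続検討"))
```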

(D) Use of context information
Next, an example of proofreading processing that takes context information into account will be described. In the present embodiment, text analysis and proofreading analysis can be performed in consideration of a captured image or sensor information acquired at the time of the user utterance.

Here, an example using sensor information detected by a touch sensor provided in the display unit 15 will be described with reference to FIGS. 8 and 9. FIG. 8 is a diagram showing an example of a user utterance and the analysis result of the proofreading information in consideration of context information according to the present embodiment. As shown in FIG. 8, when the user utterance is 「ここをごぜんにして」 ("make this 午前 (a.m.)"), the speech recognition unit 201 outputs a character string such as 「ここを午前にして」 by speech recognition. In addition, sensor information indicating the position coordinates (x, y) on the screen detected by the touch sensor of the display unit 15 at the time of the user utterance is acquired.

In this case, an existing speech recognition system may output the recognized character string 「ここを午前にして」 as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data and the touch sensor information, and the result is analyzed as "proofreading designation: correct to 「午前」" and "proofreading target: coordinates (x, y)".

FIG. 9 is a diagram showing an example of the final output result for the user utterance shown in FIG. 8. As shown in FIG. 9, a screen 33, in which the characters 「午後」 (p.m.) corresponding to the coordinates (x, y) touched by the user in the input text displayed on the screen 32 have been deleted and corrected to 「午前」 (a.m.), is output as the final output result.
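Resolving "proofreading target: coordinates (x, y)" requires mapping the touched position to a span of displayed text. The following sketch assumes the display layer can report a bounding box per text span; the `Span` type, the box coordinates, and the sample sentence are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A run of displayed text with its on-screen bounding box (assumed layout data)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def replace_at(spans: List[Span], x: float, y: float, replacement: str) -> str:
    """Replace the span whose box contains (x, y); all other spans are kept."""
    parts = []
    for s in spans:
        hit = s.x0 <= x < s.x1 and s.y0 <= y < s.y1
        parts.append(replacement if hit else s.text)
    return "".join(parts)


# Hypothetical screen: the user touches the middle span while saying "make this a.m.".
screen = [
    Span("集合は", 0, 0, 30, 10),
    Span("午後", 30, 0, 50, 10),
    Span("10時", 50, 0, 80, 10),
]
print(replace_at(screen, 40, 5, "午前"))
```

The same lookup would work with a gaze position in place of a touch position, since only the source of (x, y) changes.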

In the example described above, the coordinate position on the screen is detected by the touch sensor, but the present embodiment is not limited to this and can be realized in the same way as long as the user's line of sight can be captured accurately. That is, for example, the position on the screen at which the user is gazing when uttering 「ここを午前にして」 is detected by a line-of-sight sensor (gaze tracker) and taken into account as context information.

Furthermore, if a point, range, or region of interest on the screen can be identified from the user's line of sight, the candidate desired by the user can be narrowed down automatically from among a plurality of candidate options displayed on the screen.

In the present embodiment, when a position on the screen is designated with an expression such as 「ここ」 ("here") or 「この辺」 ("around here"), feedback may be given to the user, for example by changing the background color of the character string portion corresponding to the coordinates (x, y), so that the point or range of interest can be confirmed. The user can answer verbally, for example 「そこでＯＫ」 ("that is OK") or 「違う」 ("no").

(E) Use of keywords
Next, an example of proofreading processing in the case where a specific keyword is extracted from a speech-recognized user utterance will be described. When the user utterance is "A, as in Adam. D, as in Denver. T, as in Thomas.", the speech recognition unit 201 outputs a character string such as "A, as in Adam. D, as in Denver. T, as in Thomas." by speech recognition. An existing speech recognition system may output the recognized character string "A, as in Adam. D, as in Denver. T, as in Thomas." as it is. In the present embodiment, by contrast, text analysis is performed on the recognized character string with reference to the proofreading utterance data, and when keywords used to convey the spelling of letters, such as "Adam", "Denver", and "Thomas", are extracted from the speech recognition result, the result is analyzed as "proofreading designation: alphabet" and "proofreading target: "A" "D" "T"". As a result, the final output is "ADT", in the spelling the user wants.
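The spelling-keyword case can be sketched with a regular expression. This is an illustrative assumption: the pattern `"<letter>, as in <word>"` and the function name `spell_out` are not taken from the disclosure, which extracts keywords by referring to proofreading utterance data.

```python
import re

# Assumed pattern for spelling phrases such as "A, as in Adam".
SPELLING = re.compile(r"\b([A-Za-z]), as in ([A-Za-z]+)")


def spell_out(recognized: str) -> str:
    """Collect the spelled letters, keeping only those that match the keyword's initial."""
    letters = [
        letter.upper()
        for letter, word in SPELLING.findall(recognized)
        if word[0].upper() == letter.upper()
    ]
    # No spelling phrase found: treat the input as a normal utterance.
    return "".join(letters) if letters else recognized


print(spell_out("A, as in Adam. D, as in Denver. T, as in Thomas."))
```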

The communication unit 21 connects to an external apparatus and transmits and receives data. For example, the communication unit 21 receives voice information of a user utterance and context information from the client terminal 1, and transmits the speech recognition processing result, the text analysis processing result, or the final output determination processing result described above to the client terminal 1.

The proofreading utterance DB 22 is a storage unit that stores a large amount of proofreading utterance data collected in advance. It is configured by a storage apparatus including a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The proofreading utterance data includes, for example, keywords and example sentences used in proofreading utterances.

<<3. Operation processing>>
Next, the operation processing of the information processing system according to the present embodiment will be described with reference to FIG. 10. FIG. 10 is a flowchart showing the operation processing of the information processing system according to the present embodiment. The following processing may be performed by at least one of the control unit 10 of the client terminal 1 and the control unit 20 of the server 2.

As shown in FIG. 10, first, a user utterance (voice information) is acquired (step S100), and speech recognition is performed on the user utterance (step S103).

Next, text analysis is performed on the character string output by the speech recognition (step S106). Specifically, with reference to the proofreading utterance data, the likelihood that the character string is a proofreading utterance is analyzed, together with the proofreading information in the case where it is a proofreading utterance. Context information acquired at the time of the user utterance may also be used.

Next, the final output is determined based on the text analysis result (step S109). Here too, context information acquired at the time of the user utterance may be used.

When the final output determination finds the utterance to be a normal utterance, the character string of the speech recognition result is output as it is (step S112).

When the final output determination finds the utterance to be a proofreading utterance, sentence proofreading is performed and the proofreading result is output (step S115).
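The flow of steps S100 to S115 can be summarized in code. The sketch below is a hedged outline under stated assumptions: the recognizer and analyzer are passed in as plain callables, the `Analysis` type and `process_utterance` function are hypothetical names, and the toy analyzer stands in for analysis against proofreading utterance data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Analysis:
    """Result of text analysis (S106); `apply` is the edit to run on the current text."""
    is_proofreading: bool
    apply: Optional[Callable[[str], str]] = None


def process_utterance(
    audio: Any,
    current_text: str,
    recognize: Callable[[Any], str],
    analyze: Callable[[str, Any], Analysis],
    context: Any = None,
) -> Tuple[str, str]:
    recognized = recognize(audio)            # S103: speech recognition
    analysis = analyze(recognized, context)  # S106: text analysis (context optional)
    # S109: final output determination
    if analysis.is_proofreading and analysis.apply is not None:
        return ("proofread", analysis.apply(current_text))  # S115
    return ("normal", recognized)                           # S112


# Toy stand-ins: the "audio" is already text, and one phrase is treated as a command.
def toy_analyze(recognized: str, context: Any) -> Analysis:
    if recognized == "delete everything":
        return Analysis(True, lambda text: "")
    return Analysis(False)


print(process_utterance("hello", "abc", lambda a: a, toy_analyze))
print(process_utterance("delete everything", "abc", lambda a: a, toy_analyze))
```

Passing the stages in as callables reflects the statement that the processing may run on the client terminal 1, the server 2, or both.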

The operation processing of the information processing system according to the present embodiment has been described above.

<<4. Other system configurations>>
The configuration of the information processing system according to the present embodiment is not limited to the example shown in FIG. 1. For example, as shown in FIG. 11, it may be a system configuration including an edge server 4 that enables distributed processing. FIG. 11 is a diagram showing another system configuration according to the present embodiment. As shown in FIG. 11, a configuration including the client terminal 1, the server 2, and the edge server 4 is conceivable as another system configuration.

A configuration example of the edge server 4 according to the present embodiment is shown in FIG. 12. As shown in FIG. 12, the edge server 4 includes a control unit 40, a communication unit 41, and an edge-side proofreading utterance DB 42. The control unit 40 also functions as an edge-side speech recognition unit 401, an edge-side text analysis unit 402, and an edge-side final output determination unit 403. The edge-side speech recognition unit 401 is a subset of the speech recognition unit 201 of the server 2 (hereinafter referred to as an external subset), the edge-side text analysis unit 402 is an external subset of the text analysis unit 202, and the edge-side final output determination unit 403 is an external subset of the final output determination unit 203.

The edge server 4 is a medium-scale processing server compared with the server 2, but it is located close to the client terminal 1 in terms of communication distance. It can therefore provide higher accuracy than the client terminal 1 while shortening communication delay.

When the client terminal 1 cannot perform sufficient processing with its own subset, or when an error occurs, it may transmit data to the edge server 4 to request processing, and receive and use the processing result from the edge server 4. Alternatively, the client terminal 1 may transmit data to the edge server 4 and the server 2 to request processing while also performing processing with its own subset, and may then select the data to use by waiting a predetermined time for the processing results from the edge server 4 and the server 2, or by referring to a score indicating the certainty factor of each processing result.
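Selecting among the results from the client's own subset, the edge server 4, and the server 2 by certainty score can be sketched as below. The `Result` type, the `min_score` threshold, and the sample scores are assumptions for illustration; the disclosure only states that a score indicating the certainty factor may be referred to.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    source: str   # "client", "edge", or "server"
    output: str   # processing result text
    score: float  # certainty factor reported with the result


def select_result(results: Sequence[Result], min_score: float = 0.0) -> Optional[Result]:
    """Pick the highest-scoring result among those that arrived before the wait expired."""
    usable = [r for r in results if r.score >= min_score]
    return max(usable, key=lambda r: r.score) if usable else None


# Hypothetical results that arrived within the predetermined waiting time.
arrived = [
    Result("client", "ADT", 0.55),
    Result("edge", "ADT", 0.80),
]
best = select_result(arrived, min_score=0.6)
print(best.source if best else "no usable result")
```

A result that has not arrived within the waiting time is simply absent from the list, so the same function covers both the timeout and the score-based selection.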

<<5. Summary>>
As described above, the information processing system according to the present embodiment makes it possible to proofread sentences by voice input.

The preferred embodiments of the present disclosure have been described above in detail with reference to the accompanying drawings, but the present technology is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure could conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally also belong to the technical scope of the present disclosure.

For example, a computer program can be created for causing hardware such as the CPU, ROM, and RAM built into the client terminal 1 or the server 2 described above to exhibit the functions of the client terminal 1 or the server 2. A computer-readable storage medium storing the computer program is also provided.

The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure can exhibit, together with or instead of the above effects, other effects that are apparent to those skilled in the art from the description of this specification.

Note that the present technology can also take the following configurations.
(1)
An information processing apparatus including:
a transmission unit that transmits voice information including a sentence proofreading command and a proofreading target; and
a reception unit that receives a processing result based on the proofreading command and the proofreading target.
(2)
The information processing apparatus according to (1), wherein the voice information is collected user voice data.
(3)
The information processing apparatus according to (1), wherein the voice information is feature amount data extracted from collected user voice data.
(4)
The information processing apparatus according to (1), wherein the voice information is data indicating a proofreading command and a proofreading target recognized from collected user voice data.
(5)
The information processing apparatus according to any one of (1) to (4), wherein
the transmission unit transmits, together with the voice information, context information at the time of voice input, and
the reception unit receives a processing result based on the proofreading command, the proofreading target, and the context information.
(6)
The information processing apparatus according to (5), wherein the context information is sensor information obtained by detecting a user's motion.
(7)
The information processing apparatus according to (5), wherein the context information is feature amount data extracted from sensor information obtained by detecting a user's motion.
(8)
The information processing apparatus according to (5), wherein the context information is data indicating a result recognized from sensor information obtained by detecting a user's motion.
(9)
The information processing apparatus according to any one of (1) to (8), wherein the processing result received by the reception unit includes at least one of a speech recognition result of the transmitted voice information, a text analysis result, or proofreading information based on the proofreading command and the proofreading target included in the voice information.
(10)
The information processing apparatus according to (9), wherein the processing result includes data indicating a certainty factor of the processing result.
(11)
The information processing apparatus according to (9) or (10), wherein the proofreading information includes a proofreading result obtained by processing the proofreading target based on a finally determined proofreading command.
(12)
An information processing apparatus including:
a reception unit that receives voice information including a sentence proofreading command and a proofreading target; and
a transmission unit that transmits a processing result based on the proofreading command and the proofreading target.
(13)
The information processing apparatus according to (12), wherein the processing result transmitted by the transmission unit includes at least one of a speech recognition result of the received voice information, a text analysis result, or proofreading information based on the proofreading command and the proofreading target included in the voice information.
(14)
The information processing apparatus according to (13), wherein the processing result includes data indicating a certainty factor of the processing result.
(15)
The information processing apparatus according to (13) or (14), wherein the proofreading information includes a proofreading result obtained by processing the proofreading target based on a finally determined proofreading command.
(16)
The information processing apparatus according to any one of (12) to (15), wherein
the reception unit receives, together with the voice information, context information at the time of voice input, and
the transmission unit transmits a processing result based on the proofreading command, the proofreading target, and the context information.
(17)
An information processing method including, by a processor:
transmitting voice information including a sentence proofreading command and a proofreading target; and
receiving an analysis result based on the proofreading command and the proofreading target.
(18)
An information processing method including, by a processor:
receiving voice information including a sentence proofreading command and a proofreading target; and
transmitting an analysis result based on the proofreading command and the proofreading target.

Description of symbols
1 Client terminal
10 Control unit
101 Local speech recognition unit
102 Local text analysis unit
103 Local final output determination unit
11 Voice input unit
12 Imaging unit
13 Sensor
14 Communication unit
15 Display unit
16 Storage unit
2 Server
20 Control unit
201 Speech recognition unit
202 Text analysis unit
203 Final output determination unit
21 Communication unit
22 Proofreading utterance DB
3 Network
4 Edge server
40 Control unit
401 Edge-side speech recognition unit
402 Edge-side text analysis unit
403 Edge-side final output determination unit
41 Communication unit
42 Edge-side proofreading utterance DB

Claims

1. An information processing apparatus comprising:
a transmission unit that transmits voice information including a sentence proofreading command and a proofreading target; and
a reception unit that receives a processing result based on the proofreading command and the proofreading target.

2. The information processing apparatus according to claim 1, wherein the voice information is collected user voice data.

3. The information processing apparatus according to claim 1, wherein the voice information is feature amount data extracted from collected user voice data.

4. The information processing apparatus according to claim 1, wherein the voice information is data indicating a proofreading command and a proofreading target recognized from collected user voice data.

5. The information processing apparatus according to claim 1, wherein
the transmission unit transmits, together with the voice information, context information at the time of voice input, and
the reception unit receives a processing result based on the proofreading command, the proofreading target, and the context information.

6. The information processing apparatus according to claim 5, wherein the context information is sensor information obtained by detecting a user's motion.

7. The information processing apparatus according to claim 5, wherein the context information is feature amount data extracted from sensor information obtained by detecting a user's motion.

8. The information processing apparatus according to claim 5, wherein the context information is data indicating a result recognized from sensor information obtained by detecting a user's motion.

9. The information processing apparatus according to claim 1, wherein the processing result received by the reception unit includes at least one of a speech recognition result of the transmitted voice information, a text analysis result, or proofreading information based on the proofreading command and the proofreading target included in the voice information.

10. The information processing apparatus according to claim 9, wherein the processing result includes data indicating a certainty factor of the processing result.

11. The information processing apparatus according to claim 9, wherein the proofreading information includes a proofreading result obtained by processing the proofreading target based on a finally determined proofreading command.

12. An information processing apparatus comprising:
a reception unit that receives voice information including a sentence proofreading command and a proofreading target; and
a transmission unit that transmits a processing result based on the proofreading command and the proofreading target.

13. The information processing apparatus according to claim 12, wherein the processing result transmitted by the transmission unit includes at least one of a speech recognition result of the received voice information, a text analysis result, or proofreading information based on the proofreading command and the proofreading target included in the voice information.

14. The information processing apparatus according to claim 13, wherein the processing result includes data indicating a certainty factor of the processing result.

15. The information processing apparatus according to claim 13, wherein the proofreading information includes a proofreading result obtained by processing the proofreading target based on a finally determined proofreading command.

16. The information processing apparatus according to claim 12, wherein
the reception unit receives, together with the voice information, context information at the time of voice input, and
the transmission unit transmits a processing result based on the proofreading command, the proofreading target, and the context information.

17. An information processing method comprising, by a processor:
transmitting voice information including a sentence proofreading command and a proofreading target; and
receiving an analysis result based on the proofreading command and the proofreading target.

18. An information processing method comprising, by a processor:
receiving voice information including a sentence proofreading command and a proofreading target; and
transmitting an analysis result based on the proofreading command and the proofreading target.