JP6229433B2 - Operation guidance server, operation guidance system, image forming apparatus, and program

Info

Publication number: JP6229433B2
Application number: JP2013225254A
Authority: JP
Inventors: 和也姉崎; 淳一長谷; 松原　賢士; 高橋　一誠; 博一久保田
Original assignee: Konica Minolta Inc
Current assignee: Konica Minolta Inc
Priority date: 2013-10-30
Filing date: 2013-10-30
Publication date: 2017-11-15
Anticipated expiration: 2033-10-30
Also published as: JP2015088890A

Description

The present invention relates to an operation guidance system and related technology.

In recent years, image forming apparatuses such as MFPs (Multi-Functional Peripherals) have gained more functions and greater sophistication, and operating them has become complicated. For this reason, a user may contact a support center about how to operate an image forming apparatus and ask a supporter (a person who assists users at the support center) for guidance on the operation.

In such a case, the user may receive operation guidance from the supporter based on the display image of the image forming apparatus that the user is operating. If confidential information of the user contained in the display image is displayed as-is on the supporter's terminal, that confidential information may leak.

In view of this problem, Patent Document 1 describes replacing the user's confidential information displayed on the display unit of the image forming apparatus with a dummy image before displaying it on the supporter's personal computer, thereby preventing the user's confidential information from being displayed as-is on the supporter side.

Patent Document 1: JP 2009-253651 A

Operation guidance between a user and a supporter is expected to involve not only guidance using display images but also guidance using voice.

However, when operation guidance is performed using both display images and voice, the conventional technique described above still conveys confidential information contained in the user's speech as-is to the supporter's terminal, so the user's confidential information may leak to the supporter.

An object of the present invention is therefore to provide a technique capable of avoiding leakage of confidential information contained in a user's voice during operation guidance between the user and a supporter.

To solve the above problem, the invention of claim 1 is a guidance server in an operation guidance system, comprising: image receiving means for receiving, from an image forming apparatus that is a user's operation target, first display image data, which is data of a first display image displayed on an operation unit of the image forming apparatus; image generating means for generating, when a secret word is included in the first display image data, first composite image data, which is data of a first composite image in which the secret word in the first display image is replaced with an alternative word corresponding to that secret word; image transmitting means for transmitting the first composite image data, as display data for a supporter terminal, to the supporter terminal used by a supporter, a person who assists the user, to provide operation guidance to the user; voice receiving means for receiving, from the image forming apparatus, user voice data including data of voice uttered by the user; voice recognition means for determining, by voice recognition processing on the user voice data, whether the secret word is included in the user voice data; voice generating means for generating, when the secret word is determined to be included in the user voice data, synthesized user voice data, which is data in which secret voice data (the voice data of the secret word within the user voice data) is replaced with alternative voice data (the voice data of the alternative word corresponding to that secret word); and voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data for the supporter terminal.
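
The substitution at the core of claim 1 can be illustrated with a minimal sketch. It assumes that screen text has already been extracted as a list of labels and that voice recognition yields word-level segments. The table `SECRET_TO_ALT`, the `Segment` type, and the `synthesize` stand-in are hypothetical names introduced only for illustration and are not taken from the publication.

```python
from dataclasses import dataclass

# Hypothetical table mapping each secret word to its alternative word.
SECRET_TO_ALT = {"Tanaka": "Destination A", "Suzuki": "Destination B"}


@dataclass
class Segment:
    word: str     # word returned by an (assumed) voice recognition step
    audio: bytes  # audio data covering that word


def synthesize(word: str) -> bytes:
    """Stand-in for machine voice generation; returns placeholder bytes."""
    return f"<tts:{word}>".encode("utf-8")


def mask_screen_text(labels: list[str]) -> list[str]:
    """Replace secret words found in the display image text."""
    return [SECRET_TO_ALT.get(label, label) for label in labels]


def mask_voice(segments: list[Segment]) -> list[Segment]:
    """Replace secret voice data with alternative voice data."""
    masked = []
    for seg in segments:
        alt = SECRET_TO_ALT.get(seg.word)
        if alt is None:
            masked.append(seg)  # not secret: pass through unchanged
        else:
            masked.append(Segment(alt, synthesize(alt)))
    return masked
```

Using one table for both the image path and the voice path keeps the word shown on the supporter terminal and the word heard by the supporter consistent with each other.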

The invention of claim 2 is the guidance server according to claim 1, wherein the user voice data is divided into a plurality of pieces of partial voice data; the voice generating means generates, when the voice recognition processing determines that the secret word is included in first voice data, which is one piece of partial voice data in the user voice data, first synthesized voice data in which the secret voice data in the first voice data is replaced with the alternative voice data; and the voice transmitting means transmits the first synthesized voice data to the supporter terminal.

The invention of claim 3 is the guidance server according to claim 2, wherein the first voice data includes voice uttered by the user while the first display image is displayed; the image receiving means receives, from the image forming apparatus after reception of the first display image data is complete, second display image data, which is data of a second display image displayed on the operation unit subsequent to the first display image; the image generating means generates, when a secret word is included in the second display image data, second composite image data, which is data of a second composite image in which the secret word in the second display image data is replaced with an alternative word corresponding to that secret word; and the image transmitting means does not permit transmission of the second composite image data once reception of the first voice data has started, and permits transmission of the second composite image data at or after the time when transmission of the first synthesized voice data generated from the first voice data is complete.

The invention of claim 4 is the guidance server according to claim 3, wherein, when the second display image data is received by the image receiving means within a first period, which is the period from the time reception of the first voice data starts to the time at which the time required to output the first synthesized voice data has elapsed after transmission of the first synthesized voice data is complete, the image transmitting means does not permit transmission of the second composite image data until the end of the first period and permits transmission of the second composite image data after the end of the first period.
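
The transmission control of claims 3 and 4 can be sketched as a small gate object. This is an illustrative sketch only: the class name `ImageSendGate`, its method names, and the use of plain floating-point timestamps in seconds are assumptions made here.

```python
from typing import Optional


class ImageSendGate:
    """Holds back the next composite image while user voice is in flight."""

    def __init__(self) -> None:
        self._hold_start: Optional[float] = None
        self._hold_end: Optional[float] = None

    def voice_reception_started(self, now: float) -> None:
        # Start of the "first period": reception of the first voice data begins.
        self._hold_start = now
        self._hold_end = None

    def synthesized_voice_sent(self, now: float, playback_seconds: float) -> None:
        # End of the "first period": transmission complete plus the time
        # the supporter terminal needs to output the synthesized voice.
        self._hold_end = now + playback_seconds

    def may_send_image(self, now: float) -> bool:
        if self._hold_start is None:
            return True   # no voice pending
        if self._hold_end is None:
            return False  # voice received but not yet transmitted
        return now >= self._hold_end
```

Passing `playback_seconds=0.0` gives the behavior of claim 3, where the image is released as soon as transmission of the synthesized voice is complete.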

The invention of claim 5 is the guidance server according to claim 1, further comprising storage means for storing voice data, wherein the voice generating means: starts generating a plurality of pieces of alternative voice data, which are voice data of alternative words corresponding to a plurality of secret words, from a predetermined time prior to generation of the synthesized user voice data, and stores the generated alternative voice data in the storage means; when the secret word is included in the user voice data and the alternative voice data corresponding to the secret word is not stored in the storage means, generates the alternative voice data by machine voice generation processing and generates the synthesized user voice data using the alternative voice data thus generated; and when the secret word is included in the user voice data and the alternative voice data corresponding to the secret word is already stored in the storage means, generates the synthesized user voice data using the alternative voice data stored in the storage means.
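
The two branches of claim 5 (stored versus not yet stored) amount to a cache with an on-demand fallback. The following is a minimal sketch under that reading. `AlternativeVoiceStore` and the injected `synthesize` callable are hypothetical names, and keeping on-demand results in the store for later reuse is a choice made here, not something claim 5 requires.

```python
from typing import Callable, Dict, Iterable


class AlternativeVoiceStore:
    """Storage for pre-generated alternative voice data."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # assumed machine voice generation
        self._store: Dict[str, bytes] = {}

    def pregenerate(self, alternative_words: Iterable[str]) -> None:
        # Runs ahead of time, before any synthesized user voice is needed.
        for word in alternative_words:
            if word not in self._store:
                self._store[word] = self._synthesize(word)

    def get(self, alternative_word: str) -> bytes:
        audio = self._store.get(alternative_word)
        if audio is None:
            # Not stored yet: generate on demand (the slower path).
            audio = self._synthesize(alternative_word)
            self._store[alternative_word] = audio
        return audio
```

In a real server `pregenerate` would presumably run in the background so that it does not block voice processing; that concurrency is omitted here for brevity.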

The invention of claim 6 is the guidance server according to claim 5, wherein the voice generating means starts generating the plurality of pieces of alternative voice data in response to the guidance server receiving a support request signal indicating that a request for operation guidance has been issued by the user.

The invention of claim 7 is the guidance server according to claim 6, wherein the plurality of secret words include at least one of a word or phrase indicating a transmission destination in a destination designation screen for scan image transmission of the image forming apparatus and a word or phrase indicating a transmission destination included in a destination designation screen for facsimile transmission of the image forming apparatus.

The invention of claim 8 is the guidance server according to claim 6, wherein the plurality of secret words include a word or phrase indicating file information displayed on an information display screen for files stored in a box of the image forming apparatus.

The invention of claim 9 is the guidance server according to claim 5, wherein the voice generating means preferentially generates, among the plurality of pieces of alternative voice data, the alternative voice data corresponding to secret words that may be included in a display image in the current operation mode of the image forming apparatus.

The invention of claim 10 is the guidance server according to claim 9, wherein the current operation mode of the image forming apparatus is one of a plurality of modes including a scan mode, a facsimile transmission mode, and a box mode.

The invention of claim 11 is the guidance server according to claim 1, further comprising storage means for storing voice data, wherein the voice generating means: starts generating a plurality of pieces of alternative voice data, which are voice data of alternative words corresponding to a plurality of secret words, when the first display image data is received by the image receiving means, and stores the generated alternative voice data in the storage means; when the secret word is included in the user voice data and the alternative voice data corresponding to the secret word is not stored in the storage means, generates the alternative voice data by machine voice generation processing and generates the synthesized user voice data using the alternative voice data thus generated; and when the secret word is included in the user voice data and the alternative voice data corresponding to the secret word is already stored in the storage means, generates the synthesized user voice data using the alternative voice data stored in the storage means.

The invention of claim 12 is the guidance server according to any one of claims 5 to 11, wherein the voice generating means generates the plurality of pieces of alternative voice data in a priority order based on the frequency of use of the plurality of secret words.
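
One way to combine the mode-based priority of claim 9 with the frequency-based priority of claim 12 is a two-level sort key. The sketch below is only an illustration; the function name `generation_order` and its parameters are assumptions, and the publication does not state how the two criteria are to be combined.

```python
from typing import Dict, List, Set


def generation_order(
    secret_words: List[str],
    usage_count: Dict[str, int],
    current_mode_words: Set[str],
) -> List[str]:
    """Order secret words for pre-generation of alternative voice data.

    Words that can appear on screens of the current operation mode come
    first; within each group, more frequently used words come first.
    """
    return sorted(
        secret_words,
        key=lambda w: (w not in current_mode_words, -usage_count.get(w, 0)),
    )
```

For example, with usage counts `{"a": 1, "b": 5, "c": 3, "d": 9}` and current-mode words `{"a", "c"}`, the words `["a", "b", "c", "d"]` are ordered as `["c", "a", "d", "b"]`.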

The invention of claim 13 is the guidance server according to claim 2, further comprising storage means for storing the alternative voice data used to generate the first synthesized voice data, wherein the voice generating means generates, when the voice recognition processing determines that the secret word is included in second voice data, which is partial voice data of a portion of the user voice data different from the first voice data, second synthesized voice data in which the secret voice data in the second voice data is replaced with the alternative voice data, using the alternative voice data stored in the storage means; and the voice transmitting means transmits the second synthesized voice data to the supporter terminal.

The invention of claim 14 is the guidance server according to claim 1, wherein the voice receiving means receives supporter voice data including data of voice uttered by the supporter; the voice recognition means determines, by voice recognition processing on the supporter voice data, whether one alternative word corresponding to any of one or more secret words is included in the supporter voice data; the voice generating means generates, when the one alternative word is included in the supporter voice data, synthesized supporter voice data in which second alternative voice data, which is the voice data of the one alternative word within the supporter voice data, is replaced with second secret voice data, which is the voice data of the secret word corresponding to the one alternative word; and the voice transmitting means transmits the synthesized supporter voice data to the image forming apparatus.

The invention of claim 15 is the guidance server according to claim 14, further comprising storage means for storing voice data, wherein the voice generating means stores in the storage means the secret voice data extracted from the user voice data when generating the synthesized user voice data from the user voice data, and generates the synthesized supporter voice data using the secret voice data stored in the storage means as the second secret voice data.
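
The reverse substitution of claims 14 and 15 can be sketched as follows. The sketch assumes a one-to-one mapping between secret words and alternative words and represents recognized speech as `(word, audio)` pairs. The class `ReverseSubstituter` and its methods are hypothetical names. As a simplification, an alternative word whose secret voice data has not been stored is passed through unchanged.

```python
from typing import Dict, List, Tuple

WordAudio = Tuple[str, bytes]  # (recognized word, audio data for that word)


class ReverseSubstituter:
    """Restores secret words in supporter speech before it reaches the user."""

    def __init__(self, secret_to_alt: Dict[str, str]) -> None:
        # Invert the table; assumes each alternative word is unique.
        self._alt_to_secret = {alt: secret for secret, alt in secret_to_alt.items()}
        self._secret_audio: Dict[str, bytes] = {}

    def remember_secret_audio(self, secret_word: str, audio: bytes) -> None:
        # Called while masking user speech: keep the user's own utterance.
        self._secret_audio[secret_word] = audio

    def restore(self, segments: List[WordAudio]) -> List[WordAudio]:
        restored = []
        for word, audio in segments:
            secret = self._alt_to_secret.get(word)
            if secret is not None and secret in self._secret_audio:
                restored.append((secret, self._secret_audio[secret]))
            else:
                restored.append((word, audio))
        return restored
```

Reusing the stored user utterance avoids synthesizing the secret word again when the supporter refers to it by its alternative word.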

The invention of claim 16 is the guidance server according to any one of claims 2 to 4, wherein, when the user voice data contains a silent portion lasting a predetermined time or longer, the voice recognition means extracts, as the first voice data, partial voice data of the user voice data segmented so as to end at the point at which the silent state has lasted for the predetermined time.

The invention of claim 17 is the guidance server according to claim 16, wherein the image receiving means also receives, from the image forming apparatus, second display image data different from the first display image data; and when the second display image data is received by the image receiving means during voice recognition processing of the user voice data, the voice recognition means extracts, as the first voice data, partial voice data of the user voice data segmented so as to end at the time the second display image data is received.
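
The segmentation rules of claims 16 and 17 can be sketched on a frame-by-frame basis. The sketch assumes the user voice is available as a sequence of per-frame loudness levels. The function name `split_partial_voice`, the loudness threshold, and the representation of image arrivals as frame indices are all assumptions made for illustration. Segments containing only silence are not emitted, which is a simplification chosen here.

```python
from typing import Iterable, List


def split_partial_voice(
    levels: List[float],
    silence_level: float,
    min_silent_frames: int,
    image_arrivals: Iterable[int] = (),
) -> List[List[float]]:
    """Split per-frame loudness levels into pieces of partial voice data.

    A piece ends when silence has lasted min_silent_frames frames, or at a
    frame index at which new display image data was received.
    """
    cuts = set(image_arrivals)
    segments: List[List[float]] = []
    current: List[float] = []
    silent_run = 0
    has_voice = False
    for i, level in enumerate(levels):
        current.append(level)
        if level < silence_level:
            silent_run += 1
        else:
            silent_run = 0
            has_voice = True
        if has_voice and (silent_run >= min_silent_frames or i in cuts):
            segments.append(current)
            current, silent_run, has_voice = [], 0, False
    if current:
        segments.append(current)  # trailing, still-open piece
    return segments
```

Each returned piece can then be recognized, masked, and transmitted on its own, without waiting for the rest of the utterance.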

The invention of claim 18 is the guidance server according to claim 1, wherein the first display image is an image of an information display screen for a file stored in a box of the image forming apparatus; the secret word includes a word indicating at least one of the file name, creator, and date of the file and a heading of the file body; the image generating means generates the first composite image data in which that secret word is replaced with the alternative word; and the voice generating means generates, when that secret word is included in the user voice data, synthesized user voice data in which the secret voice data is replaced with the alternative voice data.

The invention of claim 19 is the guidance server according to claim 18, wherein the secret word includes a word indicating a heading of the file body; and the image generating means generates the first composite image data in which that secret word is replaced with the alternative word and in which the portion of the file body other than the heading is converted into an image that prevents it from being read.

The invention of claim 20 is a program for causing a computer built into a guidance server in an operation guidance system to execute the steps of: a) receiving, from an image forming apparatus that is a user's operation target, first display image data, which is data of a first display image displayed on an operation unit of the image forming apparatus; b) generating, when a secret word is included in the first display image data, first composite image data, which is data of a first composite image in which the secret word in the first display image is replaced with an alternative word corresponding to that secret word; c) transmitting the first composite image data, as display data for a supporter terminal, to the supporter terminal used by a supporter, a person who assists the user, to provide guidance to the user; d) receiving, from the image forming apparatus, user voice data including data of voice uttered by the user; e) determining, by voice recognition processing on the user voice data, whether the secret word is included in the user voice data; f) generating, when the secret word is determined to be included in the user voice data, synthesized user voice data, which is data in which secret voice data (the voice data of the secret word within the user voice data) is replaced with alternative voice data (the voice data of the alternative word corresponding to that secret word); and g) transmitting the synthesized user voice data to the supporter terminal as voice output data for the supporter terminal.

The invention of claim 21 is an operation guidance system comprising: an image forming apparatus that is a user's operation target; a supporter terminal used by a supporter, a person who assists the user, to provide the user with operation guidance for the image forming apparatus; and a guidance server that mediates between the image forming apparatus and the supporter terminal, wherein the guidance server has: image receiving means for receiving, from the image forming apparatus, first display image data, which is data of a first display image displayed on an operation unit of the image forming apparatus; image generating means for generating, when a secret word is included in the first display image data, first composite image data, which is data of a first composite image in which the secret word in the first display image is replaced with an alternative word corresponding to that secret word; image transmitting means for transmitting the first composite image data to the supporter terminal as display data for the supporter terminal; voice receiving means for receiving, from the image forming apparatus, user voice data including data of voice uttered by the user; voice recognition means for determining, by voice recognition processing on the user voice data, whether the secret word is included in the user voice data; voice generating means for generating, when the secret word is determined to be included in the user voice data, synthesized user voice data, which is data in which secret voice data (the voice data of the secret word within the user voice data) is replaced with alternative voice data (the voice data of the alternative word corresponding to that secret word); and voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data for the supporter terminal.

The invention of claim 22 is an image forming apparatus in an operation guidance system, comprising: image acquiring means for acquiring first display image data, which is data of a first display image displayed on an operation unit of the image forming apparatus that is a user's operation target; image generating means for generating, when a secret word is included in the first display image data, first composite image data, which is data of a first composite image in which the secret word in the first display image is replaced with an alternative word corresponding to that secret word; image transmitting means for transmitting the first composite image data, as display data for a supporter terminal, to the supporter terminal used by a supporter, a person who assists the user, to provide guidance to the user; voice acquiring means for acquiring user voice data including data of voice uttered by the user; voice recognition means for determining, by voice recognition processing on the user voice data, whether the secret word is included in the user voice data; voice generating means for generating, when the secret word is determined to be included in the user voice data, synthesized user voice data, which is data in which secret voice data (the voice data of the secret word within the user voice data) is replaced with alternative voice data (the voice data of the alternative word corresponding to that secret word); and voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data for the supporter terminal.

The invention of claim 23 is a program for causing a computer built into an image forming apparatus in an operation guidance system to execute the steps of: a) acquiring first display image data, which is data of a first display image displayed on an operation unit of the image forming apparatus that is a user's operation target; b) generating, when a secret word is included in the first display image data, first composite image data, which is data of a first composite image in which the secret word in the first display image is replaced with an alternative word corresponding to that secret word; c) transmitting the first composite image data, as display data for a supporter terminal, to the supporter terminal used by a supporter, a person who assists the user, to provide guidance to the user; d) acquiring user voice data including data of voice uttered by the user; e) determining, by voice recognition processing on the user voice data, whether the secret word is included in the user voice data; f) generating, when the secret word is determined to be included in the user voice data, synthesized user voice data, which is data in which secret voice data (the voice data of the secret word within the user voice data) is replaced with alternative voice data (the voice data of the alternative word corresponding to that secret word); and g) transmitting the synthesized user voice data to the supporter terminal as voice output data for the supporter terminal.

According to the inventions of claims 1 to 23, leakage of confidential information contained in the user's voice can be avoided during operation guidance between the user and the supporter.

In particular, according to the invention of claim 2, for first voice data, which is one piece of partial voice data obtained by segmenting the user voice data, first synthesized voice data is generated in which the secret voice data in the first voice data is replaced with the alternative voice data, and the first synthesized voice data is transmitted to the supporter terminal. The first synthesized voice data corresponding to the first voice data can therefore be transmitted to the supporter terminal relatively early, without waiting for processing of the portion of the user voice data that follows the first voice data to finish. As a result, delay in transmitting voice data to the supporter terminal can be suppressed.

In particular, according to the invention of claim 3, transmission of the second composite image data is not permitted once reception of the first voice data has started and is permitted at or after the time transmission of the first synthesized voice data is complete, so the change from the first display image to the second display image on the supporter terminal takes place after the first synthesized voice data has been transmitted. It is therefore possible to suppress or avoid a situation in which voice uttered by the user while viewing the first display image is, owing to the delayed arrival of that voice, output on the supporter terminal while the second display image, which follows the first display image, is displayed.

In particular, according to the invention of claim 4, the change from the first display image to the second display image on the supporter terminal takes place at or after the time at which the time required to output the first synthesized voice data has elapsed since transmission of the first synthesized voice data was completed. This further suppresses or avoids the situation in which voice uttered by the user while viewing the first display image is, because of its arrival delay, output on the supporter terminal while the second display image that follows the first display image is being displayed.
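The two release conditions for the next image (transmission completed, or transmission completed plus the output time of the voice) can be sketched as a small gate object. This is a minimal sketch, not the claimed implementation: the class `ImageGate`, its method names, and the use of plain floating-point seconds for time are all assumptions made for illustration.

```python
from typing import Optional


class ImageGate:
    """Decides whether the next composite image may be sent (hypothetical helper)."""

    def __init__(self, wait_for_playback: bool = False) -> None:
        # False: release when voice transmission completes (claim 3 style).
        # True: additionally wait for the voice output time (claim 4 style).
        self.wait_for_playback = wait_for_playback
        self._voice_in_progress = False
        self._release_at: Optional[float] = None

    def on_voice_reception_started(self) -> None:
        """Reception of voice data began: hold back the next image."""
        self._voice_in_progress = True
        self._release_at = None

    def on_voice_transmission_completed(self, now: float, output_seconds: float) -> None:
        """The synthesized voice has been sent; compute when the hold ends."""
        extra = output_seconds if self.wait_for_playback else 0.0
        self._release_at = now + extra

    def may_send_image(self, now: float) -> bool:
        """Return True if the next composite image may be sent at time `now`."""
        if not self._voice_in_progress:
            return True
        if self._release_at is not None and now >= self._release_at:
            self._voice_in_progress = False  # hold is over
            return True
        return False
```

With `wait_for_playback=True`, an image that arrives while the supporter terminal is still playing the voice is held until playback should have finished, so the screen and the voice stay in step.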

In particular, according to the invention of claim 5, generation of a plurality of pieces of alternative voice data is started at a predetermined time prior to generation of the synthesized user voice data, and the generated alternative voice data is stored in the storage means. When the user voice data contains a secret word and the alternative voice data corresponding to that secret word is already stored in the storage means, the synthesized user voice data is generated using the stored alternative voice data and is transmitted to the supporter terminal. The time required to generate the synthesized voice data is therefore shorter than when generation of the alternative voice data always starts at the time the user voice data is determined to contain a secret word. As a result, delay in transmitting the synthesized voice data to the supporter terminal can be suppressed.

In particular, according to the invention of claim 13, in the voice processing for the second voice data, the second synthesized voice data is generated using the alternative voice data that was used to generate the first synthesized voice data and then stored, so the alternative voice data need not be generated again. The time required to generate the second synthesized voice data is therefore shortened, and delay in transmitting the second synthesized voice data to the supporter terminal can be suppressed.
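Pre-generating alternative voice data and reusing stored data both amount to a lookup store placed in front of the voice generator. The following Python sketch shows the idea; `AlternativeVoiceStore` and the `synthesize` callable are hypothetical names, and voice data is modeled simply as `bytes`.

```python
from typing import Callable, Dict, Iterable


class AlternativeVoiceStore:
    """Keeps generated alternative voice data so it is produced at most once."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize      # assumed text-to-voice function
        self._store: Dict[str, bytes] = {}

    def pregenerate(self, alternative_words: Iterable[str]) -> None:
        """Generate voice data ahead of time, before any user voice arrives."""
        for word in alternative_words:
            if word not in self._store:
                self._store[word] = self._synthesize(word)

    def get(self, alternative_word: str) -> bytes:
        """Return stored voice data, generating and storing it only on a miss."""
        if alternative_word not in self._store:
            self._store[alternative_word] = self._synthesize(alternative_word)
        return self._store[alternative_word]
```

A hit in `get` skips voice generation entirely, which is where the reduction in delay comes from in this sketch.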

In particular, according to the invention of claim 14, synthesized supporter voice data, in which an alternative word contained in the supporter voice data uttered by the supporter is replaced with second secret voice data corresponding to that alternative word, is transmitted to the user side, so the alternative word contained in the supporter voice data is not conveyed to the user. This avoids the confusion that would arise if an alternative word unknown to the user were output as voice on the image forming apparatus.

In particular, according to the invention of claim 15, when the synthesized user voice data is generated, voice data uttered by the user in the past is stored in the storage means as secret voice data, and when the synthesized supporter voice data is generated, the secret voice data already stored in the storage means is used as the second secret voice data. The second secret voice data therefore need not be generated again, which shortens the time required to generate the synthesized supporter voice data. As a result, delay in transmitting the synthesized voice data to the image forming apparatus can be suppressed.
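The reverse direction (supporter to user) can be sketched in the same spirit. The function below is an assumption-laden illustration: `restore_supporter_voice` is a hypothetical name, the recognizer output is modeled as `(word, clip)` pairs, clips are plain `bytes`, and the example name "Tanaka" in the usage is invented. When no stored secret clip exists, the sketch simply passes the supporter's clip through, which is a simplification and not something the specification states.

```python
from typing import Dict, List, Tuple

Clip = bytes  # assumption: a stretch of voice data, modeled as raw bytes


def restore_supporter_voice(
    recognized: List[Tuple[str, Clip]],
    alternative_to_secret: Dict[str, str],
    stored_secret_clips: Dict[str, Clip],
) -> List[Clip]:
    """Replace alternative words in supporter voice with stored secret clips.

    `recognized` pairs each recognized word with its original clip.
    `stored_secret_clips` holds clips the user uttered earlier, keyed by
    secret word, so nothing has to be generated again.
    """
    out: List[Clip] = []
    for word, clip in recognized:
        secret = alternative_to_secret.get(word)
        if secret is not None and secret in stored_secret_clips:
            out.append(stored_secret_clips[secret])  # user hears the real word
        else:
            out.append(clip)  # ordinary word, or no stored clip: pass through
    return out
```

For example, if the supporter says "destination A" and the user earlier said "Tanaka", the user side would receive the stored "Tanaka" clip in place of the supporter's "destination A" clip.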

A diagram showing the configuration of the operation guidance system.
A functional block diagram showing the schematic configuration of the MFP.
A functional block diagram showing the schematic configuration of the guidance server.
A functional block diagram showing the schematic configuration of the supporter terminal.
A diagram showing an overview of the operation of the operation guidance system.
A diagram showing timing relating to image data and voice data according to the first embodiment.
A diagram showing the operation of the operation guidance system.
A flowchart showing image processing and the like of image data.
A diagram showing the correspondence between secret words and alternative words (conversion dictionary).
A flowchart showing voice processing and the like of user voice data.
A diagram showing voice processing relating to generation of synthesized voice data.
A diagram showing timing relating to image data and voice data according to the first embodiment.
A diagram showing timing relating to image data and voice data according to the second embodiment.
A flowchart showing voice processing and the like of voice data according to the second embodiment.
A flowchart showing image processing and the like of image data according to the second embodiment.
A diagram showing timing relating to image data and voice data according to a modification of the second embodiment.
A flowchart showing voice processing and the like of voice data according to the modification of the second embodiment.
A diagram showing timing relating to image data and voice data according to the third embodiment.
A flowchart showing voice processing and the like of voice data according to the third embodiment.
A diagram showing the correspondence between secret words and alternative words (conversion dictionary).
A flowchart showing voice processing and the like of voice data according to the fourth embodiment.
A diagram showing timing relating to image data and voice data according to the fourth embodiment.
A diagram showing the operation of the operation guidance system according to the fourth embodiment.
A diagram showing the operation of the operation guidance system according to the fourth embodiment.
A diagram showing image data and composite image data according to the fifth embodiment.
A diagram showing image data and composite image data according to the fifth embodiment.
A diagram showing the correspondence between secret words and alternative words (conversion dictionary) according to the fifth embodiment.
A flowchart showing voice processing and the like of voice data from the supporter to the user according to a modification.
A diagram showing the operation of the operation guidance system according to the modification.
A diagram showing the operation of the operation guidance system according to the modification.
A diagram showing the operation of the operation guidance system according to the modification.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

<1-1. System overview>
FIG. 1 is a schematic diagram showing the configuration of the operation guidance system 1.

As shown in FIG. 1, the operation guidance system 1 includes an image forming apparatus 10, a guidance server 50, and a supporter terminal 70. Here, an MFP (Multi-Functional Peripheral) is used as an example of the image forming apparatus 10.

The elements 10, 50, and 70 of the operation guidance system 1 are connected so that they can communicate with one another via a network 108. The network 108 is configured by a LAN, a WAN, the Internet, or the like. The connection to the network 108 may be wired or wireless.

The operation guidance system 1 is a system that provides operation guidance for the MFP (image forming apparatus) 10. In response to a request from a user 101 of the MFP 10, a supporter 102 (a person who supports the user 101) provides operation guidance to the user 101 using the supporter terminal 70.

Between the MFP 10 and the supporter terminal 70, communication concerning image data 300 and user voice data 400 (data of voice uttered by the user) is performed via the guidance server 50. The guidance server 50 has a function of mediating between the MFP 10 and the supporter terminal 70 for the transmission of images and voice.

On the supporter terminal 70, a screen similar to the operation screen of the MFP 10 is displayed based on the image data 300 sent from the MFP 10 via the guidance server 50. The supporter 102 can thus provide operation guidance to the user 101 while viewing a screen similar to the one the user 101 is viewing. In addition, the voice of the user 101 input via the microphone 18 of the MFP 10 is transmitted to the supporter terminal 70 via the guidance server 50. The supporter 102 can thus provide operation guidance to the user 101 while listening to the voice of the user 101.

However, the guidance server 50 performs conversion processing on images (specifically, the operation guidance screen of the MFP 10). For example, as described later, when the operation guidance screen (of the MFP 10) transmitted from the MFP 10 contains a secret word 110 (confidential information), the guidance server 50 generates an image in which the secret word 110 has been converted into an appropriate alternative word 210. The guidance server 50 then transmits the converted image to the supporter terminal 70 as the image to be displayed there, and the supporter terminal 70 displays the converted image as the operation guidance screen. This makes it possible to avoid leakage of confidential information from the operation guidance image.

Furthermore, the guidance server 50 also performs conversion processing on voice. For example, as described later, when the user voice transmitted from the MFP 10 contains a secret word 110 (confidential information), the guidance server 50 generates voice data in which the secret word 110 has been converted into an appropriate alternative word 210, and transmits the converted voice data to the supporter terminal 70. The supporter terminal 70 outputs voice based on the converted voice data. This makes it possible to avoid leakage of confidential information from the voice of the user 101.
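The conversion of secret words into alternative words, applied to both the screen and the voice, comes down to a dictionary substitution. The Python sketch below applies it to plain text, standing in for text taken from the screen or recognized from the voice; the function name `mask_secret_words` and the example words are hypothetical, and matching longer secret words first is a design choice of this sketch rather than something stated in the specification.

```python
import re
from typing import Dict


def mask_secret_words(text: str, conversion: Dict[str, str]) -> str:
    """Replace every secret word in `text` with its alternative word.

    `conversion` maps secret word -> alternative word. Longer secret
    words are tried first so that a word containing a shorter secret
    word is replaced as a whole.
    """
    if not conversion:
        return text
    ordered = sorted(conversion, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    # A single pass avoids re-scanning text that was already substituted.
    return pattern.sub(lambda match: conversion[match.group(0)], text)
```

Doing the substitution in one pass means an alternative word that happens to contain another secret word is not converted a second time.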

Such an operation guidance system is described in detail below.

<1-2. Configuration of MFP>
FIG. 2 is a functional block diagram showing a schematic configuration of the MFP 10. The MFP 10 is an apparatus (also referred to as a multifunction machine) having a scan function, a copy function, a facsimile function, a box storage function, and the like. The MFP 10 has a plurality of operation modes (specifically, a copy mode, a scan mode, a facsimile transmission mode, and a box mode), and in each mode the operation of the corresponding function is executed.

As shown in the functional block diagram of FIG. 2, the MFP 10 includes an image reading unit 2, a print output unit 3, a communication unit 4, a storage unit 5, an operation unit 6, a controller 9, a speaker 17, a microphone 18, and the like, and realizes various functions by operating these units in combination. The MFP 10 is also referred to as an image forming apparatus or a print output apparatus.

The image reading unit 2 is a processing unit that optically reads (that is, scans) a document placed at a predetermined position on the MFP 10 and generates image data of the document (also referred to as a document image or a scanned image). The image reading unit 2 is also referred to as a scanning unit.

The print output unit 3 is an output unit that prints an image on various media such as paper based on data relating to a print target. The print output unit 3 is also referred to as an image forming unit that forms images on various media.

The communication unit 4 is a processing unit capable of facsimile communication via a public line or the like. The communication unit 4 is also capable of network communication via the network 108. This network communication uses various protocols such as TCP/IP (Transmission Control Protocol / Internet Protocol) and FTP (File Transfer Protocol). Using the network communication, the MFP 10 can exchange various data with a desired destination. The communication unit 4 includes a transmission unit 4a that transmits various data and a reception unit 4b that receives various data.

The storage unit 5 is configured by a storage device such as a hard disk drive (HDD). The storage unit 5 is provided with a plurality of boxes (folders) for the respective users, and electronic document data (document files) and the like are stored in each box.

The operation unit 6 includes an operation input unit 6a that receives input to the MFP 10 and a display unit 6b that displays and outputs various types of information. Specifically, the MFP 10 is provided with an operation panel 6c (see FIG. 1). The operation panel (touch screen) 6c has a touch panel 25 on its front side. The touch panel 25 is configured by embedding piezoelectric sensors or the like in a liquid crystal display panel, and can display various types of information and receive operation input from an operator. The touch panel 25 functions as part of the operation input unit 6a and also as part of the display unit 6b.

The controller 9 is a control device that is built into the MFP 10 and controls the MFP 10 as a whole. The controller 9 is configured as a computer system including a CPU and various semiconductor memories (RAM and ROM). By executing, on the CPU, a predetermined software program (hereinafter also simply referred to as a program) stored in the ROM (for example, an EEPROM (registered trademark)), the controller 9 implements various processing units. The program may be installed in the MFP 10 via a portable recording medium such as a USB memory, or via a network or the like.

As shown in FIG. 2, the controller 9 implements various processing units including a communication control unit 11, an input control unit 12, a display control unit 13, and a storage control unit 14.

The communication control unit 11 is a processing unit that controls communication with other devices (such as the guidance server 50). For example, the communication control unit 11 receives various commands from the guidance server 50 in cooperation with the communication unit 4 and the like.

The input control unit 12 is a control unit that controls operation input to the operation input unit 6a. For example, the input control unit 12 controls the operation of receiving operation input on the operation screen.

The display control unit 13 is a processing unit that controls display on the display unit 6b. For example, the display control unit 13 causes the display unit 6b to display an operation screen or the like for operating the MFP 10.

The storage control unit 14 is a processing unit that controls data storage processing and the like relating to a storage job.

The speaker 17 is a device that emits sound based on voice data. The speaker 17 may be built into the MFP 10 or may be attached to the MFP 10 via a terminal.

The microphone 18 is a device that converts the user's voice or the like into an electrical signal (analog signal). The microphone 18 may be built into the MFP 10 or may be attached to the MFP 10 via a terminal. The electrical signal (analog signal) is converted into digital data (voice data) by the controller 9.

<1-3. Configuration of guidance server>
FIG. 3 is a functional block diagram showing a schematic configuration of the guidance server 50.

The guidance server 50 is a device that mediates (or relays) between the MFP 10 and the supporter terminal 70 for the operation guidance of the MFP 10.

The guidance server 50 is configured as a computer system including a CPU and various semiconductor memories (RAM, ROM, and the like). A processing control unit 60 implements various processing units by executing, on the CPU, a predetermined software program stored in the ROM (for example, an EEPROM (registered trademark)). The program may be installed in the guidance server 50 via a portable recording medium such as a USB memory, or via a network or the like.

Specifically, by executing the program, the guidance server 50 implements various processing units including an image processing unit 60a, a voice processing unit 60b, and a communication control unit 67.

The image processing unit 60a is a processing unit that performs various types of image processing on received image data.

As shown in FIG. 3, the image processing unit 60a includes an image generation unit 61. The image generation unit 61 performs image composition processing (image generation processing) relating to the operation screen.

The voice processing unit 60b is a processing unit that performs various types of voice processing on received voice data.

As shown in FIG. 3, the voice processing unit 60b includes a voice recognition unit 64 and a voice generation unit 65. The voice recognition unit 64 performs voice recognition processing on the received voice data and the like. The voice generation unit 65 processes the received voice data and the like and performs voice synthesis processing (machine voice generation processing).

The communication control unit 67 is a processing unit that, in cooperation with a communication unit 54, controls the transmission and reception of data to and from a communication partner (for example, the MFP 10).

A storage unit 55 of the guidance server 50 is configured by a storage device such as a hard disk drive (HDD).

The guidance server 50 further includes the communication unit 54.

The communication unit 54 is capable of network communication via the network 108. This network communication uses various protocols such as TCP/IP (Transmission Control Protocol / Internet Protocol) and FTP (File Transfer Protocol). Using the network communication, the guidance server 50 can exchange various data with a desired destination. The communication unit 54 includes a transmission unit 54a that transmits various data and a reception unit 54b that receives various data. The transmission unit 54a includes an image transmission unit that transmits image data and a voice transmission unit that transmits voice data, and the reception unit 54b includes an image reception unit that receives image data and a voice reception unit that receives voice data.

<1-4. Configuration of supporter terminal>
FIG. 4 is a functional block diagram showing a schematic configuration of the supporter terminal 70.

The supporter terminal 70 is configured as a so-called personal computer. The supporter terminal 70 is an auxiliary device that is operated by the supporter and used for guiding the user.

The supporter terminal 70 includes an operation unit 76. The operation unit 76 includes an operation input unit 76a that receives operation input to the supporter terminal 70 and a display unit 76b that displays and outputs various data. The supporter terminal 70 can also remotely operate the MFP 10, and a display screen corresponding to that of the display unit 6b of the MFP 10 is displayed on the display unit 76b.

The supporter terminal 70 also includes a CPU, a semiconductor memory, and the like. The supporter terminal 70 implements various processing units by executing a predetermined software program on the CPU. Specifically, as shown in FIG. 4, the supporter terminal 70 implements various processing units such as a communication control unit 71 and an input control unit 72.

The communication control unit 71 is a processing unit that, in cooperation with a communication unit 74, controls the transmission and reception of data to and from a communication destination (for example, the guidance server 50).

The input control unit 72 is a control unit that controls operation input to the operation input unit 76a.

The speaker 77 is a device that emits sound based on voice data from the guidance server 50 or the like. The speaker 77 may be built into the supporter terminal 70 or may be attached to the supporter terminal 70 via a terminal.

The microphone 78 is a device that converts the supporter's voice or the like into an electrical signal (analog signal). The microphone 78 may be built into the supporter terminal 70 or may be attached to the supporter terminal 70 by external connection. The electrical signal (analog signal) is converted into digital data (voice data) in the supporter terminal 70.

<1-5. Operation>
Next, the operation of the operation guidance system 1 according to the first embodiment will be described with reference to FIGS. 5 to 11.

FIG. 5 is a diagram showing the operation of the operation guidance system 1. The guidance server 50 receives image data 300 (display image data 300) from the MFP 10. When the received image data 300 (301) contains a secret word 110 (described later), the guidance server 50 generates composite image data 350 by image processing (image conversion processing or the like) and transmits the composite image data 350 to the supporter terminal 70. The guidance server 50 also receives user voice data 400 from the MFP 10. When the user voice data 400 contains a secret word 110, the guidance server 50 generates synthesized voice data 450 (synthesized user voice data 450) by voice processing and transmits the synthesized voice data 450 to the supporter terminal 70.

This makes it possible to prevent the secret word 110 contained in the image data 300 (301) and the user voice data 400 from leaking to the supporter 102.

The image processing and the voice processing will be described in more detail with reference to FIG. 6. FIG. 6 is a diagram showing timing relating to image data and voice data according to the first embodiment.

The MFP 10 transmits the image data 300 (301) displayed on the touch panel 25 to the guidance server 50. The MFP 10 also transmits user voice data 400, which contains data of voice uttered by the user 101, to the guidance server 50.

Upon receiving the image data 301 from the MFP 10, the guidance server 50 generates composite image data 350 (351) by image processing (described later) and transmits it to the supporter terminal 70. The guidance server 50 also extracts partial voice data 430 (described later), which is a part of the user voice data 400. The guidance server 50 then generates synthesized voice data 450 (451) by voice processing (described later) and transmits it to the supporter terminal 70.

Upon receiving the composite image data 351 from the guidance server 50, the supporter terminal 70 displays it on the display unit 76b. Upon receiving the synthesized voice data 451 from the guidance server 50, the supporter terminal 70 outputs (plays back) the synthesized voice data 451.

The image processing and the voice processing according to the first embodiment are described more specifically below.

First, the image processing will be described with reference to FIG. 7 and other figures. FIG. 7 is a diagram showing the operation of the guidance server 50 in the first embodiment. FIG. 7 assumes a situation in which a user 101 operating the MFP 10 asks the support center how to operate the scan function of the MFP 10.

As shown in FIG. 7, the user 101 makes a support request to the support center while viewing the display image (display image data 301) shown on the touch panel 25 of the MFP 10. Specifically, the user 101 presses a help button (not shown) provided on the operation panel 6c of the MFP 10 while viewing the destination designation screen for scanned-image transmission. When the user 101 presses the help button, the MFP 10 (specifically, the transmission unit 4a) transmits to the guidance server 50 a support request signal indicating that a request for operation guidance has arisen from the user 101.

FIG. 8 is a flowchart showing the image-processing operation performed after the guidance server 50 receives the support request signal.

When the reception unit 54b of the guidance server 50 receives the support request signal from the MFP 10, the transmission unit 54a of the guidance server 50 transmits the support request signal to the supporter terminal 70 (step S11).

Thereafter, when the supporter 102 presses a guidance start button (not shown) on the supporter terminal 70, a signal indicating that operation guidance should be started (a start signal) is transmitted from the supporter terminal 70 to the guidance server 50, and the guidance server 50 transmits the start signal to the MFP 10. The MFP 10 and the supporter terminal 70 thereby transition to an operation guidance mode.

Upon receiving the operation guidance start signal, the MFP 10 (specifically, the transmission unit 4a) transmits the image data 301 currently displayed on the touch panel 25 to the guidance server 50, and the guidance server 50 (specifically, the reception unit 54b) receives the image data 301 from the MFP 10 (step S12). The guidance server 50 (specifically, the image generation unit 61) then determines whether the image data 301 contains a secret word 110 (described later) (step S13). More specifically, the image generation unit 61 performs character recognition on the image data 301 by OCR processing and determines whether a secret word 110 is contained.

When receiving the image data 301, the guidance server 50 also receives from the MFP 10 a secret word list 601 (see FIG. 9), which lists the secret words 110 contained in the image data 301. The destinations contained in the destination designation screen (image data 301) are extracted and registered in the secret word list 601 as secret words 110. The guidance server 50 then generates a conversion dictionary 651 based on the secret word list 601. The generated conversion dictionary 651 registers each secret word 110 together with the alternative word 210 corresponding to it.

Specifically, the conversion dictionary 651 registers "Hase Real Estate" (111 (110)), "Takahashi Electric" (112 (110)), and "Matsubara Construction" (113 (110)) as secret words 110. Furthermore, "ABC" is registered as the alternative word 211 (210) corresponding to the secret word 111 "Hase Real Estate", "DEF" is registered as the alternative word 212 (210) corresponding to the secret word 112 "Takahashi Electric", and "GHIJ" is registered as the alternative word 213 (210) corresponding to the secret word 113 "Matsubara Construction". Each alternative word 210 is generated automatically by the guidance server 50.
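The dictionary generation described above can be illustrated with a minimal sketch. The source does not state how the guidance server 50 chooses alternative words, so the scheme below (successive three-letter blocks of the alphabet) is purely an assumption, and the function name `build_conversion_dictionary` is hypothetical. Under this scheme the third placeholder is "GHI", not the "GHIJ" of the example in FIG. 9.

```python
import string


def build_conversion_dictionary(secret_words, block=3):
    """Map each secret word to a distinct placeholder (hypothetical scheme).

    Placeholders are successive letter blocks: 'ABC', 'DEF', 'GHI', ...
    A numeric suffix is appended once the alphabet has been cycled through,
    so that placeholders stay distinct.
    """
    letters = string.ascii_uppercase
    dictionary = {}
    # dict.fromkeys removes duplicates while preserving the list order.
    for index, word in enumerate(dict.fromkeys(secret_words)):
        start = index * block
        placeholder = "".join(
            letters[(start + k) % len(letters)] for k in range(block)
        )
        cycle = index // len(letters)
        if cycle:
            placeholder += str(cycle)
        dictionary[word] = placeholder
    return dictionary


# Secret word list as it might arrive from the MFP (illustrative values).
secret_word_list = ["長谷不動産", "高橋電器", "松原工務店"]
conversion_dictionary = build_conversion_dictionary(secret_word_list)
```

Any scheme would serve equally well as long as the placeholders are distinct and reveal nothing about the words they replace.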

Here, a secret word 110 is a word or phrase representing information that the user should keep confidential. In the present embodiment, the guidance server 50 determines as secret words 110 the words to be concealed that are contained in the image data 301 displayed on the touch panel 25 (more specifically, the transmission destinations on the destination designation screen for scanned-image transmission).

When it is determined in step S13 that the image data 301 contains a secret word 110, the guidance server 50 generates, based on the conversion dictionary 651, composite image data 351, which is the data of a composite image in which the secret word 110 is replaced with the corresponding alternative word 210 (step S14). Once the composite image data 351 is generated, the guidance server 50 transmits it to the supporter terminal 70 as data for display on the supporter terminal 70 (step S15). The display unit 76b of the supporter terminal 70 that has received the composite image data 351 displays the composite image data 351 in place of the image data 301 (see FIG. 7).

On the other hand, when it is determined that the image data 301 contains no secret word 110, the guidance server 50 skips step S14, and in step S15 the image data 301 is used as the composite image data 351 without modification. That is, the image data 301 is transmitted to the supporter terminal 70 and displayed as-is on the display unit 76b.
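The branch between steps S13, S14, and S15 can be sketched as follows. This is a simplified illustration that operates on already-recognized text lines rather than on pixels: the OCR step and the re-rendering of the image are outside its scope, and the function name `make_composite_text` is hypothetical.

```python
def make_composite_text(recognized_lines, conversion_dictionary):
    """Return (lines_to_send, was_modified) for the supporter terminal.

    recognized_lines: text lines assumed to come from OCR of the screen.
    conversion_dictionary: mapping of secret word -> alternative word.
    """
    # Step S13: does any recognized line contain a secret word?
    contains_secret = any(
        secret in line
        for line in recognized_lines
        for secret in conversion_dictionary
    )
    if not contains_secret:
        # Step S14 is skipped; the original content is sent unchanged.
        return list(recognized_lines), False

    # Step S14: replace each secret word with its alternative word.
    composite = []
    for line in recognized_lines:
        for secret, alternative in conversion_dictionary.items():
            line = line.replace(secret, alternative)
        composite.append(line)
    return composite, True


screen = ["宛先: 長谷不動産", "宛先: 高橋電器"]
dictionary = {"長谷不動産": "ABC", "高橋電器": "DEF"}
lines_to_send, modified = make_composite_text(screen, dictionary)
```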

Next, the voice processing is described.

Upon receiving the operation guidance start signal, the MFP 10 starts transmitting the user voice data 400.

FIG. 7 assumes a situation in which the user 101, while viewing the image data 301 displayed on the touch panel 25 of the MFP 10, says to the MFP 10, "I want to scan a file and send it to Hase Real Estate."

When the voice uttered by the user 101 is input to the MFP 10 through the microphone 18, the MFP 10 transmits user voice data 400, which is the data of that voice, to the guidance server 50. The user voice data 400 is transmitted to the guidance server 50 in real time (continuously).

The operation after the guidance server 50 receives the user voice data 400 is described with reference to the flowchart of FIG. 10. When the guidance server 50 (specifically, the reception unit 54b) receives the user voice data 400 (step S20), the voice recognition unit 64 determines whether the user voice data 400 contains a non-silent portion (step S21). When a non-silent portion is determined to exist, the voice recognition unit 64 next determines whether the user voice data 400 contains a silent portion lasting a predetermined time or longer (step S22).

When it is determined that the user voice data 400 contains a silent portion lasting the predetermined time or longer, the voice recognition unit 64 extracts partial voice data 430, which is a part of the user voice data 400 (step S23). In other words, the user voice data 400 is segmented so that each segment ends at the point where silence has continued for the predetermined time, and each such segment is extracted as partial voice data 430.

Here, the partial voice data 430 is the voice data of the interval (period) of the user voice data 400 between the start of a non-silent portion and the point at which the silent state has continued for the predetermined time after the end of that non-silent portion. The user voice data 400 contains silent portions and non-silent portions, and is divided into multiple pieces of partial voice data 430 by those silent portions.

The predetermined time used to judge the presence of a silent portion is preferably set to a relatively short period (for example, 0.5 seconds). With a relatively short setting, the voice recognition unit 64 also extracts the partial voice data 430 as data of a relatively short period. As a result, the delay in transmitting the synthesized voice data 450 corresponding to the partial voice data 430 to the supporter terminal 70 can be suppressed (described later).
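The segmentation of steps S21 to S23 can be sketched as below. The sketch assumes that the audio has already been split into fixed-length frames and that each frame has been classified as voiced or silent; the classification method, the 100 ms frame length, and the function name `extract_partial_segments` are assumptions for illustration. Only the 0.5 second silence threshold comes from the text.

```python
def extract_partial_segments(voiced_frames, frame_ms=100, silence_ms=500):
    """Split a frame sequence into partial segments.

    voiced_frames: iterable of booleans, True for a non-silent frame.
    Returns a list of (start, end) frame indices, end exclusive. Each
    segment runs from the start of a non-silent portion to the point where
    silence has lasted silence_ms after that portion ended.
    """
    # Number of consecutive silent frames that closes a segment (ceiling).
    frames_needed = max(1, -(-silence_ms // frame_ms))
    segments = []
    start = None
    silent_run = 0
    for index, is_voiced in enumerate(voiced_frames):
        if start is None:
            if is_voiced:  # Step S21: a non-silent portion begins.
                start = index
                silent_run = 0
        elif is_voiced:
            silent_run = 0  # Speech resumed before the threshold.
        else:
            silent_run += 1
            if silent_run >= frames_needed:  # Step S22: long silence found.
                segments.append((start, index + 1))  # Step S23.
                start = None
    return segments


frames = [False, True, True] + [False] * 6 + [True, False]
segments = extract_partial_segments(frames)
```

A segment still in progress when the input ends is not emitted, which matches a streaming setting where more frames are expected to arrive.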

When the partial voice data 430 is extracted in step S23, the guidance server 50 (specifically, the voice generation unit 65) determines, by voice recognition processing on the partial voice data 430, whether the partial voice data 430 contains a secret word 110 (step S24).

When the voice recognition unit 64 determines that the partial voice data 430 contains a secret word 110, the voice generation unit 65 generates, based on the conversion dictionary 651 (see FIG. 9), alternative voice data 250, which is the voice data of the alternative word 210 corresponding to the secret word 110 (step S25).

More specifically, in step S24 the voice generation unit 65 determines, based on the conversion dictionary 651, that the partial voice data 430 uttered by the user 101 contains the secret word 111 "Hase Real Estate". In response to this determination, in step S25 the voice generation unit 65 generates, by machine voice generation processing, alternative voice data 250 (251) of the alternative word 211 "ABC" corresponding to the secret word 111 "Hase Real Estate". The alternative voice data 250 is voice data generated artificially to imitate a human voice (machine voice data).

When the alternative voice data 251 is generated in step S25, the voice generation unit 65 generates synthesized voice data 450 (451), in which the secret voice data 150 (151), that is, the voice data of the secret word 110 within the partial voice data 430, is replaced with the alternative voice data 251 (step S26). The guidance server 50 then transmits the synthesized voice data 450 to the supporter terminal 70 as data for voice output on the supporter terminal 70 (step S27). The secret voice data 150 is the voice data (recorded voice data) of the voice of the user 101 recorded at the MFP 10 (the voice corresponding to the secret word 110).

On the other hand, when it is determined that the partial voice data 430 contains no secret word 110, the voice generation unit 65 skips steps S25 and S26 and uses the partial voice data 430 as the synthesized voice data 450 without modification. That is, the partial voice data 430 is transmitted to the supporter terminal 70 and output as-is.

The supporter terminal 70 that has received the synthesized voice data 450 (451) outputs, through the speaker 77, the synthesized voice data 450 (451) in place of the partial voice data 430. Specifically, at the supporter terminal 70, the portions of the user 101's utterance before and after the secret word ("scan a file and" and "I want to send it to") are output in the voice of the user 101, while the secret word 111 "Hase Real Estate" is changed to "ABC" by the alternative voice data 251 and output.

FIG. 11 illustrates the voice processing involved in generating the synthesized voice data 451. In FIG. 11, the partial voice data 431 (430) of the utterance by the user 101, "I want to scan a file and send it to Hase Real Estate.", contains the secret word 111 "Hase Real Estate". In this case, the voice generation unit 65 replaces the secret voice data 150 (151) of the secret word 111 "Hase Real Estate" contained in the partial voice data 431 with the alternative voice data 250 (251) of the alternative word 211 "ABC" corresponding to the secret word 111, thereby generating the synthesized voice data 451 (450). In other words, the voice generation unit 65 generates the synthesized voice data 450 (451) by combining the voice data of the partial voice data 430 (431) excluding the secret word 111 with the alternative voice data 250 (251).
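The splice performed in steps S25 and S26 can be sketched as follows. The sketch assumes that the voice recognition step yields, for each recognized word, the sample range it occupies, and that a text-to-speech function is available. Both are stand-ins: `word_spans`, `text_to_speech`, and `synthesize_voice` are hypothetical names, and the source does not describe how word boundaries are located.

```python
def synthesize_voice(samples, word_spans, conversion_dictionary, text_to_speech):
    """Replace the samples of each secret word with machine-generated audio.

    samples: list of audio samples of one piece of partial voice data.
    word_spans: list of (word, start, end) with end exclusive, assumed to be
        produced by voice recognition and not to overlap.
    text_to_speech: callable returning a list of samples for a given text.
    """
    output = []
    cursor = 0
    for word, start, end in sorted(word_spans, key=lambda span: span[1]):
        if word not in conversion_dictionary:
            continue  # Ordinary words keep the user's recorded voice.
        output.extend(samples[cursor:start])  # Recorded voice before the word.
        output.extend(text_to_speech(conversion_dictionary[word]))  # Step S25.
        cursor = end  # Skip the recorded secret word (step S26).
    output.extend(samples[cursor:])  # Recorded voice after the last secret word.
    return output


def fake_text_to_speech(text):
    """Stand-in synthesizer: one marker sample per character."""
    return [-1] * len(text)


recorded = list(range(10))
spans = [("長谷不動産", 4, 7)]
synthesized = synthesize_voice(recorded, spans, {"長谷不動産": "ABC"}, fake_text_to_speech)
```

When no span matches the dictionary, the function returns the recorded samples unchanged, which corresponds to the case where steps S25 and S26 are skipped.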

With the operation described above, when the display screen of the MFP 10 operated by the user 101 contains a secret word 110, composite image data 350 (351) in which the secret word 110 is replaced with the alternative word 210 is transmitted to the supporter terminal 70, so the secret word 110 is not displayed on the display unit 76b of the supporter terminal 70. Leakage of confidential information contained in the display screen of the MFP 10 operated by the user 101 can therefore be avoided.

In particular, when it is determined that the partial voice data 430 of the voice uttered by the user 101 contains a secret word 110, synthesized voice data 450 (451) in which the secret voice data 150 (151) within the partial voice data 430 (431) is replaced with the alternative voice data 251 is transmitted to the supporter terminal 70. The voice of the secret word 110 uttered by the user 101 is therefore not transmitted to the supporter terminal 70 as-is. As a result, leakage of confidential information contained in the voice of the user 101 can be avoided.

If the partial voice data 430 were not extracted from the user voice data 400 of the voice uttered by the user 101, the user voice data 400 to be processed would become long (the duration of the user 101's speech), and transmission of the synthesized voice data 450 to the supporter terminal 70 would be greatly delayed. In the first embodiment, by contrast, the user voice data 400 of the voice uttered by the user 101 is divided into relatively short periods, multiple pieces of partial voice data 430 are extracted from the user voice data 400 one after another, and these pieces are transmitted to the supporter terminal 70 one after another. That is, the synthesized voice data 450 corresponding to a piece of partial voice data 430 can be transmitted to the supporter terminal 70 relatively early, without waiting for processing of the portion of the user voice data 400 that follows that piece to finish. The delay in transmitting voice data to the supporter terminal 70 can therefore be suppressed.

＜Second Embodiment＞
The second embodiment is a modification of the first embodiment. The following description focuses on the differences from the first embodiment.

In the first embodiment, immediately after the image generation unit 61 finishes generating the composite image data 351, the composite image data 350 (351) is transmitted from the guidance server 50 to the supporter terminal 70 without regard to the transmission timing of the synthesized voice data 451. This can cause the problem shown in FIG. 12.

In FIG. 12, while the user 101 is speaking (specifically, while the partial voice data 430 is being generated (recorded)), the display image on the touch panel 25 of the MFP 10 is switched, in response to an operation by the user 101, from the image based on the image data 301 to an image based on image data 302 (described later). For example, the user 101 may be speaking while looking at the image based on the image data 301 and yet, partway through speaking, go ahead and perform an operation that switches the operation screen.

Also in FIG. 12, immediately after generation of the composite image data 352 for the image data 302 is completed, the composite image data 352 is transmitted from the guidance server 50 to the supporter terminal 70 without regard to the transmission timing of the synthesized voice data 451. In response to this transmission, the display image on the supporter terminal 70 changes from the image based on the composite image data 351 to the image based on the composite image data 352. Then, after the display image has changed (in other words, while the image based on the new composite image data 352 is displayed), the synthesized voice data 451 corresponding to the partial voice data 430 is output at the supporter terminal 70.

As a result, the synthesized voice data 451, which corresponds to the voice uttered while viewing the image based on the original image data 301 and should be output while the composite image data 351 is displayed, is instead output while the composite image data 352 (the composite image data corresponding to the image data 302 that follows the image data 301) is displayed. Such a mismatch between image and voice may confuse the supporter 102.

The image data 302 is the image data of the display image shown on the touch panel 25 of the MFP 10 immediately after the display image based on the display image data 301.

In view of this problem, the second embodiment illustrates a mode in which transmission of the composite image data 352 is withheld until a predetermined point after transmission of the synthesized voice data 451 is completed (specifically, until the required output time of the synthesized voice data 451 has elapsed from the completion of its transmission).

FIG. 13 is a timing chart showing the transmission timings and related events for the image data 300 and the partial voice data 430 according to the second embodiment.

The second embodiment assumes a situation in which, while the user 101 is speaking (specifically, while the partial voice data 430 is being generated), the user 101 goes ahead and switches the operation screen, so that the display image on the touch panel 25 of the MFP 10 is switched from the image based on the image data 301 to the image based on the image data 302. The partial voice data 430 is voice data that contains, in its opening portion or elsewhere, the voice uttered by the user 101 while the image data 301 was displayed. The voice processing and the image processing are described in turn below.

First, the voice processing is described with reference to FIG. 14 and in comparison with FIG. 10. FIG. 14 is a flowchart showing the voice processing and related operations for the user voice data 400 according to the second embodiment.

In FIG. 14, step S41 is provided between step S21 and step S22, and steps S42 and S43 are provided after step S27. Specifically, when the voice recognition unit 64 determines that a non-silent portion exists in the user voice data 400 (step S21), the guidance server 50 changes a stop flag FG to ON (step S41).

Here, the stop flag FG is flag information stored in the storage unit 55 and is controlled by the voice recognition unit 64 or the voice generation unit 65. The stop flag FG is set to either ON or OFF. The transmission unit 54a of the guidance server 50 decides whether to transmit an image according to whether the stop flag FG is ON or OFF. If the stop flag FG is ON, the transmission unit 54a does not transmit the image. If the stop flag FG is OFF, the transmission unit 54a transmits the image.

After the stop flag FG is changed to ON, the guidance server 50 executes steps S22 to S27 in the same manner as in FIG. 10. The synthesized voice data 450 corresponding to the partial voice data 430 is thereby transmitted to the supporter terminal 70.

When the guidance server 50 transmits the synthesized voice data 450 to the supporter terminal 70, the supporter terminal 70 outputs the synthesized voice data 450. When the required output time of the synthesized voice data 450 has elapsed since the guidance server 50 completed transmission of the synthesized voice data 450 (step S42), the guidance server 50 changes the stop flag FG to OFF (step S43).

The required output time (required playback time) is the time needed to output (play back) the synthesized voice data 450. It can also be described as the recording time of the partial voice data 430 (synthesized voice data 450). The required output time may be obtained by the voice recognition unit 64. This is not a limitation, however, and the voice generation unit 65 may calculate the required output time while generating the synthesized voice data 450. Alternatively, the supporter terminal 70 may transmit to the guidance server 50 a signal indicating that output of the synthesized voice data 450 has finished, so that the supporter terminal 70 notifies the guidance server 50 that the required output time has elapsed.
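One way to obtain the required output time on the server side is to derive it from the amount of audio data, as in the minimal sketch below. It assumes uncompressed audio with a known sample rate; the function names and the use of seconds as the time unit are assumptions, since the source does not specify the audio format.

```python
def required_output_seconds(num_samples, sample_rate_hz):
    """Playback duration of an uncompressed audio clip, in seconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return num_samples / sample_rate_hz


def flag_release_time(transmission_done_at, num_samples, sample_rate_hz):
    """Time at which the stop flag FG may be turned OFF (steps S42 and S43).

    transmission_done_at: time, in seconds, at which transmission of the
        synthesized voice data finished.
    """
    return transmission_done_at + required_output_seconds(num_samples, sample_rate_hz)


# Three seconds of 16 kHz audio whose transmission completed at t = 10 s.
release_at = flag_release_time(10.0, 48000, 16000)
```

The notification-based alternative mentioned above avoids this calculation entirely, at the cost of one extra message from the supporter terminal 70.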

In this way, the synthesized voice data 450 is transmitted from the guidance server 50 to the supporter terminal 70 and output at the supporter terminal 70. The stop flag FG is set to ON during a period T1 (see FIG. 13) that runs from the point at which a non-silent portion is detected in the user voice data 400 (the start of the partial voice data 430 (431)) to the point at which the required output time (required playback time) of the synthesized voice data 450 (451) corresponding to that partial voice data 430 (431) has elapsed after its transmission was completed. In all other periods (for example, while a silent portion of the user voice data 400, that is, a portion determined not to be partial voice data 430, is being received), the stop flag FG is set to OFF.

Next, the image processing is described with reference to FIG. 15 and in comparison with FIG. 8. FIG. 15 is a flowchart showing the image processing and related operations according to the second embodiment. The following describes, with reference to FIG. 15, the image processing for the image data 302 that follows the image data 301. The image processing for the image data 301 is assumed to have already finished through the same operation as in the first embodiment (see FIG. 8).

As shown in FIG. 15, steps S32 to S35 are the same as steps S12 to S15 in FIG. 8. Because the support request signal has already been sent and received before the image data 301 was transmitted, step S11 of FIG. 8 does not appear in FIG. 15.

In FIG. 15, step S36 is provided between step S34 and step S35. In step S36, before the composite image data 352 generated by the image generation unit 61 is transmitted to the supporter terminal 70, the guidance server 50 (specifically, the transmission unit 54a) checks the value of the stop flag FG set in the voice processing of FIG. 14 (whether it is ON or OFF).

When the stop flag FG is found to be OFF, the guidance server 50 permits transmission of the composite image data 352 and transmits it to the supporter terminal 70 (step S35). When the stop flag FG is found to be ON (specifically, while the guidance server 50 is performing steps S22 to S27 and S41 to S43), the guidance server 50 prohibits transmission of the composite image data 352, and the composite image data 352 is not transmitted to the supporter terminal 70.

As described above, the stop flag FG is set to ON during the period T1 (see FIG. 13). Therefore, as shown in FIG. 13, the guidance server 50 that receives new image data 302 within the period T1 generates the composite image data 352 by image processing but does not transmit it to the supporter terminal 70 within the period T1.

Thereafter, when the period T1 ends and the stop flag FG is changed from ON to OFF, the guidance server 50 transmits the composite image data 352 to the supporter terminal 70.

Thus, when the guidance server 50 receives new image data 302 during the period T1 (see FIG. 13), the guidance server 50 (specifically, the transmission unit 54a) does not permit transmission of the new composite image data 352 until the end of the period T1 and permits its transmission after the period T1 has ended.
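The interaction between the stop flag FG and image transmission can be sketched as a small gate object. This is a minimal single-threaded illustration: the class and method names are hypothetical, and keeping only the most recent withheld image is an assumption, since the source describes a single pending image (the composite image data 352).

```python
class ImageTransmissionGate:
    """Withholds composite images while the stop flag FG is ON."""

    def __init__(self, send_image):
        self.stop_flag = False  # FG: False means OFF, True means ON.
        self._send_image = send_image
        self._pending = None  # Image generated during period T1, if any.

    def on_non_silence_detected(self):
        """Step S41: a non-silent portion was detected, so period T1 begins."""
        self.stop_flag = True

    def on_composite_image_ready(self, image):
        """Steps S36 and S35: send now, or withhold until T1 ends."""
        if self.stop_flag:
            self._pending = image
        else:
            self._send_image(image)

    def on_output_time_elapsed(self):
        """Steps S42 and S43: playback time has elapsed, so period T1 ends."""
        self.stop_flag = False
        if self._pending is not None:
            self._send_image(self._pending)
            self._pending = None


sent = []
gate = ImageTransmissionGate(sent.append)
gate.on_composite_image_ready("composite 351")  # FG is OFF: sent at once.
gate.on_non_silence_detected()                  # The user starts speaking.
gate.on_composite_image_ready("composite 352")  # FG is ON: withheld.
gate.on_output_time_elapsed()                   # T1 ends: 352 is sent now.
```

In a real server the voice and image paths would likely run concurrently, in which case access to the flag and the pending image would need to be synchronized.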

以上のような動作によれば、案内サーバ５０は、部分音声データ４３０の受信が開始されると新たな合成画像データ３５２の送信を許可せず、合成音声データ４５０（４５１）の送信完了時点以後の所定の時点において合成画像データ３５２の送信を許可するので、サポータ端末７０の表示部７６ｂにおける合成画像データ３５１から合成画像データ３５２への画像の変更は、合成音声データ４５０の送信完了後に行われる。したがって、画像データ３０１を見ながら発せられたユーザ１０１の音声が、当該音声の伝達の遅延に起因して合成画像データ３５２（画像データ３０１の次の画像データ３０２に対応する画像）の表示中にサポータ端末側で出力されることを抑制あるいは回避することが可能である。 With the operation described above, the guidance server 50 stops permitting transmission of new composite image data 352 once reception of the partial voice data 430 starts, and permits transmission of the composite image data 352 at a predetermined time at or after completion of transmission of the synthesized voice data 450 (451). The change of the image on the display unit 76b of the supporter terminal 70 from the composite image data 351 to the composite image data 352 therefore takes place after transmission of the synthesized voice data 450 is complete. Accordingly, it is possible to suppress or avoid a situation in which, because of a delay in conveying the voice, the voice that the user 101 uttered while viewing the image data 301 is output at the supporter terminal while the composite image data 352 (the image corresponding to the image data 302 that follows the image data 301) is being displayed.

端的に言えば、サポータ端末７０において、合成画像データ３５１に基づく画像の表示のタイミングと合成音声データ４５１に基づく音声の出力のタイミングとのずれを抑制あるいは回避することが可能である。その結果、サポータ１０２が混乱することなくユーザ１０１に的確な操作案内をすることが可能である。 In short, in the supporter terminal 70, it is possible to suppress or avoid a shift between the image display timing based on the synthesized image data 351 and the audio output timing based on the synthesized audio data 451. As a result, it is possible to provide accurate operation guidance to the user 101 without the supporter 102 being confused.

また、特に、合成音声データ４５１の送信完了から当該合成音声データ４５１の出力所要時間（再生所要時間）が経過した時点以後において、停止フラグＦＧがオンからオフに変更され合成画像データ３５２の送信が許可されることが好ましい。これによれば、合成画像データ３５１に基づく画像の表示のタイミングと合成音声データ４５１に基づく音声の出力のタイミングとのずれを更に抑制あるいは回避することが可能である。 In particular, it is preferable that the stop flag FG be changed from ON to OFF, and transmission of the composite image data 352 be permitted, at or after the time when the output time (playback time) of the synthesized voice data 451 has elapsed since completion of its transmission. This makes it possible to further suppress or avoid a shift between the timing of displaying the image based on the composite image data 351 and the timing of outputting the voice based on the synthesized voice data 451.
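
The gating of image transmission by the stop flag FG can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation described in the patent: the class `ImageGate`, its method names, the use of plain floats for time, and the `sent` list standing in for the supporter terminal 70 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ImageGate:
    """Holds composite images back while the stop flag FG is ON (hypothetical model)."""
    stop_flag: bool = False                # FG: True means ON (period T1)
    release_time: float = 0.0              # earliest time FG may return to OFF
    pending: Optional[str] = None          # newest composite image being held back
    sent: List[str] = field(default_factory=list)  # stands in for the supporter terminal

    def on_partial_voice_started(self) -> None:
        # Reception of partial voice data begins: prohibit image transmission.
        self.stop_flag = True
        self.release_time = float("inf")   # unknown until the voice has been sent

    def on_voice_sent(self, now: float, playback_seconds: float) -> None:
        # Preferred variant: wait for the playback time after transmission completes.
        self.release_time = now + playback_seconds

    def on_composite_image(self, image: str, now: float) -> None:
        # The image is always generated; it is only transmitted if permitted.
        self.pending = image
        self.tick(now)

    def tick(self, now: float) -> None:
        if self.stop_flag and now >= self.release_time:
            self.stop_flag = False         # period T1 ends
        if not self.stop_flag and self.pending is not None:
            self.sent.append(self.pending)
            self.pending = None
```

In this sketch, an image arriving while the flag is ON stays in `pending` and is released by the first `tick` at or after `release_time`, which mirrors the order "voice output first, image change afterwards".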

なお、この第２実施形態等においては、ユーザ音声データ４００に所定時間以上の無音部分が存在する場合に、音声認識部６４は、ユーザ音声データ４００の非無音部分の開始時点から次の無音部分の開始時点までの音声データを部分音声データ４３０として抽出することを例示した。しかしながら、本発明は、これに限定されない。 In the second embodiment and the like, an example was given in which, when the user voice data 400 contains a silent portion lasting a predetermined time or longer, the voice recognition unit 64 extracts, as the partial voice data 430, the voice data from the start of a non-silent portion of the user voice data 400 to the start of the next silent portion. However, the present invention is not limited to this.

たとえば、第２実施形態（あるいは第１実施形態）において、案内サーバ５０により合成画像データ３５０が受信された時点を終端とするように区分した部分の音声データがユーザ音声データ４００から部分音声データ４３０として抽出されるようにしてもよい。換言すれば、ユーザ１０１による操作画面の切換時点でユーザ音声データ４００が区切られて、ユーザ音声データ４００の一部の音声データである部分音声データ４３０が抽出されるようにしてもよい。 For example, in the second embodiment (or the first embodiment), the portion of the voice data delimited so as to end at the time when the composite image data 350 is received by the guidance server 50 may be extracted from the user voice data 400 as the partial voice data 430. In other words, the user voice data 400 may be divided at the time when the user 101 switches the operation screen, and the partial voice data 430, which is a part of the user voice data 400, may be extracted.

図１７は、このような改変例の動作を示すフローチャートである。 FIG. 17 is a flowchart showing the operation of such a modification.

図１７においては、ステップＳ２２の判定処理に加えてステップＳ４４の判定処理も行われる。両判定処理（ステップＳ２２，Ｓ４４）のいずれかで「ＹＥＳ」と判定されるとステップＳ２３に進み、部分音声データ４３０が抽出される。なお、ステップＳ４４では、新たな画像データを受信したか否かが判定される。 In FIG. 17, in addition to the determination process of step S22, the determination process of step S44 is also performed. If “YES” is determined in either of the determination processes (steps S22 and S44), the process proceeds to step S23, and the partial audio data 430 is extracted. In step S44, it is determined whether new image data has been received.

たとえば、所定時間以上の無音部分が存在しない旨がステップＳ２２で判定されたとしても、新たな画像データ３０２が受信された旨がステップＳ４４で判定されると、ステップＳ２３に進む。このステップＳ２３では、音声認識部６４は、ユーザ音声データ４００のうち、新たな表示画像の画像データ３０２の受信時点を終端とするように区分した部分音声データを、部分音声データ４３０として抽出する。 For example, even if it is determined in step S22 that no silent portion of the predetermined time or longer exists, the process proceeds to step S23 if it is determined in step S44 that new image data 302 has been received. In step S23, the voice recognition unit 64 extracts, as the partial voice data 430, the portion of the user voice data 400 delimited so as to end at the time of reception of the image data 302 of the new display image.
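
The two extraction triggers (step S22: silence of a predetermined length; step S44: reception of new image data) can be sketched as below. This is an illustrative Python sketch only. The frame-based model of the audio (a list of booleans, `True` for a voiced frame), the function name `extract_segments`, and its parameters are assumptions introduced here and do not appear in the patent.

```python
from typing import Iterable, List, Tuple


def extract_segments(voiced: List[bool], min_silence: int,
                     image_frames: Iterable[int] = ()) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of partial voice data."""
    cuts = set(image_frames)   # frame indices at which new image data arrived
    segments: List[Tuple[int, int]] = []
    start = None               # start frame of the current non-silent portion
    silence = 0                # length of the current run of silent frames
    for i, is_voiced in enumerate(voiced):
        if start is not None and i in cuts:
            # S44: new image data received, so the segment ends at this time.
            segments.append((start, i))
            start, silence = None, 0
        if is_voiced:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:
                # S22: silence is long enough; the segment ends where it began.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Flush whatever speech remains at the end of the input.
        segments.append((start, len(voiced) - silence))
    return segments
```

With an image arriving in the middle of continuous speech, the speech is split at that frame even though no qualifying silence exists, which yields shorter segments than silence detection alone.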

図１６は、この態様に係る動作のタイミング等を示すタイミングチャートである。 FIG. 16 is a timing chart showing operation timing and the like according to this aspect.

図１６に示すように、ＭＦＰ１０において画像データ３０１に基づく画像を見ながら発せられたユーザ１０１のユーザ音声データ４００は、画像データ３０２が案内サーバ５０により受信された時点で区切られる。案内サーバ５０の音声認識部６４は、ＭＦＰ１０から送信されるユーザ音声データ４００のうち、ユーザ音声データ４００の非無音部分の開始時点から新たな画像データ３０２を受信した時点までの部分の音声データを部分音声データ４３１（４３０）として抽出する。 As shown in FIG. 16, the user voice data 400 of the user 101, uttered while viewing the image based on the image data 301 on the MFP 10, is divided at the time when the image data 302 is received by the guidance server 50. The voice recognition unit 64 of the guidance server 50 extracts, as the partial voice data 431 (430), the portion of the user voice data 400 transmitted from the MFP 10 that extends from the start of a non-silent portion of the user voice data 400 to the time when the new image data 302 is received.

案内サーバ５０は、当該部分音声データ４３１に関する合成音声データ４５１を生成し、合成音声データ４５１をサポータ端末７０に送信する。そして、サポータ端末７０において合成音声データ４５１が出力される。 The guidance server 50 generates synthesized voice data 451 related to the partial voice data 431 and transmits the synthesized voice data 451 to the supporter terminal 70. Then, the synthesized voice data 451 is output from the supporter terminal 70.

一方、新たな画像データ３０２は、案内サーバ５０による画像処理によって合成画像データ３５２に変更される。そして、案内サーバ５０は、合成音声データ４５１の送信が完了してから合成音声データ４５１の出力所要時間が経過した後に、合成画像データ３５２をサポータ端末７０に送信する。その後、サポータ端末７０の表示部７６ｂにおいて合成画像データ３５２に基づく画像が表示される。 On the other hand, the new image data 302 is changed to composite image data 352 by image processing by the guidance server 50. Then, the guide server 50 transmits the synthesized image data 352 to the supporter terminal 70 after the required output time of the synthesized voice data 451 has elapsed after the transmission of the synthesized voice data 451 is completed. Thereafter, an image based on the composite image data 352 is displayed on the display unit 76 b of the supporter terminal 70.

これによれば、ユーザ１０１の音声が画像データ３０１から画像データ３０２への変更時点で区切られるので、比較的短い期間を有する部分音声データ４３１を抽出することができる。したがって、合成音声データ４５１のサポータ端末７０への送信遅延を更に抑制することが可能である。 According to this, since the voice of the user 101 is divided at the time of change from the image data 301 to the image data 302, the partial voice data 431 having a relatively short period can be extracted. Therefore, it is possible to further suppress the transmission delay of the synthesized voice data 451 to the supporter terminal 70.

また、部分音声データ４３１には、画像データ３０１を閲覧しながら発せられた音声のみが含まれる（次の画像データ３０２を閲覧しながら発せられた音声は含まれない）。したがって、サポータ端末７０において、表示される画像（合成画像データ３５１に基づく画像）と出力される音声（合成音声データ４５１に基づく音声）とのずれを更に抑制あるいは回避することが可能である。 Further, the partial sound data 431 includes only the sound uttered while viewing the image data 301 (the sound uttered while viewing the next image data 302 is not included). Therefore, in the supporter terminal 70, it is possible to further suppress or avoid the deviation between the displayed image (image based on the synthesized image data 351) and the output sound (sound based on the synthesized speech data 451).

＜第３実施形態＞ <Third Embodiment>
第３実施形態は、第１実施形態の変形例である。以下では、第１実施形態との相違点を中心に説明する。 The third embodiment is a modification of the first embodiment. The following description focuses on the differences from the first embodiment.

第１実施形態においては、ユーザ１０１により発せられた音声のユーザ音声データ４００を案内サーバ５０が受信すると、音声生成部６５は、当該音声に含まれる秘匿ワード１１０に対応する代替音声データ２５０（２５１）を生成し、当該代替音声データ２５０を利用して合成音声データ４５０（４５１）を生成する態様を例示した。第１実施形態においては、音声生成部６５は代替音声データ２５０を逐次生成し、生成された代替音声データ２５１は格納されない。 The first embodiment illustrated a mode in which, when the guidance server 50 receives the user voice data 400 of the voice uttered by the user 101, the voice generation unit 65 generates the alternative voice data 250 (251) corresponding to the secret word 110 contained in the voice, and generates the synthesized voice data 450 (451) using that alternative voice data 250. In the first embodiment, the voice generation unit 65 generates the alternative voice data 250 each time it is needed, and the generated alternative voice data 251 is not stored.

第３実施形態では、合成音声データ４５０の生成に先立つ所定の時点（具体的には、サポート依頼信号を案内サーバ５０が受信した時点）で、複数の秘匿ワード１１０に対応する複数の代替音声データ２５０の生成が音声生成部６５により開始され、生成された代替音声データ２５０が案内サーバ５０の格納部５５に予め格納される。そして、ユーザ音声データ４００に秘匿ワード１１０が含まれ且つ秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に既に格納されている場合には、格納部５５に予め格納されている代替音声データ２５０を用いて合成音声データ４５０が音声生成部６５により生成される。 In the third embodiment, at a predetermined time prior to generation of the synthesized voice data 450 (specifically, when the guidance server 50 receives a support request signal), the voice generation unit 65 starts generating a plurality of alternative voice data 250 corresponding to a plurality of secret words 110, and the generated alternative voice data 250 is stored in advance in the storage unit 55 of the guidance server 50. Then, when the user voice data 400 contains a secret word 110 and the alternative voice data 250 corresponding to that secret word 110 is already stored in the storage unit 55, the voice generation unit 65 generates the synthesized voice data 450 using the alternative voice data 250 stored in advance in the storage unit 55.

図１８は、第３実施形態に係る動作に関するタイミングを示す図であり、図１９は、第３実施形態に係るユーザ１０１のユーザ音声データ４００に対する音声処理を示すフローチャートである。図１８および図１９を参照して具体的に説明する。 FIG. 18 is a diagram illustrating timing related to the operation according to the third embodiment, and FIG. 19 is a flowchart illustrating audio processing on the user audio data 400 of the user 101 according to the third embodiment. This will be specifically described with reference to FIGS. 18 and 19.

サポートセンターに対するサポート依頼のために、ユーザ１０１が、ＭＦＰ１０の操作パネル６ｃに配設されたヘルプボタン（不図示）を押下すると、ＭＦＰ１０は、ユーザ１０１からの操作案内の発生を示すサポート依頼の信号を案内サーバ５０に送信する。 When the user 101 presses a help button (not shown) provided on the operation panel 6c of the MFP 10 for a support request to the support center, the MFP 10 displays a support request signal indicating the occurrence of operation guidance from the user 101. Is transmitted to the guidance server 50.

案内サーバ５０は、ＭＦＰ１０からサポート依頼信号を受信すると（ステップＳ１１（図１９））、複数の秘匿ワード１１０（後述）のリストである秘匿ワードリスト６０２（図２０参照）をもＭＦＰ１０から受信する。 When the guidance server 50 receives a support request signal from the MFP 10 (step S11 (FIG. 19)), the guidance server 50 also receives a secret word list 602 (see FIG. 20), which is a list of a plurality of secret words 110 (described later), from the MFP 10.

当該秘匿ワードリスト６０２には、当該複数の秘匿ワード１１０が登録されている。そして、案内サーバ５０は、当該秘匿ワードリスト６０２に基づいて変換辞書６５２（図２０参照）を生成する。生成された変換辞書６５２では、秘匿ワード１１０と、当該秘匿ワード１１０にそれぞれ対応する代替ワード２１０とが登録されている。 In the secret word list 602, the plurality of secret words 110 are registered. Then, the guidance server 50 generates a conversion dictionary 652 (see FIG. 20) based on the secret word list 602. In the generated conversion dictionary 652, the secret word 110 and the alternative words 210 respectively corresponding to the secret word 110 are registered.

ここにおいて、複数の秘匿ワード１１０は、ＭＦＰ１０のスキャン画像送信における宛先指定画面内の送信宛先を示す語句（ワード）と、ＭＦＰ１０のファクシミリ送信における宛先指定画面に含まれる送信宛先を示す語句と、ＭＦＰ１０のボックスに格納されたファイルに関する情報表示画面に表示されるファイル情報を示す語句とを含む。換言すれば、当該複数の秘匿ワード１１０には、複数の動作モードのそれぞれにて秘匿すべき複数の種類の語句が含まれる。ただし、秘匿ワード１１０は、これらの語句の全てを含むことを要さず、これらの語句の一部を含むものであってもよい。 Here, the plurality of secret words 110 include words indicating transmission destinations in the destination designation screen for scan image transmission of the MFP 10, words indicating transmission destinations included in the destination designation screen for facsimile transmission of the MFP 10, and words indicating file information displayed on the information display screen for files stored in a box of the MFP 10. In other words, the plurality of secret words 110 include plural types of words to be kept secret in each of a plurality of operation modes. However, the secret words 110 need not include all of these words and may include only some of them.

案内サーバ５０によるサポート依頼信号の受信に応答して、音声生成部６５は、変換辞書６５２に基づいて、複数の秘匿ワード１１０に対応する代替音声データ２５０の生成を開始する（ステップＳ５１）（図１８も参照）。また、案内サーバ５０は、生成した代替音声データ２５０を案内サーバ５０の格納部５５に順次に格納する（ステップＳ５２）。 In response to the reception of the support request signal by the guidance server 50, the voice generation unit 65 starts generating the alternative voice data 250 corresponding to the plurality of secret words 110 based on the conversion dictionary 652 (step S51) (FIG. 18). Further, the guidance server 50 sequentially stores the generated substitute voice data 250 in the storage unit 55 of the guidance server 50 (step S52).

操作案内の開始信号の送受信に伴う所定の時点において、案内サーバ５０は、画像データ３００（３０１）をＭＦＰ１０から受信し、変換辞書６５２に基づいて画像処理を行い、合成画像データ３５０（３５１）を生成する。そして、案内サーバ５０は、生成した合成画像データ３５０（３５１）をサポータ端末７０に送信する（図１８参照）。 At a predetermined time point accompanying transmission / reception of the operation guidance start signal, the guidance server 50 receives the image data 300 (301) from the MFP 10, performs image processing based on the conversion dictionary 652, and generates composite image data 350 (351). Generate. Then, the guidance server 50 transmits the generated composite image data 350 (351) to the supporter terminal 70 (see FIG. 18).

複数の代替音声データ２５０の生成中あるいは生成完了後において、案内サーバ５０は、ユーザ音声データ４００を受信し（ステップＳ２０）、ステップＳ２１〜Ｓ２４の各処理を実行する（図１９参照）。その後、部分音声データ４３０に秘匿ワード１１０が含まれていることがステップＳ２４において判定されると、ステップＳ５３に進む。 During or after generation of the plurality of alternative voice data 250, the guidance server 50 receives the user voice data 400 (step S20) and executes the processes of steps S21 to S24 (see FIG. 19). Thereafter, when it is determined in step S24 that the partial voice data 430 contains a secret word 110, the process proceeds to step S53.

ステップＳ５３では、秘匿ワード１１０に対応する代替音声データ２５０が案内サーバ５０の格納部５５に格納されているか否かが音声生成部６５により判定される。 In step S53, the voice generation unit 65 determines whether the alternative voice data 250 corresponding to the secret word 110 is stored in the storage unit 55 of the guidance server 50.

秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に格納されていることが判定される場合には、音声生成部６５は、既に格納されている代替音声データ２５０（２５１）を格納部５５から取得する（ステップＳ５４）。そして、音声生成部６５は当該代替音声データ２５１を用いて合成音声データ４５０（４５１）を生成し（ステップＳ２６）、案内サーバ５０は合成音声データ４５０をサポータ端末７０に送信する（ステップＳ２７）。 When it is determined that the alternative voice data 250 corresponding to the secret word 110 is stored in the storage unit 55, the voice generation unit 65 acquires the already stored alternative voice data 250 (251) from the storage unit 55 (step S54). The voice generation unit 65 then generates the synthesized voice data 450 (451) using the alternative voice data 251 (step S26), and the guidance server 50 transmits the synthesized voice data 450 to the supporter terminal 70 (step S27).

たとえば、秘匿ワード１１１「長谷不動産」に対応する代替音声データ２５１「ＡＢＣ」（図２０）が格納部５５に格納されていることが判定される場合には、音声生成部６５は、格納されている代替音声データ２５１「ＡＢＣ」を格納部５５から取得する。そして、音声生成部６５は、当該代替音声データ２５１「ＡＢＣ」を用いて合成音声データ４５１を生成し、案内サーバ５０は合成音声データ４５１をサポータ端末７０に送信する。合成音声データ４５１を受信したサポータ端末７０においては、合成音声データ４５１に基づく音声が出力される。 For example, when it is determined that the alternative voice data 251 “ABC” (FIG. 20) corresponding to the secret word 111 “Hase Real Estate” is stored in the storage unit 55, the voice generation unit 65 is stored. The alternative voice data 251 “ABC” is acquired from the storage unit 55. Then, the voice generation unit 65 generates synthesized voice data 451 using the alternative voice data 251 “ABC”, and the guidance server 50 transmits the synthesized voice data 451 to the supporter terminal 70. The supporter terminal 70 that has received the synthesized voice data 451 outputs a voice based on the synthesized voice data 451.

一方、代替音声データ２５０が格納部５５に格納されていないことがステップＳ５４において判定される場合には、音声生成部６５は、秘匿ワード１１０に対応する代替音声データ２５０を機械音声生成処理により生成する（ステップＳ２５）。そして、音声生成部６５は、生成した代替音声データ２５０を格納部５５に格納し（ステップＳ５５）、ステップＳ２６に進む。ステップＳ２６では、ステップＳ２５で生成された代替音声データ２５０を用いて合成音声データ４５０が生成される。 On the other hand, when it is determined in step S54 that the alternative voice data 250 is not stored in the storage unit 55, the voice generation unit 65 generates the alternative voice data 250 corresponding to the secret word 110 by the machine voice generation process. (Step S25). Then, the voice generation unit 65 stores the generated alternative voice data 250 in the storage unit 55 (step S55), and proceeds to step S26. In step S26, synthesized voice data 450 is generated using the alternative voice data 250 generated in step S25.

以上のような動作によれば、ユーザ１０１からのサポート依頼信号を案内サーバ５０が受信すると、複数の代替音声データ２５０の生成が開始され、生成された代替音声データ２５０が格納部５５に予め格納される。そして、ユーザ音声データ４００に秘匿ワード１１０が含まれ且つ代替音声データ２５０が既に格納部５５に格納されている旨が判定される場合には、格納されている代替音声データ２５０（２５１）を用いて合成音声データ４５０（４５１）が生成される。この場合、既に存在する代替音声データ２５０が利用されるため、代替音声データ２５０（２５１）が新たに生成されることを要しない。したがって、たとえばユーザ音声データ４００に秘匿ワード１１０が含まれる旨が判定された時点から代替音声データ２５０（２５１）の生成を開始する場合と比べて、代替音声データ２５１の準備時間が短縮され、合成音声データ４５１の生成に要する時間が短縮される。その結果、サポータ端末７０への合成音声データ４５１の送信の遅延を抑制することが可能である。 With the operation described above, when the guidance server 50 receives the support request signal from the user 101, generation of the plurality of alternative voice data 250 is started, and the generated alternative voice data 250 is stored in advance in the storage unit 55. When it is determined that the user voice data 400 contains a secret word 110 and that the corresponding alternative voice data 250 is already stored in the storage unit 55, the synthesized voice data 450 (451) is generated using the stored alternative voice data 250 (251). In this case, because the existing alternative voice data 250 is used, the alternative voice data 250 (251) does not need to be newly generated. Therefore, compared with a case where generation of the alternative voice data 250 (251) starts only after it is determined that the user voice data 400 contains the secret word 110, the time needed to prepare the alternative voice data 251 is shortened, and so is the time needed to generate the synthesized voice data 451. As a result, a delay in transmitting the synthesized voice data 451 to the supporter terminal 70 can be suppressed.
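
The combination of advance generation (steps S51 and S52) and lookup with fallback (steps S53, S54, S25 and S55) can be sketched as follows. This is a simplified Python illustration, not the patented implementation. The class `AlternativeVoiceStore`, the `limit` parameter (used here only to imitate pre-generation that has not finished yet), the stand-in synthesizer `fake_tts`, and the alternative word "DEF" are hypothetical. Only the pairing of "Hase Real Estate" with "ABC" is taken from the description of FIG. 20.

```python
from typing import Callable, Dict, Optional


class AlternativeVoiceStore:
    """Stand-in for the voice generation unit 65 plus the storage unit 55."""

    def __init__(self, conversion: Dict[str, str],
                 synthesize: Callable[[str], bytes]) -> None:
        self.conversion = conversion          # secret word -> alternative word
        self.synthesize = synthesize          # machine voice generation (assumed)
        self.store: Dict[str, bytes] = {}     # alternative voice data already stored
        self.on_demand = 0                    # how often the fallback had to run

    def pregenerate(self, limit: Optional[int] = None) -> None:
        # S51/S52: generate alternative voice data in advance and store it.
        for n, (secret, alternative) in enumerate(self.conversion.items()):
            if limit is not None and n >= limit:
                break                         # imitates generation still in progress
            if secret not in self.store:
                self.store[secret] = self.synthesize(alternative)

    def get(self, secret: str) -> bytes:
        if secret in self.store:              # S53: already stored?
            return self.store[secret]         # S54: reuse it
        audio = self.synthesize(self.conversion[secret])   # S25: generate now
        self.on_demand += 1
        self.store[secret] = audio            # S55: store for later reuse
        return audio


def fake_tts(text: str) -> bytes:
    # Placeholder for real speech synthesis: returns the text as bytes.
    return text.encode("utf-8")
```

A lookup for a word that was pre-generated returns immediately from `store`, while a word not yet covered pays the synthesis cost once and is cached afterwards.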

また、上記第３実施形態では、ＭＦＰ１０における複数の動作モード（スキャンモード、ファクシミリ送信モード、ボックスモード等）で表示され得る複数の秘匿ワード１１０に対応する複数の代替音声データ２５０が順次に生成されている。上記においては、複数の代替音声データ２５０の生成順序については特に言及していないが、次述するような優先順序で複数の代替音声データ２５０が生成されるようにしてもよい。 In the third embodiment, a plurality of alternative voice data 250 corresponding to a plurality of secret words 110 that can be displayed in a plurality of operation modes (scan mode, facsimile transmission mode, box mode, etc.) in the MFP 10 are sequentially generated. ing. In the above description, the generation order of the plurality of alternative sound data 250 is not particularly mentioned, but the plurality of alternative sound data 250 may be generated in the priority order as described below.

たとえば、ＭＦＰ１０における複数の動作モードで表示され得る複数の秘匿ワード１１０のうち、ユーザ１０１により操作されているＭＦＰ１０の現在の動作モードにて表示され得る秘匿ワード１１０に対応する代替音声データ２５０が優先的に生成されるようにしてもよい。 For example, among the plurality of secret words 110 that can be displayed in the plurality of operation modes of the MFP 10, the alternative voice data 250 corresponding to the secret words 110 that can be displayed in the current operation mode of the MFP 10 operated by the user 101 may be generated preferentially.

より具体的には、ＭＦＰ１０の現在の動作モードがスキャンモードであるときには、音声生成部６５は、複数の秘匿ワード１１０のうち、スキャンモード（現モード）にて表示され得る１つまたは複数の画像（宛先指定画面３０１等）に含まれる秘匿ワード１１０を優先処理対象ワードとして決定する。そして、音声生成部６５は、当該優先処理対象ワードに対応する代替音声データ２５０を生成し、生成した代替音声データ２５０を格納部５５に格納する。 More specifically, when the current operation mode of the MFP 10 is the scan mode, the sound generation unit 65 includes one or more images that can be displayed in the scan mode (current mode) among the plurality of secret words 110. The secret word 110 included in the (address designation screen 301 or the like) is determined as the priority processing target word. Then, the voice generation unit 65 generates alternative voice data 250 corresponding to the priority processing target word, and stores the generated alternative voice data 250 in the storage unit 55.

これによれば、現在の動作モードにて表示され得る秘匿ワード１１０に対応する代替音声データ２５０が優先的に生成されるので、ユーザ１０１により発せられる可能性の高い秘匿ワード１１０に対応する代替音声データ２５０を予め生成しておくことが可能である。したがって、合成音声データ４５０の生成の際に、格納部５５に格納されている代替音声データ２５０が用いられる可能性が高くなる。 According to this, the alternative voice data 250 corresponding to the secret words 110 that can be displayed in the current operation mode is generated preferentially, so the alternative voice data 250 corresponding to the secret words 110 that the user 101 is likely to utter can be generated in advance. Therefore, when the synthesized voice data 450 is generated, the alternative voice data 250 stored in the storage unit 55 is more likely to be used.

あるいは、秘匿ワード１１０の使用頻度に基づく優先順位に従って、複数の代替音声データ２５０が順次に生成されるようにしてもよい。 Alternatively, a plurality of alternative voice data 250 may be sequentially generated according to the priority order based on the usage frequency of the secret word 110.

具体的には、秘匿ワードリスト６０２の受信に際して、案内サーバ５０は、秘匿ワード１１０のそれぞれの使用頻度をもＭＦＰ１０から受信し、当該秘匿ワードリスト６０２および使用頻度に基づいて変換辞書６５２を生成する（図２０参照）。そして、音声生成部６５は、変換辞書６５２に登録されている複数の秘匿ワード１１０のうち、使用頻度が多い秘匿ワード１１０から順に、対応する代替音声データ２５０を生成し、生成した代替音声データ２５０を格納部５５に格納する。 Specifically, when receiving the secret word list 602, the guide server 50 also receives the usage frequency of each secret word 110 from the MFP 10 and generates the conversion dictionary 652 based on the secret word list 602 and the usage frequency. (See FIG. 20). Then, the voice generation unit 65 generates corresponding alternative voice data 250 in order from the secret word 110 having the highest use frequency among the plurality of secret words 110 registered in the conversion dictionary 652, and the generated alternative voice data 250. Is stored in the storage unit 55.

図２０では、秘匿ワード１１１の「長谷不動産」の使用頻度は１０であり、秘匿ワード１１２の「高橋電器」の使用頻度は２０であり、秘匿ワード１１３の「松原工務店」の使用頻度は５である。この場合、３つの秘匿ワード１１２，１１１，１１３に着目すると、音声生成部６５は、秘匿ワード１１２，１１１，１１３の順にそれぞれ対応する代替音声データ２５０を生成する。すなわち、代替音声データ２５２，２５１，２５３が、この順序で生成される。 In FIG. 20, the usage frequency of the secret word 111 "Hase Real Estate" is 10, that of the secret word 112 "Takahashi Electric" is 20, and that of the secret word 113 "Matsubara Construction" is 5. In this case, focusing on the three secret words 112, 111, and 113, the voice generation unit 65 generates the corresponding alternative voice data 250 in the order of the secret words 112, 111, and 113. That is, the alternative voice data 252, 251, and 253 are generated in this order.

なお、変換辞書６５２に記述された秘匿ワード１１０の使用頻度は、ＭＦＰ１０を使用する複数のユーザによる秘匿ワード１１０の使用頻度（換言すれば、ＭＦＰ１０の使用頻度）であってもよく、あるいは、現在ＭＦＰ１０を操作しているユーザ１０１（ログインユーザ）による秘匿ワード１１０の使用頻度であってもよい。 Note that the usage frequency of a secret word 110 recorded in the conversion dictionary 652 may be the frequency with which the secret word 110 is used by the plurality of users of the MFP 10 (in other words, the usage frequency on the MFP 10), or it may be the frequency with which the secret word 110 is used by the user 101 currently operating the MFP 10 (the logged-in user).

このように、秘匿ワード１１０の使用頻度に基づく優先順位（のみ）に従って、複数の代替音声データ２５０が順次に生成されるようにしてもよい。これによれば、ユーザ１０１により発せられる可能性の高い秘匿ワード１１０に対応する代替音声データ２５０を予め生成しておくことが可能である。したがって、合成音声データ４５０の生成の際に、格納部５５に格納されている代替音声データ２５０が用いられる可能性が高くなる。 As described above, a plurality of alternative voice data 250 may be sequentially generated according to the priority (only) based on the usage frequency of the secret word 110. According to this, it is possible to generate in advance the alternative voice data 250 corresponding to the secret word 110 that is likely to be issued by the user 101. Therefore, when the synthesized voice data 450 is generated, there is a high possibility that the alternative voice data 250 stored in the storage unit 55 is used.

さらには、現在のスキャンモードと使用頻度との双方を考慮した優先順位に従って、複数の代替音声データ２５０が順次に生成されるようにしてもよい。 Furthermore, a plurality of alternative audio data 250 may be sequentially generated in accordance with the priority order considering both the current scan mode and the usage frequency.
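
The priority orderings discussed above can be sketched with a single sort key. This is an illustrative Python sketch under stated assumptions: the function name `generation_order` and its parameters are hypothetical, and the particular way of combining the two criteria (words of the current mode first, then descending usage frequency) is only one possible reading, since the text does not specify how the two are weighed. The usage frequencies 10, 20 and 5 are those given for FIG. 20.

```python
from typing import Dict, Iterable, List


def generation_order(usage: Dict[str, int],
                     current_mode_words: Iterable[str] = ()) -> List[str]:
    """Return secret-word IDs in the order their alternative voice data is generated."""
    preferred = set(current_mode_words)
    # False sorts before True, so words shown in the current mode come first;
    # within each group, a higher usage frequency comes earlier.
    return sorted(usage, key=lambda w: (w not in preferred, -usage[w]))


# Usage frequencies from FIG. 20, keyed by secret-word reference numeral.
usage = {"111": 10, "112": 20, "113": 5}
print(generation_order(usage))            # frequency only
print(generation_order(usage, ["113"]))   # assumed: word 113 is shown in the current mode
```

With frequency alone the order is 112, 111, 113, matching the order given for FIG. 20. Passing a non-empty `current_mode_words` moves those words to the front without disturbing the frequency order of the rest.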

また、上記第３実施形態等においては、ユーザ１０１からのサポート依頼信号を案内サーバ５０が受信したことに応答して、複数の代替音声データ２５０の生成が開始される態様が例示されているが、これに限定されない。 The third embodiment and the like illustrate a mode in which generation of the plurality of alternative voice data 250 is started in response to the guidance server 50 receiving the support request signal from the user 101, but the present invention is not limited to this.

たとえば、画像データ３００を案内サーバ５０が受信すると、当該画像データ３００に含まれている秘匿ワード１１０に対応する代替音声データ２５０の生成が開始されるようにしてもよい。 For example, when the guidance server 50 receives the image data 300, the generation of the alternative voice data 250 corresponding to the secret word 110 included in the image data 300 may be started.

具体的には、案内サーバ５０による画像データ３００（３０１）の受信に応答して、音声生成部６５は、当該画像データ３００に含まれている複数の秘匿ワード１１０に対応する複数の代替音声データ２５０の生成を開始する。なお、生成された代替音声データ２５０は格納部５５に格納される。たとえば、秘匿ワード１１１，１１２，１１３に対応する代替音声データ２５１，２５２，２５３が生成され、格納部５５に随時格納される。 Specifically, in response to the guidance server 50 receiving the image data 300 (301), the voice generation unit 65 starts generating a plurality of alternative voice data 250 corresponding to the plurality of secret words 110 contained in the image data 300. The generated alternative voice data 250 is stored in the storage unit 55. For example, the alternative voice data 251, 252, and 253 corresponding to the secret words 111, 112, and 113 are generated and stored in the storage unit 55 as they become available.

これら複数の代替音声データ２５０の生成中あるいは生成完了後において、図１９のステップＳ２０以降の動作と同様の動作が実行される。具体的には、ユーザ音声データ４００に秘匿ワード１１０が含まれ且つ秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に既に格納されている場合には、格納部５５に格納されている当該代替音声データ２５０を用いて合成音声データ４５０が生成される。一方、ユーザ音声データ４００に秘匿ワード１１０が含まれ且つ秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に格納されていない場合には、当該代替音声データ２５０が機械音声生成処理により生成され、生成された当該代替音声データ２５０を用いて合成ユーザ音声データが生成される。 During the generation of the plurality of alternative audio data 250 or after the generation is completed, the same operation as the operation after step S20 in FIG. 19 is executed. Specifically, when the user voice data 400 includes the secret word 110 and the alternative voice data 250 corresponding to the secret word 110 is already stored in the storage unit 55, the user voice data 400 stored in the storage unit 55 Synthetic voice data 450 is generated using alternative voice data 250. On the other hand, when the user voice data 400 includes the secret word 110 and the alternative voice data 250 corresponding to the secret word 110 is not stored in the storage unit 55, the alternative voice data 250 is generated by the machine voice generation process. Then, synthesized user voice data is generated using the generated alternative voice data 250.

これによれば、受信した画像データ３００に含まれている秘匿ワード１１０に対応する代替音声データ２５０が優先的に生成されるので、ユーザ１０１により発せられる可能性が比較的高い秘匿ワード１１０に対応する代替音声データ２５０が予め生成され得る。したがって、合成音声データ４５０の生成の際に、格納部５５に格納されている代替音声データ２５０が用いられる可能性を向上させることができる。 According to this, the alternative voice data 250 corresponding to the secret words 110 contained in the received image data 300 is generated preferentially, so the alternative voice data 250 corresponding to the secret words 110 that the user 101 is relatively likely to utter can be generated in advance. This increases the likelihood that the alternative voice data 250 stored in the storage unit 55 is used when the synthesized voice data 450 is generated.

また、このような改変例において、上述の使用頻度に基づく優先順位に従って、複数の代替音声データ２５０が順次に生成されるようにしてもよい。すなわち、受信した画像データ３００に含まれる複数の秘匿ワードの使用頻度に基づく優先順位に従って、当該複数の秘匿ワード１１０に対応する複数の代替音声データ２５０が生成されるようにしてもよい。 In such a modification, a plurality of alternative audio data 250 may be sequentially generated in accordance with the priority order based on the above-described usage frequency. That is, a plurality of alternative sound data 250 corresponding to the plurality of secret words 110 may be generated in accordance with the priority order based on the use frequency of the plurality of secret words included in the received image data 300.

＜第４実施形態＞ <Fourth Embodiment>
第４実施形態は、第１実施形態の変形例である。以下では、第１実施形態との相違点を中心に説明する。 The fourth embodiment is a modification of the first embodiment. The following description focuses on the differences from the first embodiment.

第１実施形態においては、部分音声データ４３０を案内サーバ５０が受信すると、当該部分音声データ４３０に含まれた秘匿ワード１１０に対応する代替音声データ２５０（２５１）が、その都度、音声生成部６５により生成される。そして、音声生成部６５は、当該代替音声データ２５０を利用して合成音声データ４５０（４５１）を生成する。第１実施形態においては、音声生成部６５は代替音声データ２５０を逐次生成し、生成された代替音声データ２５１は格納されない。 In the first embodiment, when the guidance server 50 receives the partial voice data 430, the voice generation unit 65 generates, each time, the alternative voice data 250 (251) corresponding to the secret word 110 contained in that partial voice data 430. The voice generation unit 65 then generates the synthesized voice data 450 (451) using the alternative voice data 250. In the first embodiment, the voice generation unit 65 generates the alternative voice data 250 each time it is needed, and the generated alternative voice data 251 is not stored.

この第４実施形態では、音声生成部６５により生成された（すなわち、合成音声データ４５０の生成に利用された）代替音声データ２５０が案内サーバ５０の格納部５５に格納される。そして、格納部５５に格納されている代替音声データ２５０を用いて合成音声データ４５０が生成される。 In the fourth embodiment, the alternative voice data 250 generated by the voice generation unit 65 (that is, used for generating the synthesized voice data 450) is stored in the storage unit 55 of the guidance server 50. Then, synthesized voice data 450 is generated using the substitute voice data 250 stored in the storage unit 55.

第４実施形態では、ユーザ１０１により発せられたユーザ音声データ４００のうち、部分音声データ４３１とは異なる部分音声データ４３２（後述）を案内サーバ５０が受信した状況を想定する。格納部５５に予め格納された代替音声データ２５０に対応する秘匿ワード１１０が当該部分音声データ４３２内に含まれる場合には、音声生成部６５は、当該格納された代替音声データ２５０を用いて合成音声データ４５２を生成する。 The fourth embodiment assumes a situation in which the guidance server 50 receives, out of the user voice data 400 uttered by the user 101, partial voice data 432 (described later) that differs from the partial voice data 431. When the partial voice data 432 contains a secret word 110 for which alternative voice data 250 is already stored in the storage unit 55, the voice generation unit 65 generates the synthesized voice data 452 using the stored alternative voice data 250.

ここにおいて、部分音声データ４３２は、ユーザ音声データ４００のうち、部分音声データ４３１の次に音声認識部６４が抽出した部分の音声データである。 Here, the partial voice data 432 is voice data of a part extracted by the voice recognition unit 64 next to the partial voice data 431 in the user voice data 400.

図２１は、第４実施形態に係る案内サーバ５０の音声処理に関する動作を示すフローチャートである。 FIG. 21 is a flowchart showing an operation related to voice processing of the guidance server 50 according to the fourth embodiment.

案内サーバ５０はユーザ音声データ４００を受信し（ステップＳ２１）、ステップＳ２１〜Ｓ２３の処理を実行することにより音声認識部６４は部分音声データ４３０を抽出する。その後、音声認識部６４は、当該部分音声データ４３０に対する音声認識処理によって、部分音声データ４３０に秘匿ワード１１０が含まれるか否かを判定する（ステップＳ２４）。そして、部分音声データ４３０内に秘匿ワード１１０が含まれる旨が判定される場合には、音声生成部６５は、当該秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に格納されているか否かを判定する（ステップＳ６２）。 The guidance server 50 receives the user voice data 400 (step S21), and the voice recognition unit 64 extracts the partial voice data 430 by executing the processes of steps S21 to S23. Thereafter, the voice recognition unit 64 determines whether or not the secret word 110 is included in the partial voice data 430 by voice recognition processing on the partial voice data 430 (step S24). When it is determined that the secret word 110 is included in the partial voice data 430, the voice generation unit 65 determines whether the alternative voice data 250 corresponding to the secret word 110 is stored in the storage unit 55. Is determined (step S62).

秘匿ワード１１０に対応する代替音声データ２５０が格納部５５に格納されていないことが判定される場合には、音声生成部６５は、秘匿ワード１１０に対応する代替音声データ２５０を生成し（ステップＳ２５）、生成した代替音声データ２５０を格納部５５に格納する（ステップＳ６４）。そして、音声生成部６５は、部分音声データ４３０内の秘匿音声データ１５０を、生成した代替音声データ２５０に置き換えた合成音声データ４５０を生成する（ステップＳ２６）。当該合成音声データ４５０はサポータ端末７０に送信される（ステップＳ２７）。 When it is determined that the alternative voice data 250 corresponding to the secret word 110 is not stored in the storage unit 55, the voice generation unit 65 generates the alternative voice data 250 corresponding to the secret word 110 (step S25) and stores the generated alternative voice data 250 in the storage unit 55 (step S64). The voice generation unit 65 then generates the synthesized voice data 450 by replacing the secret voice data 150 in the partial voice data 430 with the generated alternative voice data 250 (step S26). The synthesized voice data 450 is transmitted to the supporter terminal 70 (step S27).

一方、当該代替音声データ２５０が格納部５５に格納されていることが判定される場合には、音声生成部６５は、格納されていた代替音声データ２５０を格納部５５から取得する（ステップＳ６３）。 On the other hand, when it is determined that the alternative voice data 250 is stored in the storage unit 55, the voice generation unit 65 acquires the stored alternative voice data 250 from the storage unit 55 (step S63).

そして、音声生成部６５は、部分音声データ４３０内の秘匿音声データ１５０を、取得した代替音声データ２５０に置き換えた合成音声データ４５０を生成する（ステップＳ２６）。当該合成音声データ４５０はサポータ端末７０に送信される（ステップＳ２７）。 Then, the voice generation unit 65 generates synthesized voice data 450 in which the secret voice data 150 in the partial voice data 430 is replaced with the acquired alternative voice data 250 (step S26). The synthesized voice data 450 is transmitted to the supporter terminal 70 (step S27).
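
The branch between steps S62, S63, S25, and S64 amounts to a lookup-before-generate cache keyed by the secret word. The following is a minimal sketch of that idea only, not the implementation described in the patent: the class name `AlternativeVoiceStore`, the in-memory dictionary standing in for the storage unit 55, and the `fake_tts` synthesizer are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


class AlternativeVoiceStore:
    """Hypothetical stand-in for storage unit 55 plus the S62/S63/S25/S64 branch."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize          # assumed text-to-speech callable
        self._stored: Dict[str, bytes] = {}    # secret word -> alternative voice data

    def get(self, secret_word: str, alternative_word: str) -> bytes:
        # Step S62: is alternative voice data for this secret word already stored?
        if secret_word in self._stored:
            # Step S63: reuse the stored alternative voice data.
            return self._stored[secret_word]
        # Step S25: generate alternative voice data for the alternative word.
        audio = self._synthesize(alternative_word)
        # Step S64: store it so later utterances can skip generation.
        self._stored[secret_word] = audio
        return audio


# Demo with a fake synthesizer that records how often it is called.
calls: List[str] = []


def fake_tts(word: str) -> bytes:
    calls.append(word)
    return word.encode("utf-8")


store = AlternativeVoiceStore(fake_tts)
first = store.get("Hase Real Estate", "ABC")    # generated and stored
second = store.get("Hase Real Estate", "ABC")   # served from the store
```

In this sketch the second lookup returns the stored data without invoking the synthesizer again, which corresponds to the time saving that the fourth embodiment attributes to reusing stored alternative voice data.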

図２２は、第４実施形態における画像データ３００および部分音声データ４３１，４３２に関するタイミングを示す図である。また、図２３は、或る合成音声データ４５０（４５１）の生成に利用された代替音声データ２５０（２５１）が格納部５５へ格納される状況を示す図であり、図２４は、別の合成音声データ４５０（４５２）の生成の際に、既に格納されている代替音声データ２５０（２５１）が用いられる状況を示す図である。 FIG. 22 is a diagram illustrating timings related to the image data 300 and the partial voice data 431 and 432 in the fourth embodiment. FIG. 23 is a diagram showing a situation in which the alternative voice data 250 (251) used for generating certain synthesized voice data 450 (451) is stored in the storage unit 55, and FIG. 24 is a diagram showing a situation in which the already stored alternative voice data 250 (251) is used when generating other synthesized voice data 450 (452).

図２２〜図２４をも参照しながら、或る合成音声データ４５１の生成に際して利用された代替音声データ２５１が予め格納部５５に格納され、格納済みの代替音声データ２５１を用いて別の合成音声データ４５２が生成される動作について説明する。 With reference also to FIGS. 22 to 24, the following describes an operation in which the alternative voice data 251 used in generating certain synthesized voice data 451 is stored in the storage unit 55 in advance, and other synthesized voice data 452 is generated using the stored alternative voice data 251.

まず、案内サーバ５０は、ユーザ音声データ４００から部分音声データ４３１を抽出する（ステップＳ２３）。たとえば、図２３では、ユーザ１０１により発せられた音声のうち、「ファイルをスキャンして長谷不動産に送りたいのです。」の部分の音声のデータが部分音声データ４３１として抽出される状況が示されている。 First, the guidance server 50 extracts partial voice data 431 from the user voice data 400 (step S23). For example, FIG. 23 shows a situation in which, of the voice uttered by the user 101, the voice data of the portion "I want to scan a file and send it to Hase Real Estate." is extracted as the partial voice data 431.

その後、音声認識部６４は、部分音声データ４３１内に秘匿ワード１１１「長谷不動産」が含まれる旨を判定する（ステップＳ２４）。この時点では、秘匿ワード１１１に対応する代替音声データ２５０は、格納部５５には格納されていないので、ステップＳ２４からステップＳ６２を経てステップＳ２５に進む。そして、音声生成部６５は、秘匿ワード１１１「長谷不動産」に対応する代替ワード２１１「ＡＢＣ」の代替音声データ２５１を生成し（ステップＳ２５）、生成した代替音声データ２５１（「ＡＢＣ」）を案内サーバ５０の格納部５５に格納する（ステップＳ６４）。 Thereafter, the voice recognition unit 64 determines that the secret word 111 "Hase Real Estate" is included in the partial voice data 431 (step S24). At this point, the alternative voice data 250 corresponding to the secret word 111 is not stored in the storage unit 55, so the process proceeds from step S24 through step S62 to step S25. Then, the voice generation unit 65 generates the alternative voice data 251 of the alternative word 211 "ABC" corresponding to the secret word 111 "Hase Real Estate" (step S25), and stores the generated alternative voice data 251 ("ABC") in the storage unit 55 of the guidance server 50 (step S64).

そして、音声生成部６５は、生成した代替音声データ２５１（「ＡＢＣ」）を用いて合成音声データ４５１を生成し（ステップＳ２６）、案内サーバ５０は当該合成音声データ４５１をサポータ端末７０に送信する（ステップＳ２７）。サポータ端末７０は、受信した合成音声データ４５１に基づく音声（「ファイルをスキャンしてＡＢＣに送りたいのです。」）を出力する。 Then, the voice generation unit 65 generates synthesized voice data 451 using the generated alternative voice data 251 ("ABC") (step S26), and the guidance server 50 transmits the synthesized voice data 451 to the supporter terminal 70 (step S27). The supporter terminal 70 outputs a voice based on the received synthesized voice data 451 ("I want to scan a file and send it to ABC.").

その後、案内サーバ５０は、ユーザ音声データ４００から、別の部分の音声データである部分音声データ４３２を抽出する（ステップＳ２３）。たとえば、図２４では、ユーザ１０１により発せられた音声のうち、「長谷不動産をタッチしましたが、次はどうすれば良いですか？」の部分の音声のデータが新たな部分音声データ４３２として抽出される状況が示されている。 Thereafter, the guidance server 50 extracts partial voice data 432, which is voice data of another part, from the user voice data 400 (step S23). For example, in FIG. 24, the voice data of the portion “Hase Real Estate has been touched, what should I do next?” In the voice uttered by the user 101 is extracted as new partial voice data 432. The status is shown.

音声認識部６４は、部分音声データ４３２に秘匿ワード１１０（秘匿ワード１１１「長谷不動産」）が含まれている旨を判定し、当該秘匿ワード１１０（１１１）に対応する代替ワード２１１「ＡＢＣ」を求める。また、当該代替ワード２１１「ＡＢＣ」に対応する代替音声データ２５１（「ＡＢＣ」）が格納部５５に既に格納されているか否かが判定される。この時点では、当該代替音声データ２５１（「ＡＢＣ」）が格納部５５に既に格納されている旨が判定される。換言すれば、既に格納部５５に格納されている代替音声データ２５１（「ＡＢＣ」）に対応する秘匿ワード１１０（「長谷不動産」）が部分音声データ４３２内に含まれている旨が判定される。 The voice recognition unit 64 determines that the secret word 110 (the secret word 111 "Hase Real Estate") is included in the partial voice data 432, and obtains the alternative word 211 "ABC" corresponding to that secret word 110 (111). It is also determined whether the alternative voice data 251 ("ABC") corresponding to the alternative word 211 "ABC" is already stored in the storage unit 55. At this point, it is determined that the alternative voice data 251 ("ABC") is already stored in the storage unit 55. In other words, it is determined that the secret word 110 ("Hase Real Estate") corresponding to the alternative voice data 251 ("ABC") already stored in the storage unit 55 is included in the partial voice data 432.

そして、音声生成部６５は、当該格納された代替音声データ２５１を格納部５５から取得する（ステップＳ６３）。ここでは、秘匿ワード１１１「長谷不動産」に対応する代替ワード２１１「ＡＢＣ」の代替音声データ２５１が格納部５５から取得される。音声生成部６５は、格納部５５から取得された当該代替音声データ２５１を用いて合成音声データ４５２を生成する（ステップＳ２６）。その後、案内サーバ５０は、生成された合成音声データ４５２をサポータ端末７０に送信し（ステップＳ２７）、サポータ端末７０は、受信した合成音声データ４５２に基づく音声（「ＡＢＣをタッチしましたが、次はどうすれば良いですか？」）を出力する。 Then, the voice generation unit 65 acquires the stored alternative voice data 251 from the storage unit 55 (step S63). Here, the alternative voice data 251 of the alternative word 211 "ABC" corresponding to the secret word 111 "Hase Real Estate" is acquired from the storage unit 55. The voice generation unit 65 generates synthesized voice data 452 using the alternative voice data 251 acquired from the storage unit 55 (step S26). Thereafter, the guidance server 50 transmits the generated synthesized voice data 452 to the supporter terminal 70 (step S27), and the supporter terminal 70 outputs a voice based on the received synthesized voice data 452 ("I touched ABC. What should I do next?").

以上のような動作によれば、或る合成音声データ４５１の生成に際して利用された代替音声データ２５１が予め格納されて、次の合成音声データ４５２の生成の際に利用される。そのため、当該代替音声データ２５１の生成を再び行わずに済む。したがって、合成音声データ４５２の生成に要する時間が短縮されるので、サポータ端末７０への合成音声データ４５２の送信の遅延を抑制することが可能である。 According to the operation as described above, the alternative voice data 251 used when generating a certain synthesized voice data 451 is stored in advance and used when the next synthesized voice data 452 is generated. Therefore, it is not necessary to generate the alternative audio data 251 again. Therefore, since the time required for generating the synthesized speech data 452 is shortened, it is possible to suppress a delay in transmitting the synthesized speech data 452 to the supporter terminal 70.

＜第５実施形態＞ <Fifth Embodiment>
第５実施形態は、第１実施形態の変形例である。以下では、第１実施形態との相違点を中心に説明する。 The fifth embodiment is a modification of the first embodiment. The following description focuses on the differences from the first embodiment.

第１実施形態では、スキャンモードにおいて、画像データ３００（３０１）に含まれる送信宛先が秘匿ワード１１０として決定される態様が例示されている。より詳細には、ＭＦＰ１０から受信した秘匿ワードリスト６０１において、画像データ３００に基づく画像に含まれる送信宛先が秘匿ワード１１０として登録されている。そのような登録内容に基づいて秘匿ワードが決定される。 The first embodiment exemplifies a mode in which the transmission destination included in the image data 300 (301) is determined as the secret word 110 in the scan mode. More specifically, in the secret word list 601 received from the MFP 10, the transmission destination included in the image based on the image data 300 is registered as the secret word 110. A secret word is determined based on such registered contents.

第５実施形態では、ボックスモードにおいて、ＭＦＰ１０のボックスに格納されたファイル５５０のファイル名、作成者、日付、およびファイル本文の見出しを示す語句（ワード）が秘匿ワード１１０として決定される態様を例示する。この第５実施形態では、ＭＦＰ１０のボックスに格納されたファイル５５０に関する情報の表示画面を見ながら操作案内が行われる。以下、第５実施形態における画像処理および音声処理に関して順次に説明する。 The fifth embodiment exemplifies a mode in which, in the box mode, the words indicating the file name, creator, date, and body headings of a file 550 stored in a box of the MFP 10 are determined as secret words 110. In the fifth embodiment, operation guidance is given while viewing a display screen of information relating to the file 550 stored in the box of the MFP 10. Image processing and voice processing in the fifth embodiment are described below in order.

まず、ユーザ１０１からのサポート依頼信号の受信に際して、案内サーバ５０は、秘匿ワードリスト６０３をもＭＦＰ１０から受信する（図２７参照）。ここでは、秘匿ワード１１０として、ＭＦＰ１０のボックスに格納されたファイル５５０のファイル名（「パテント」等）、作成者（「山田太郎」等）、日付（「２０１３／０３／１１」等）、およびファイル本文の見出し（「画像形成装置」および「発明概要」等）を示す各語句（ワード）が秘匿ワードリスト６０３に登録されている。そして、案内サーバ５０は、当該秘匿ワードリスト６０３に基づいて、変換辞書６５３を生成する。 First, when receiving a support request signal from the user 101, the guidance server 50 also receives the secret word list 603 from the MFP 10 (see FIG. 27). Here, the words indicating the file name (such as "Patent"), the creator (such as "Taro Yamada"), the date (such as "2013/03/11"), and the headings of the file body (such as "image forming apparatus" and "invention summary") of the file 550 stored in the box of the MFP 10 are registered in the secret word list 603 as secret words 110. The guidance server 50 then generates a conversion dictionary 653 based on the secret word list 603.

図２５および図２６を参照して第５実施形態における画像処理に関して説明する。 Image processing in the fifth embodiment will be described with reference to FIGS. 25 and 26.

ここでは、図２５に示すように、ファイル５５０に関する情報表示画面である画像データ３０３（３００）がＭＦＰ１０のタッチパネル２５に表示されているものとする。当該画像データ３０３を案内サーバ５０が受信すると、変換辞書６５３に基づく画像処理によって、合成画像データ３５３（３５０）が生成される。そして、サポータ端末７０の表示部７６ｂにおいて合成画像データ３５３が表示される。 Here, as shown in FIG. 25, it is assumed that image data 303 (300), which is an information display screen related to file 550, is displayed on touch panel 25 of MFP 10. When the guide server 50 receives the image data 303, composite image data 353 (350) is generated by image processing based on the conversion dictionary 653. Then, the composite image data 353 is displayed on the display unit 76b of the supporter terminal 70.

具体的には、ＭＦＰ１０のタッチパネル２５においては、３つのアイコン５００（５０１〜５０３）を有する画像データ３０３が表示されている。これらの各アイコン５００（５０１〜５０３）の下方には、それぞれ対応するファイル５５０（５５１〜５５３）のファイル名「パテント１」〜「パテント３」が表示されている。そして、画像データ３０３を案内サーバ５０が受信すると、変換辞書６５３に基づく画像処理によって、合成画像データ３５３が生成され、合成画像データ３５３はサポータ端末７０に送信される。そして、サポータ端末７０の表示部７６ｂに合成画像データ３５３が表示される。合成画像データ３５３においては、各ファイル５５１〜５５３のファイル名「ＸＹＺ１」〜「ＸＹＺ３」（代替ワードを用いて表現されたファイル名）が、対応するアイコン５０１〜５０３の下方に表示されている。 Specifically, the touch panel 25 of the MFP 10 displays image data 303 having three icons 500 (501 to 503). Below these icons 500 (501 to 503), file names “Patent 1” to “Patent 3” of the corresponding files 550 (551 to 553) are displayed. When the guide server 50 receives the image data 303, the composite image data 353 is generated by image processing based on the conversion dictionary 653, and the composite image data 353 is transmitted to the supporter terminal 70. Then, the composite image data 353 is displayed on the display unit 76b of the supporter terminal 70. In the composite image data 353, the file names “XYZ1” to “XYZ3” (file names expressed using alternative words) of the files 551 to 553 are displayed below the corresponding icons 501 to 503.

つぎに、ファイル５５１「パテント１」に対応するアイコン５０１がユーザ１０１により押下される状況を想定する。ファイル５５１に対応するアイコン５０１がユーザ１０１により押下されると、画像データ３０４に基づく画像がＭＦＰ１０のタッチパネル２５に表示される（図２６左側参照）。そして、画像データ３０４はＭＦＰ１０から案内サーバ５０に送信される。 Next, it is assumed that the user 101 presses the icon 501 corresponding to the file 551 “Patent 1”. When the icon 501 corresponding to the file 551 is pressed by the user 101, an image based on the image data 304 is displayed on the touch panel 25 of the MFP 10 (see the left side of FIG. 26). Then, the image data 304 is transmitted from the MFP 10 to the guidance server 50.

案内サーバ５０は、画像データ３０４を受信すると、変換辞書６５３（図２７参照）に基づいて、画像データ３０４内に秘匿ワード１１０が含まれるか否かを判定する。 When receiving the image data 304, the guidance server 50 determines whether or not the secret word 110 is included in the image data 304 based on the conversion dictionary 653 (see FIG. 27).

画像データ３０４内に秘匿ワード１１０が含まれる旨が判定される場合には、画像生成部６１は、当該秘匿ワード１１０を代替ワード２１０に置き換えた合成画像データ３５４を生成する。 When it is determined that the secret word 110 is included in the image data 304, the image generation unit 61 generates composite image data 354 in which the secret word 110 is replaced with the alternative word 210.

具体的には、画像データ３０４には、ファイル５５１のファイル名（「パテント１」）、作成者（「山田太郎」）、日付（「２０１３／０３／１１」）の秘匿ワード１１０、ならびにファイル５５１の本文の見出し（「画像形成装置」および「発明概要」）の秘匿ワード１１０が含まれる旨が判定される。画像生成部６１は、当該秘匿ワード１１０をそれぞれ対応する代替ワード２１０に置き換えた合成画像データ３５４（図２６右側参照）を生成する。たとえば、画像データ３０４内の秘匿ワード１１１（１１０）である「パテント」は、合成画像データ３５４の生成に際して、代替ワード２１１（２１０）である「ａｂｃｄ」に置き換えられる。 Specifically, it is determined that the image data 304 contains the secret words 110 for the file name ("Patent 1"), creator ("Taro Yamada"), and date ("2013/03/11") of the file 551, as well as the secret words 110 for the headings of the body of the file 551 ("image forming apparatus" and "invention summary"). The image generation unit 61 generates composite image data 354 (see the right side of FIG. 26) in which each of these secret words 110 is replaced with its corresponding alternative word 210. For example, "Patent", which is the secret word 111 (110) in the image data 304, is replaced with "abcd", which is the alternative word 211 (210), when the composite image data 354 is generated.

そして、案内サーバ５０は、生成した合成画像データ３５４をサポータ端末７０に送信し、サポータ端末７０は、表示部７６ｂに合成画像データ３５４を表示する。 Then, the guidance server 50 transmits the generated composite image data 354 to the supporter terminal 70, and the supporter terminal 70 displays the composite image data 354 on the display unit 76b.

この実施形態では、上述のように、画像データ３００に含まれるファイル５５０のファイル名、作成者、日付およびファイル５５０の本文の見出しが秘匿ワード１１０として決定される。一方、ファイル５５０の本文に含まれる語句（ワード）のうち当該見出し以外のワードは、秘匿ワード１１０として決定されない。 In this embodiment, as described above, the file name, creator, date, and heading of the text of the file 550 included in the image data 300 are determined as the secret word 110. On the other hand, words other than the heading among words (words) included in the text of the file 550 are not determined as the secret word 110.

ただし、当該見出し以外のワードを秘匿ワード１１０として決定せず、そのままサポータ端末７０において表示される場合には、ファイル５５０の本文に含まれる語句（ワード）から漏洩する恐れがある。このような問題を回避するため、画像生成部６１は、当該見出し以外の部分を判読回避画像（当該部分を判読することが不可能な画像）に変換する。 However, if the words other than the headings are not determined as secret words 110 and are displayed as-is on the supporter terminal 70, confidential information may leak through the words contained in the body of the file 550. To avoid this problem, the image generation unit 61 converts the portion other than the headings into an interpretation avoidance image (an image in which that portion cannot be read).

また、ファイル５５０の本文には非常に多数のワードが含まれている可能性が高く、これらのワードの全てに対して個別の変換処理（各ワードを個別の代替ワードに変換する処理）を伴う画像処理を行うことは効率的とは言えない。 In addition, there is a high possibility that the body of the file 550 contains a very large number of words, and individual conversion processing (processing for converting each word into an individual alternative word) is performed on all of these words. Performing image processing is not efficient.

当該多数のワードに対する秘匿化を効率的に行うため、この判読回避画像は、個別の変換処理（各ワードを個別の代替ワードに変換する処理）を伴わない画像処理によって生成される画像であることが好ましい。判読回避画像は、たとえば、ファイル本文の表示領域のうち当該本文の見出し以外の全領域に亘って一律に行われる定型的な画像処理によって生成されればよい。 In order to efficiently conceal the large number of words, the interpretation avoidance image is an image generated by image processing that does not involve individual conversion processing (processing for converting each word into individual substitute words). Is preferred. The interpretation avoidance image may be generated by, for example, standard image processing performed uniformly over the entire area other than the heading of the text in the display area of the file text.

具体的には、ファイル５５１の本文に含まれるワードのうち、当該見出し以外の部分の画像を、その内容を判読することが不可能である「ＤＵＭＭＹ」の文字を羅列させた判読回避画像に変換する（図２６参照）。なお、本実施形態では、判読回避画像として「ＤＵＭＭＹ」の文字を繰り返し表示する画像を用いているが、これに限定されず、たとえば、「＊＊＊（アスタリスク）」などの他の文字を繰り返し表示する画像などであってもよい。また、判読回避画像として、空白画像を用いるようにしてもよい（換言すれば、当該見出し以外の部分の画像を削除するようにしてもよい）。 Specifically, among the words included in the body of the file 551, the image of the portion other than the heading is converted into an interpretation avoidance image in which the characters “DUMMY” that cannot be interpreted are enumerated. (See FIG. 26). In this embodiment, an image that repeatedly displays the characters “DUMMY” is used as the interpretation avoidance image. However, the present invention is not limited to this, and other characters such as “*** (asterisk)” are repeated. It may be an image to be displayed. Further, a blank image may be used as the interpretation avoidance image (in other words, an image of a portion other than the heading may be deleted).
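
The two-tier treatment described above (headings converted word by word through the conversion dictionary, everything else overwritten uniformly) can be sketched as follows. This is only an illustration under stated assumptions: it operates on lines of text rather than on image data, and the function names `dummy_fill` and `redact_preview`, as well as the alternative words "efgh" and "ijkl", are hypothetical and do not appear in the source.

```python
from itertools import cycle, islice
from typing import Dict, List, Set


def dummy_fill(length: int, filler: str = "DUMMY") -> str:
    """Return `length` characters made by repeating `filler` (uniform, content-independent)."""
    return "".join(islice(cycle(filler), length))


def redact_preview(lines: List[str], headings: Set[str],
                   conversion: Dict[str, str]) -> List[str]:
    """Replace headings via the conversion dictionary; overwrite all other lines uniformly."""
    redacted = []
    for line in lines:
        if line in headings:
            # Headings are secret words: use their individual alternative word.
            # A heading missing from the dictionary is overwritten rather than shown.
            redacted.append(conversion.get(line, dummy_fill(len(line))))
        else:
            # Body text: no per-word lookup, just a fixed filler pattern.
            redacted.append(dummy_fill(len(line)))
    return redacted


preview = ["Image forming apparatus", "The apparatus scans.", "Invention summary"]
headings = {"Image forming apparatus", "Invention summary"}
conversion = {"Image forming apparatus": "efgh", "Invention summary": "ijkl"}
result = redact_preview(preview, headings, conversion)
```

The body line never goes through a dictionary lookup, which mirrors the efficiency argument made above: the cost of hiding the body does not grow with the number of distinct words it contains.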

つぎに、第５実施形態における音声処理に関して説明する。 Next, sound processing in the fifth embodiment will be described.

案内サーバ５０がユーザ音声データ４００を受信すると、音声認識部６４は、当該ユーザ音声データ４００に秘匿ワード１１０が含まれるか否かを判定する。 When the guidance server 50 receives the user voice data 400, the voice recognition unit 64 determines whether or not a secret word 110 is included in the user voice data 400.

ここにおいて、秘匿ワード１１０は、上述のように、ファイル５５０のファイル名、作成者、日付、およびファイル本文の見出しを示す語句（ワード）である（図２７参照）。 Here, the secret word 110 is a word (word) indicating the file name, creator, date, and heading of the file text of the file 550 as described above (see FIG. 27).

ユーザ音声データ４００内に秘匿ワード１１０が含まれる旨が判定される場合には、音声生成部６５は、当該秘匿ワード１１０に対応する代替ワード２１０の代替音声データ２５０を生成し、当該代替音声データ２５０を用いて合成音声データ４５０を生成する。 When it is determined that the secret word 110 is included in the user voice data 400, the voice generation unit 65 generates the substitute voice data 250 of the substitute word 210 corresponding to the secret word 110, and the substitute voice data 250 is used to generate synthesized speech data 450.

案内サーバ５０は当該合成音声データ４５０をサポータ端末７０に送信し、サポータ端末７０において当該合成音声データ４５０が出力される。 The guidance server 50 transmits the synthesized voice data 450 to the supporter terminal 70, and the synthesized voice data 450 is output from the supporter terminal 70.

たとえば、ユーザ１０１が画像データ３０４を見ながら発したユーザ音声データ４００に秘匿ワード１１２（１１０）「山田太郎」が含まれていることが判定される場合には、秘匿ワード１１２「山田太郎」に対応する代替ワード２１２「ａｂｃｄ」（図２７参照）の代替音声データ２５０が生成される。その後、生成した代替音声データ２５０を用いて合成音声データ４５０が生成され、合成音声データ４５０がサポータ端末に送信される。合成音声データ４５０を受信したサポータ端末７０は、当該合成音声データ４５０に基づく音声を出力する。 For example, when it is determined that the user voice data 400 uttered by the user 101 while viewing the image data 304 contains the secret word 112 (110) "Taro Yamada", the alternative voice data 250 of the alternative word 212 "abcd" (see FIG. 27) corresponding to the secret word 112 "Taro Yamada" is generated. Thereafter, synthesized voice data 450 is generated using the generated alternative voice data 250, and the synthesized voice data 450 is transmitted to the supporter terminal. The supporter terminal 70 that has received the synthesized voice data 450 outputs a voice based on the synthesized voice data 450.

以上のような動作によれば、ファイル５５０のファイル名、作成者、日付を示す語句が秘匿ワード１１０として決定されて、当該秘匿ワード１１０に対する画像処理（画像変換処理等）および音声処理（音声変換処理等）が行われる。したがって、ＭＦＰ１０のボックスモードにおける表示画面内に含まれる秘匿ワード（機密情報）の漏洩を回避することが可能である。詳細には、ファイル５５０のファイル名、作成者、日付を示す語句に関しては、視覚を通じて機密情報が漏洩することを防止することが可能であるとともに、聴覚を通じて機密情報が漏洩することをも防止することが可能である。 According to the operation as described above, the phrase indicating the file name, creator, and date of the file 550 is determined as the secret word 110, and image processing (image conversion processing, etc.) and voice processing (voice conversion) for the secret word 110 are performed. Processing). Accordingly, it is possible to avoid leakage of a secret word (confidential information) included in the display screen in the box mode of the MFP 10. Specifically, regarding the words and phrases that indicate the file name, creator, and date of the file 550, it is possible to prevent confidential information from being leaked through vision and to prevent leakage of confidential information through hearing. It is possible.

同様に、ファイル５５０の本文の見出しも秘匿ワード１１０として決定されるので、見出しに関して、聴覚および／または視覚を通じて機密情報が漏洩することを防止することが可能である。 Similarly, since the headline of the text of the file 550 is also determined as the secret word 110, it is possible to prevent confidential information from leaking through hearing and / or vision.

また、ファイル５５０の本文の見出し以外の部分に関しては、当該部分が判読回避画像に変換されるので、少なくとも視覚を通じて機密情報が漏洩することを防止することが可能である。 Further, regarding the part other than the heading of the text of the file 550, the part is converted into the interpretation avoidance image, so that it is possible to prevent leakage of confidential information at least visually.

また、仮に、ファイル５５０の本文の見出し以外の部分に関しても変換処理（音声変換処理および／または画像変換処理）を行うときには、非常に多数のワードに関する当該変換処理に多大な時間を要する。一方、上記態様では、当該見出し以外の部分の音声に関する変換処理（音声変換処理）が行われないので、音声変換処理に要する時間を抑制することが可能である。また、画像に関しても、当該見出し以外の部分は、秘匿ワード１１０と判定されず、代替ワード２１０への画像変換処理が行われないので、画像変換処理に要する時間を抑制することが可能である。 In addition, if the conversion process (speech conversion process and / or image conversion process) is performed on a part other than the heading of the body of the file 550, the conversion process for a very large number of words takes a lot of time. On the other hand, in the above aspect, since the conversion process (speech conversion process) related to the voice other than the heading is not performed, the time required for the voice conversion process can be suppressed. Also for the image, the portion other than the heading is not determined as the secret word 110, and the image conversion process to the alternative word 210 is not performed, so that the time required for the image conversion process can be suppressed.

＜変形例等＞ <Modifications>
以上、この発明の実施の形態について説明したが、この発明は上記説明した内容のものに限定されるものではない。 Although the embodiments of the present invention have been described above, the present invention is not limited to the contents described above.

上記各実施形態においては、ＭＦＰ１０からサポータ端末７０への音声伝達処理について例示したが、これに限定されない。たとえば、サポータ端末７０からＭＦＰ１０への音声伝達処理も同様にして実施される。図２８のフローチャートを参照して、サポータ端末７０からＭＦＰ１０への音声伝達処理について説明する。 In each of the above embodiments, the voice transmission process from the MFP 10 to the supporter terminal 70 has been exemplified, but the present invention is not limited to this. For example, the voice transmission process from the supporter terminal 70 to the MFP 10 is performed in the same manner. The voice transmission process from the supporter terminal 70 to the MFP 10 will be described with reference to the flowchart of FIG. 28.

サポータ１０２により発せられたサポータ音声データ４１０はサポータ端末７０により案内サーバ５０へと送信される。案内サーバ５０がサポータ音声データ４１０を受信すると（ステップＳ７０）、音声認識部６４は、サポータ音声データ４１０に非無音部分が存在するか否かを判定する（ステップＳ７１）。その後、音声認識部６４は、サポータ音声データ４１０に所定時間以上の無音部分が存在するか否かを判定する（ステップＳ７２）。 The supporter voice data 410 issued by the supporter 102 is transmitted to the guidance server 50 by the supporter terminal 70. When the guidance server 50 receives the supporter voice data 410 (step S70), the voice recognition unit 64 determines whether or not a non-silent part exists in the supporter voice data 410 (step S71). After that, the voice recognition unit 64 determines whether or not the supporter voice data 410 includes a silent part for a predetermined time or longer (step S72).

サポータ音声データ４１０に所定時間以上の無音部分が存在する旨が判定される場合に、音声認識部６４は、サポータ音声データ４１０の一部である部分音声データ４４０を抽出する（ステップＳ７３）。 When it is determined that there is a silent part for a predetermined time or longer in the supporter voice data 410, the voice recognition unit 64 extracts the partial voice data 440 that is a part of the supporter voice data 410 (step S73).
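
Steps S71 to S73 describe cutting out partial voice data once a silent part of at least a predetermined duration follows a non-silent part. A minimal sketch of that segmentation is given below. The source does not say how silence is detected, so the amplitude threshold, the sample-count representation of the "predetermined time", and the function name `split_on_silence` are all assumptions made for illustration.

```python
from typing import List, Sequence


def split_on_silence(samples: Sequence[int], threshold: int,
                     min_silence: int) -> List[List[int]]:
    """Cut `samples` into utterances that are each followed by >= min_silence quiet samples."""
    segments: List[List[int]] = []
    start = None      # index where the current non-silent part began (cf. step S71)
    silent_run = 0    # length of the current run of quiet samples
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            # cf. step S72: has the silence lasted the predetermined time?
            if silent_run >= min_silence:
                end = i - silent_run + 1          # first sample of the silent run
                segments.append(list(samples[start:end]))  # cf. step S73
                start = None
                silent_run = 0
    # An utterance not yet followed by enough silence is left pending, not emitted.
    return segments


signal = [0, 0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0, 0, 0]
parts = split_on_silence(signal, threshold=1, min_silence=3)
```

Short pauses inside an utterance (the single quiet sample between 6 and 7 in the demo) do not split it; only a pause of the full predetermined length closes a segment.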

そして、音声生成部６５は、部分音声データ４４０に対する音声認識処理によって、サポータ音声データ４１０内に、秘匿ワードリスト６０１（図９参照）に含まれる秘匿ワード１１０のいずれかに対応する代替ワード２１０が含まれるか否かを判定する（ステップＳ７４）。 Then, through voice recognition processing on the partial voice data 440, the voice generation unit 65 determines whether the supporter voice data 410 contains an alternative word 210 corresponding to any of the secret words 110 included in the secret word list 601 (see FIG. 9) (step S74).

部分音声データ４４０に当該代替ワード２１０が含まれる旨が判定される場合に、音声生成部６５は、代替ワード２１０に対応する秘匿ワード１１０の秘匿音声データ１６１を生成する（ステップＳ７５）。 When it is determined that the alternative word 210 is included in the partial voice data 440, the voice generation unit 65 generates the secret voice data 161 of the secret word 110 corresponding to the substitute word 210 (step S75).

そして、音声生成部６５は、部分音声データ４４０に含まれる代替ワード２１０の音声データである代替音声データ２６１を当該秘匿音声データ１６１に置き換えた合成音声データ４６０（合成サポータ音声データ４６０）を生成する（ステップＳ７６）。 Then, the voice generation unit 65 generates synthesized voice data 460 (synthesized supporter voice data 460) in which the substitute voice data 261 that is the voice data of the substitute word 210 included in the partial voice data 440 is replaced with the secret voice data 161. (Step S76).

その後、案内サーバ５０は合成音声データ４６０をＭＦＰ１０に送信し（ステップＳ７７）、ＭＦＰ１０において、当該合成音声データ４６０が出力される。 Thereafter, the guidance server 50 transmits the synthesized voice data 460 to the MFP 10 (step S77), and the synthesized voice data 460 is output in the MFP 10.
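
Steps S74 to S76 apply the conversion dictionary in the reverse direction: alternative words heard from the supporter are mapped back to the secret words before the voice reaches the user. The sketch below illustrates only that reverse lookup, and it works on recognized text instead of voice data, which is a simplification. The function names `invert_dictionary` and `restore_secret_words` are hypothetical.

```python
import re
from typing import Dict


def invert_dictionary(conversion: Dict[str, str]) -> Dict[str, str]:
    """Turn secret->alternative into alternative->secret, rejecting ambiguous mappings."""
    inverse: Dict[str, str] = {}
    for secret, alternative in conversion.items():
        if alternative in inverse:
            raise ValueError(f"alternative word {alternative!r} maps to two secret words")
        inverse[alternative] = secret
    return inverse


def restore_secret_words(text: str, conversion: Dict[str, str]) -> str:
    """Replace every alternative word in `text` with its secret word (cf. steps S74 to S76)."""
    inverse = invert_dictionary(conversion)
    if not inverse:
        return text
    # Longest alternatives first, so "ABCD" is not matched as "ABC" followed by "D".
    # A single regex pass avoids re-replacing text that was just inserted.
    pattern = re.compile(
        "|".join(re.escape(word) for word in sorted(inverse, key=len, reverse=True))
    )
    return pattern.sub(lambda match: inverse[match.group(0)], text)


conversion = {"Hase Real Estate": "ABC"}
restored = restore_secret_words("Please touch ABC.", conversion)
```

The inversion check matters for this direction of the conversion: if two secret words shared one alternative word, the server could not tell which secret word the supporter meant.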

図２９は、サポータ端末７０からＭＦＰ１０への音声伝達処理の一例を示す図である。図２９を参照して具体的に説明する。 FIG. 29 is a diagram illustrating an example of a voice transmission process from the supporter terminal 70 to the MFP 10. This will be specifically described with reference to FIG.

図２９では、図７の音声伝達処理に引き続いてサポータ１０２が、「ＡＢＣをタッチしてください。」との音声を発した状況を想定する。 In FIG. 29, it is assumed that the supporter 102 utters a voice “Please touch ABC” following the voice transmission process of FIG. 7.

音声認識部６４は、まず、「ＡＢＣをタッチしてください。」との音声を含む音声データを部分音声データ４４１として認識する。また、音声認識部６４は、当該部分音声データ４４１（４４０）内に、変換辞書６５１（図９参照）に登録されている複数の秘匿ワード１１０（１１１〜１１３）のいずれかに対応する代替ワード２１０（２１１〜２１３）が含まれるか否かを判定する。具体的には、部分音声データ４４１の「ＡＢＣをタッチしてください。」には、秘匿ワード１１１「長谷不動産」に対応する代替ワード２１１「ＡＢＣ」が含まれる旨が判定される。 First, the voice recognition unit 64 recognizes voice data including a voice “Please touch ABC” as partial voice data 441. In addition, the voice recognition unit 64 includes an alternative word corresponding to one of the plurality of secret words 110 (111 to 113) registered in the conversion dictionary 651 (see FIG. 9) in the partial voice data 441 (440). Whether 210 (211 to 213) is included is determined. Specifically, it is determined that the alternative word 211 “ABC” corresponding to the secret word 111 “Hase Real Estate” is included in “Please touch ABC” in the partial audio data 441.

そして、音声生成部６５は、代替ワード２１１「ＡＢＣ」に対応する秘匿ワード１１１「長谷不動産」の秘匿音声データ１６１（１６０）を生成する。なお、秘匿音声データ１６１（１６０）は、人間の声を模して人工的に生成された音声データ（機械音声データ）である。 Then, the voice generation unit 65 generates the secret voice data 161 (160) of the secret word 111 “Hase Real Estate” corresponding to the alternative word 211 “ABC”. The secret voice data 161 (160) is voice data (mechanical voice data) artificially generated by imitating a human voice.

その後、部分音声データ４４１に含まれる代替ワード２１１「ＡＢＣ」の代替音声データ２６１を当該秘匿音声データ１６１（「長谷不動産」）に置き換えた合成音声データ４６１（４６０）を生成する。そして、案内サーバ５０は、当該合成音声データ４６１をＭＦＰ１０に送信する。 Thereafter, synthesized voice data 461 (460) is generated by replacing the substitute voice data 261 of the substitute word 211 “ABC” included in the partial voice data 441 with the secret voice data 161 (“Hase Real Estate”). Then, the guidance server 50 transmits the synthesized voice data 461 to the MFP 10.

合成音声データ４６１を受信したＭＦＰ１０は、当該合成音声データ４６１を出力する。具体的には、ＭＦＰ１０において、合成音声データ４６１に基づく音声である「長谷不動産をタッチしてください。」が出力される。 The MFP 10 that has received the synthesized voice data 461 outputs the synthesized voice data 461. Specifically, the MFP 10 outputs “Please touch Hase Real Estate.” Which is a voice based on the synthesized voice data 461.

ここにおいて、ユーザ１０１は代替ワード２１０の内容を知らず、サポータ１０２は秘匿ワード１１０の内容を知らない。 Here, the user 101 does not know the content of the alternative word 210, and the supporter 102 does not know the content of the secret word 110.

このため、仮に、サポータ１０２により発せられたサポータ音声データ４１０がそのままＭＦＰ１０に対して送信されると、ユーザ１０１の知らない代替ワード２１０がユーザ１０１に伝達されるので、ユーザ１０１に混乱が生じる恐れがある。 For this reason, if the supporter voice data 410 generated by the supporter 102 is transmitted to the MFP 10 as it is, the alternative word 210 that the user 101 does not know is transmitted to the user 101, which may cause confusion to the user 101. There is.

一方、上記態様によれば、サポータ１０２により発せられたサポータ音声データ４１０に含まれる代替ワード２１０が秘匿音声データ１６０に置き換えられて合成音声データ４６０が生成され、当該合成音声データ４６０がユーザ１０１に送信されるので、ユーザ１０１の混乱を回避することが可能である。 On the other hand, according to the aspect described above, the substitute word 210 included in the supporter voice data 410 issued by the supporter 102 is replaced with the secret voice data 160 to generate the synthesized voice data 460, and the synthesized voice data 460 is transmitted to the user 101. Since it is transmitted, confusion of the user 101 can be avoided.

また、特定の秘匿ワード１１０（１１１）の秘匿音声データ１６０（人工音声）がサポータ音声データ４１０（サポータ音声）に含まれている（人工音声がサポータ音声に含まれている）ので、サポータ音声データ４１０のうち特定の秘匿ワード１１０（１１１）に対応する音声部分に対して何らかの処理が施されていることをユーザ１０１は知得できる。ユーザ１０１が幾つかの秘匿ワードに関する変換処理が施されていることを知っている場合において、特定の秘匿ワード１１０（たとえば１１１）に対して何らかの変換処理が施されていることをも知得したユーザ１０１は、当該特定の秘匿ワード１１０（１１１）がサポータ１０２に伝わっていないことを確認（推測）できる。換言すれば、特定の秘匿ワードに関する機密情報が漏洩していないことを確認できる。 Further, since the secret voice data 160 (artificial voice) of the specific secret word 110 (111) is included in the supporter voice data 410 (supporter voice) (the artificial voice is included in the supporter voice), the supporter voice data The user 101 can know that some processing has been performed on the voice portion corresponding to the specific secret word 110 (111) out of 410. In the case where the user 101 knows that conversion processing related to some secret words has been performed, the user 101 has also learned that some conversion processing has been performed on a specific secret word 110 (for example, 111). The user 101 can confirm (guess) that the specific secret word 110 (111) is not transmitted to the supporter 102. In other words, it can be confirmed that confidential information regarding a specific secret word is not leaked.

また、上記態様においては、サポータ１０２側からユーザ１０１側への音声伝達において、秘匿音声データ１６１（サポータ音声データ４１０に含まれていた代替ワード２１０に対応する秘匿ワード１１０の音声データ）が逐一生成され、当該秘匿音声データ１６１（機械音声）を用いてサポータ音声に対する変換処理（代替ワード２１０を秘匿ワード１１０に変換（逆変換）する処理）が行われている。 Further, in the above aspect, the secret voice data 161 (the voice data of the secret word 110 corresponding to the alternative word 210 included in the supporter voice data 410) is generated one by one in the voice transmission from the supporter 102 side to the user 101 side. Then, conversion processing for the supporter speech (processing for converting the alternative word 210 to the secret word 110 (inverse conversion)) is performed using the confidential speech data 161 (machine speech).

しかしながら、本発明はこれに限定されない。たとえば、まずユーザ１０１側からサポータ１０２側への音声伝達においてユーザ１０１の音声データ（秘匿音声データ１５１）を格納部５５に予め格納しておき（図３０参照）、次にサポータ１０２側からユーザ１０１側への音声伝達がなされた場合に、当該格納部５５に既に格納されている秘匿音声データ１５１を用いて、サポータ音声に対する変換処理が行われる（図３１参照）ようにしてもよい。 However, the present invention is not limited to this. For example, the voice data of the user 101 (secret voice data 151) is first stored in the storage unit 55 in advance in voice transmission from the user 101 side to the supporter 102 side (see FIG. 30), and then the user 101 from the supporter 102 side. When the voice transmission to the side is performed, the conversion process for the supporter voice may be performed using the secret voice data 151 already stored in the storage unit 55 (see FIG. 31).

図３０および図３１を参照して具体的に説明する。図３０は、ユーザ１０１からサポータ１０２への音声伝達処理を示す図である。図３１は、サポータ１０２からユーザ１０１への音声伝達処理がなされる状況を示す図である。 This will be specifically described with reference to FIGS. 30 and 31. FIG. 30 is a diagram illustrating a voice transmission process from the user 101 to the supporter 102. FIG. 31 is a diagram illustrating a situation in which voice transmission processing from the supporter 102 to the user 101 is performed.

図３０では、ユーザ１０１が「ファイルをスキャンして長谷不動産に送りたいのです。」との音声を発した状況が想定されている。 In FIG. 30, it is assumed that the user 101 has issued a voice saying “I want to scan a file and send it to Hase Real Estate”.

ユーザ音声データ４００を受信した案内サーバ５０は、ユーザ音声データ４００から部分音声データ４３１（４３０）を抽出する。そして、変換辞書６５１（図９参照）に基づいて、部分音声データ４３１内に秘匿ワード１１０が含まれるか否かを判定する。 The guidance server 50 that has received the user voice data 400 extracts partial voice data 431 (430) from the user voice data 400. Then, based on the conversion dictionary 651 (see FIG. 9), it is determined whether or not the secret word 110 is included in the partial voice data 431.

具体的には、部分音声データ４３１「ファイルをスキャンして長谷不動産に送りたいのです。」には、秘匿ワード１１１「長谷不動産」が含まれる旨が音声認識部６４によって判定される。 Specifically, the voice recognition unit 64 determines that the partial voice data 431 “I want to scan a file and send it to Hase Real Estate” includes the secret word 111 “Hase Real Estate”.

そして、音声生成部６５は、ユーザ１０１により発せられた秘匿ワード１１１「長谷不動産」の秘匿音声データ１５１を抽出し、案内サーバ５０の格納部５５に格納する。 Then, the voice generation unit 65 extracts the secret voice data 151 of the secret word 111 “Hase Real Estate” issued by the user 101 and stores it in the storage unit 55 of the guidance server 50.

その後、音声生成部６５は、秘匿ワード１１１「長谷不動産」に対応する代替ワード２１１「ＡＢＣ」の代替音声データ２５１を用いて、合成音声データ４５１を生成する。生成された合成音声データ４５１はサポータ端末７０に送信され、サポータ端末７０において出力される。 Thereafter, the voice generation unit 65 generates the synthesized voice data 451 by using the alternative voice data 251 of the alternative word 211 “ABC” corresponding to the secret word 111 “Hase Real Estate”. The generated synthesized voice data 451 is transmitted to the supporter terminal 70 and output from the supporter terminal 70.

このように、この態様では、合成音声データ４５０（４５１）を生成する際に、ユーザ音声データ４００（部分音声データ４３０（４３１））から出した秘匿音声データ１５０（１５１）を格納部に予め格納しておく。 As described above, in this aspect, when the synthesized voice data 450 (451) is generated, the secret voice data 150 (151) extracted from the user voice data 400 (partial voice data 430 (431)) is stored in the storage unit in advance.
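The user-to-supporter steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: audio is modeled as a list of already-recognized text segments rather than waveforms, and the names `Segment`, `CONVERSION_DICT`, `GuidanceServer`, and `user_to_supporter` are hypothetical stand-ins for the conversion dictionary 651, the guidance server 50, and its storage unit 55. The sketch also assumes the alternative voice data 251 is machine-generated voice.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    text: str    # recognized word or phrase
    source: str  # who produced the audio: "user", "supporter", or "machine"


# Hypothetical stand-in for conversion dictionary 651: secret word -> alternative word.
CONVERSION_DICT = {"Hase Real Estate": "ABC"}


@dataclass
class GuidanceServer:
    # Hypothetical stand-in for storage unit 55: secret word -> user's recording.
    storage: dict = field(default_factory=dict)

    def user_to_supporter(self, segments):
        """Replace secret words with alternative words, keeping the user's recording."""
        out = []
        for seg in segments:
            alt = CONVERSION_DICT.get(seg.text)
            if alt is None:
                out.append(seg)  # not a secret word: pass through unchanged
                continue
            self.storage[seg.text] = seg         # store secret voice data (151)
            out.append(Segment(alt, "machine"))  # alternative voice data (251)
        return out


server = GuidanceServer()
utterance = [
    Segment("I want to scan a file and send it to", "user"),
    Segment("Hase Real Estate", "user"),
]
sent = server.user_to_supporter(utterance)
```

After the call, `sent` carries "ABC" in place of the secret word, while the user's original recording of "Hase Real Estate" remains in `server.storage` for later reuse.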

つぎに、図３１を参照しながら、サポータ端末７０からＭＦＰ１０への音声伝達処理について説明する。図３１では、ユーザ１０１からサポータ１０２への音声伝達処理の次に、サポータ１０２が、「ＡＢＣをタッチしてください。」との音声を発した状況を想定している。 Next, an audio transmission process from the supporter terminal 70 to the MFP 10 will be described with reference to FIG. In FIG. 31, it is assumed that, after the voice transmission process from the user 101 to the supporter 102, the supporter 102 emits a voice “Please touch ABC”.

サポータ音声データ４１０を受信した案内サーバ５０は、サポータ音声データ４１０から部分音声データ４４１（４４０）を抽出する。そして、変換辞書６５１（図９参照）に基づいて、部分音声データ４４１内に格納済みの秘匿音声データ１５１（１５０）に対応する代替ワード２１０が含まれるか否かを判定する。 The guide server 50 that has received the supporter voice data 410 extracts the partial voice data 441 (440) from the supporter voice data 410. Then, based on the conversion dictionary 651 (see FIG. 9), it is determined whether or not the alternative word 210 corresponding to the stored secret voice data 151 (150) is included in the partial voice data 441.

具体的には、格納済みの秘匿音声データ１５１（より詳細には、ユーザ１０１により発せられた秘匿ワード１１１「長谷不動産」の音声データ）に対応する代替ワード２１１「ＡＢＣ」が、部分音声データ４４１に基づく音声「ＡＢＣをタッチしてください。」に含まれる旨が、音声認識部６４により判定される。 Specifically, the voice recognition unit 64 determines that the alternative word 211 “ABC”, which corresponds to the stored secret voice data 151 (more specifically, the voice data of the secret word 111 “Hase Real Estate” uttered by the user 101), is included in the voice “Please touch ABC.” based on the partial voice data 441.

そして、音声生成部６５は、部分音声データ４４１内の代替ワード２１１「ＡＢＣ」に対応する代替音声データ２６１（２６０）を、格納済みの秘匿音声データ１５１（１５０）に置き換えた合成音声データ４６１（４６０）を生成する。この合成音声データ４６１の生成に際しては、秘匿音声データ１６１（代替ワード２１１「ＡＢＣ」に対応する秘匿ワード１１１「長谷不動産」の機械音声データ）ではなく、秘匿音声データ１５１（格納部５５に格納されていたユーザ１０１の録音音声データ）が用いられる。換言すれば、予め格納された秘匿音声データ１５０が、部分音声データ４４１内の代替ワード２１１「ＡＢＣ」に対応する秘匿音声データとして利用され、合成音声データ４６１が生成される。 Then, the voice generation unit 65 generates synthesized voice data 461 (460) in which the alternative voice data 261 (260) corresponding to the alternative word 211 “ABC” in the partial voice data 441 is replaced with the stored secret voice data 151 (150). In generating this synthesized voice data 461, the secret voice data 151 (the recorded voice data of the user 101 stored in the storage unit 55) is used instead of the secret voice data 161 (the machine voice data of the secret word 111 “Hase Real Estate” corresponding to the alternative word 211 “ABC”). In other words, the secret voice data 150 stored in advance is used as the secret voice data corresponding to the alternative word 211 “ABC” in the partial voice data 441, and the synthesized voice data 461 is generated.

その後、案内サーバ５０は、合成音声データ４６１（４６０）をサポータ端末７０に送信し、サポータ端末７０において、「長谷不動産をタッチしてください。」の音声が合成音声データ４６１に基づいて出力される。この合成音声データ４６１に含まれる音声「長谷不動産」は、ユーザ１０１の音声を用いて出力され、当該合成音声データ４６１に含まれる音声「をタッチしてください」は、サポータ１０２の音声を用いて出力される。 Thereafter, the guidance server 50 transmits the synthesized voice data 461 (460) to the supporter terminal 70, and the supporter terminal 70 outputs the voice “Please touch Hase Real Estate.” based on the synthesized voice data 461. The voice “Hase Real Estate” included in the synthesized voice data 461 is output in the voice of the user 101, and the voice “Please touch” included in the synthesized voice data 461 is output in the voice of the supporter 102.

このような改変例によれば、サポータ１０２により発せられたサポータ音声データ４１０（部分音声データ４４１）に含まれる代替ワード２１０を、予め格納された秘匿音声データ１５０に置き換えた合成音声データ４６０が音声出力用データとしてユーザ１０１側のＭＦＰ１０に送信される。したがって、ユーザ１０１の知らない代替ワード２１０がユーザ１０１に伝達されることに起因したユーザ１０１の混乱を回避することが可能である。 According to this modified example, the synthesized voice data 460, in which the alternative word 210 included in the supporter voice data 410 (partial voice data 441) uttered by the supporter 102 is replaced with the secret voice data 150 stored in advance, is transmitted to the MFP 10 on the user 101 side as voice output data. Therefore, it is possible to avoid the confusion that would arise if an alternative word 210 unknown to the user 101 were conveyed to the user 101.

また、ユーザ１０１により過去に発せられた音声データが秘匿音声データ１５０として格納部５５に格納されており、当該格納部５５に既に格納されている秘匿音声データ１５０を用いて合成音声データ４６０が生成される。したがって、一の代替ワード２１１「ＡＢＣ」に対応する秘匿ワード１１１「長谷不動産」の音声データである秘匿音声データ１６０を再び生成することを要しないので、合成音声データ４６０の生成に要する時間が短縮される。その結果、ＭＦＰ１０への合成音声データ４６０の送信の遅延を抑制することが可能である。 In addition, voice data uttered in the past by the user 101 is stored in the storage unit 55 as the secret voice data 150, and the synthesized voice data 460 is generated using the secret voice data 150 already stored in the storage unit 55. Accordingly, it is not necessary to generate again the secret voice data 160, which is the voice data of the secret word 111 “Hase Real Estate” corresponding to the one alternative word 211 “ABC”, so the time required to generate the synthesized voice data 460 is shortened. As a result, it is possible to suppress delay in transmitting the synthesized voice data 460 to the MFP 10.
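The inverse conversion on the supporter-to-user path can be sketched in the same simplified style. The sketch assumes audio is modeled as recognized text segments, and the names `Segment`, `ALT_TO_SECRET`, and `supporter_to_user` are hypothetical. Falling back to machine voice (corresponding to the secret voice data 161 of the earlier aspect) when no user recording is stored is an assumption of this sketch, not something the modified example states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str    # recognized word or phrase
    source: str  # who produced the audio: "user", "supporter", or "machine"


# Hypothetical inverse view of the conversion dictionary: alternative word -> secret word.
ALT_TO_SECRET = {"ABC": "Hase Real Estate"}


def supporter_to_user(segments, stored):
    """Convert alternative words back to secret words, preferring stored user recordings."""
    out = []
    for seg in segments:
        secret = ALT_TO_SECRET.get(seg.text)
        if secret is None:
            out.append(seg)  # ordinary speech: keep the supporter's own voice
        elif secret in stored:
            out.append(stored[secret])  # reuse the user's recording (151), no synthesis
        else:
            out.append(Segment(secret, "machine"))  # assumed fallback: machine voice (161)
    return out


# Recording stored during the earlier user-to-supporter transmission.
stored = {"Hase Real Estate": Segment("Hase Real Estate", "user")}
supporter_speech = [Segment("Please touch", "supporter"), Segment("ABC", "supporter")]
to_user = supporter_to_user(supporter_speech, stored)
without_store = supporter_to_user(supporter_speech, {})
```

With a stored recording, the secret word reaches the user in the user's own voice and the rest stays in the supporter's voice; with an empty store, the sketch falls back to a machine-voice segment.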

さらに、ユーザ１０１の発した特定の秘匿ワード１１０（１１１）の秘匿音声データ１５０がサポータ音声データ４１０に含まれている。したがって、ユーザ１０１は、サポータ音声データ４１０のうち特定の秘匿ワード１１０（１１１）に対応する音声部分に対して何らかの処理が施されていることを知得できる。 Further, the secret voice data 150 of the specific secret word 110 (111) issued by the user 101 is included in the supporter voice data 410. Therefore, the user 101 can know that some processing has been performed on the voice portion corresponding to the specific secret word 110 (111) in the supporter voice data 410.

なお、上記態様においては、ユーザ１０１の発した音声（秘匿ワード１１０に係る音声）のデータ（秘匿音声データ１５０）を格納部５５に格納しておき、当該秘匿音声データ１５０をサポータ１０２側からユーザ１０１側への音声伝達処理において利用する態様が例示されているが、これに限定されない。 In the above aspect, the data (secret voice data 150) of the voice uttered by the user 101 (the voice of the secret word 110) is stored in the storage unit 55, and the secret voice data 150 is used in the voice transmission processing from the supporter 102 side to the user 101 side; however, the present invention is not limited to this.

たとえば、サポータ１０２の発した音声（代替ワード２１０に係る音声）のデータ（代替音声データ２６０）を格納部５５に格納しておき、当該代替音声データ２６０をユーザ１０１側からサポータ１０２側への音声伝達処理において利用するようにしてもよい。 For example, the data (alternative voice data 260) of the voice uttered by the supporter 102 (the voice of the alternative word 210) may be stored in the storage unit 55, and the alternative voice data 260 may be used in the voice transmission processing from the user 101 side to the supporter 102 side.

このような態様について、図３１を参照して説明する。 Such an aspect will be described with reference to FIG.

まず、図３１に示すように、サポータ１０２側からユーザ１０１側への音声伝達処理において、サポータ１０２が「ＡＢＣをタッチしてください。」との音声を発すると、案内サーバ５０では、上記態様と同様の処理により、音声変換処理が施され、合成音声データ４６１（４６０）が生成される。この合成音声データ４６１の生成に際して、サポータ１０２により発せられた代替ワード２１１「ＡＢＣ」の録音データである代替音声データ２６１が格納部５５に格納される。 First, as shown in FIG. 31, in the voice transmission processing from the supporter 102 side to the user 101 side, when the supporter 102 utters “Please touch ABC.”, the guidance server 50 performs voice conversion processing in the same manner as in the above aspect, and the synthesized voice data 461 (460) is generated. When this synthesized voice data 461 is generated, the alternative voice data 261, which is the recording of the alternative word 211 “ABC” uttered by the supporter 102, is stored in the storage unit 55.

その後、ユーザ１０１側からサポータ１０２側への音声伝達処理がなされる場合に、音声生成部６５は、当該格納されている代替音声データ２６０（２６１）を用いて合成音声データ４５０を生成する。 Thereafter, when voice transmission processing from the user 101 side to the supporter 102 side is performed, the voice generation unit 65 generates synthesized voice data 450 using the stored alternative voice data 260 (261).

詳細には、たとえば「長谷不動産のボタンを押しますね？」との音声をユーザ１０１が発する場合において、当該音声を含む部分音声データ４３３（不図示）がユーザ音声データ４００から抽出される。そして、部分音声データ４３３に秘匿ワード１１１「長谷不動産」が含まれる旨が音声認識部６４によって判定されると、合成音声データ４５３が生成される。このとき、秘匿ワード１１１「長谷不動産」に対応する代替ワード２１１「ＡＢＣ」の代替音声データ（置換用の音声データ）として、格納部５５に既に格納されている上述の代替音声データ２６０（２６１）が利用されて、合成音声データ４５３が生成される。生成された合成音声データ４５３はサポータ端末７０に送信され、サポータ端末７０において出力される。 Specifically, when the user 101 utters, for example, “I should press the Hase Real Estate button, right?”, partial voice data 433 (not shown) including that voice is extracted from the user voice data 400. When the voice recognition unit 64 determines that the secret word 111 “Hase Real Estate” is included in the partial voice data 433, synthesized voice data 453 is generated. At this time, the above-described alternative voice data 260 (261) already stored in the storage unit 55 is used as the alternative voice data (replacement voice data) of the alternative word 211 “ABC” corresponding to the secret word 111 “Hase Real Estate”, and the synthesized voice data 453 is generated. The generated synthesized voice data 453 is transmitted to the supporter terminal 70 and output at the supporter terminal 70.

このような態様によれば、特に、合成音声データ４５３の生成に際して、格納部５５に予め格納されている代替音声データ２６０が利用されるので、機械音声生成処理によって代替音声データを改めて生成することを要しない。 According to this aspect, in particular, because the alternative voice data 260 stored in advance in the storage unit 55 is used when the synthesized voice data 453 is generated, the alternative voice data does not need to be generated anew by machine voice generation processing.
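The saving described here amounts to a lookup before synthesis. A minimal sketch, assuming voice data is modeled as a `(word, source)` tuple and that machine voice generation runs only on a miss (the text does not spell out miss handling for this aspect): `pick_alternative_voice` and `fake_synthesize` are hypothetical names, and "XYZ" is a made-up alternative word with no stored recording.

```python
def pick_alternative_voice(alt_word, storage, synthesize):
    """Return voice data for alt_word, synthesizing only when nothing is stored."""
    if alt_word in storage:
        return storage[alt_word]  # stored recording (e.g. 261): no synthesis needed
    return synthesize(alt_word)   # assumed fallback: machine voice generation


calls = []  # records every word that actually went through synthesis


def fake_synthesize(word):
    # Hypothetical stand-in for the machine voice generation processing.
    calls.append(word)
    return (word, "machine")


# Supporter's recording of "ABC", stored during supporter-to-user transmission.
storage = {"ABC": ("ABC", "supporter")}

first = pick_alternative_voice("ABC", storage, fake_synthesize)
second = pick_alternative_voice("XYZ", storage, fake_synthesize)
```

Only the word without a stored recording triggers synthesis, which is where the reduction in generation time would come from.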

また、上記各実施形態においては、案内サーバ５０が画像処理および音声処理を行うことが例示されているが、これに限定されない。たとえば、上記案内サーバ５０の動作がＭＦＰ１０により実行されてもよい。具体的には、ＭＦＰ１０が案内サーバ５０の画像処理部６０ａおよび音声処理部６０ｂの動作と同様の動作を行うようにすればよい。 In each of the above embodiments, the guidance server 50 performs the image processing and the voice processing, but the present invention is not limited to this. For example, the operations of the guidance server 50 may be executed by the MFP 10. Specifically, the MFP 10 may perform the same operations as the image processing unit 60a and the voice processing unit 60b of the guidance server 50.

１操作案内システム
１０ＭＦＰ（画像形成装置）
５０案内サーバ
７０サポータ端末
１０１ユーザ
１０２サポータ
１１０〜１１３秘匿ワード
２１０〜２１３代替ワード
３００〜３０４画像データ
３５０〜３５４合成画像データ
４００ユーザ音声データ
４３０〜４３２部分音声データ
１５０〜１５２，１６０秘匿音声データ
２５０，２５１，２６０代替音声データ
４５０〜４５２，４６０，４６１合成音声データ
６５１〜６５３変換辞書

DESCRIPTION OF SYMBOLS
1 Operation guidance system
10 MFP (image forming apparatus)
50 Guidance server
70 Supporter terminal
101 User
102 Supporter
110-113 Secret word
210-213 Alternative word
300-304 Image data
350-354 Composite image data
400 User voice data
430-432 Partial voice data
150-152, 160 Secret voice data
250, 251, 260 Alternative voice data
450-452, 460, 461 Synthesized voice data
651-653 Conversion dictionary

Claims

A guidance server in the operation guidance system,
Image receiving means for receiving, from the image forming apparatus, first display image data which is data of a first display image displayed on an operation unit of the image forming apparatus which is a user's operation target;
When a secret word is included in the first display image data, the first synthesized image data is obtained by replacing the secret word in the first display image with an alternative word corresponding to the secret word. Image generating means for generating composite image data;
Image transmitting means for transmitting the first combined image data as display data on the supporter terminal to a supporter terminal used for operation guidance to the user by a supporter who is a person supporting the user;
Sound receiving means for receiving user sound data including sound data emitted by the user from the image forming apparatus;
Voice recognition means for determining whether or not the secret word is included in the user voice data by voice recognition processing on the user voice data;
When it is determined that the secret word is included in the user voice data, the secret voice data that is the voice data of the secret word in the user voice data is the voice data of the alternative word corresponding to the secret word. Voice generating means for generating synthesized user voice data which is data replaced with certain alternative voice data;
Voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data at the supporter terminal;
A guidance server comprising:

In the guidance server according to claim 1,
The user voice data is divided into a plurality of partial voice data,
When the voice recognition processing determines that the secret word is included in first voice data, which is one piece of partial voice data in the user voice data, the voice generation means generates first synthesized voice data in which the secret voice data in the first voice data is replaced with the alternative voice data,
The guidance server characterized in that the voice transmitting means transmits the first synthesized voice data to the supporter terminal.

In the guidance server according to claim 2,
The first audio data includes audio uttered by the user during display of the first display image,
The image receiving means receives second display image data, which is data of a second display image displayed on the operation unit subsequent to the first display image, after completion of reception of the first display image data. Received from the image forming apparatus,
The image generation means replaces the secret word in the second display image data with an alternative word corresponding to the secret word when a secret word is included in the second display image data. Generating second composite image data which is image data;
The guidance server characterized in that, once reception of the first audio data has started, the image transmitting means does not permit transmission of the second composite image data, and permits transmission of the second composite image data after transmission of the first synthesized audio data generated based on the first audio data is completed.

In the guidance server according to claim 3,
The guidance server characterized in that, when the second display image data is received by the image receiving means within a first period, which runs from the reception start time of the first audio data until the time required to output the first synthesized audio data has elapsed after completion of transmission of the first synthesized audio data, the image transmitting means does not permit transmission of the second composite image data until the end of the first period and permits transmission of the second composite image data after the end of the first period.

In the guidance server according to claim 1,
Storage means for storing audio data;
Further comprising
The voice generation means:
starts generating, from a predetermined time prior to the generation of the synthesized user voice data, a plurality of alternative voice data, which are voice data of alternative words corresponding to a plurality of secret words, and stores the generated alternative voice data in the storage means;
when the user voice data includes the secret word and the alternative voice data corresponding to the secret word is not stored in the storage means, generates the alternative voice data by machine voice generation processing and generates the synthesized user voice data using that alternative voice data; and
when the user voice data includes the secret word and the alternative voice data corresponding to the secret word is already stored in the storage means, generates the synthesized user voice data using the alternative voice data stored in the storage means.

In the guidance server according to claim 5,
The guidance server characterized in that the voice generation means starts generating the plurality of alternative voice data in response to the guidance server receiving a support request signal indicating that an operation guidance request has been made by the user.

In the guidance server according to claim 6,
The guidance server characterized in that the plurality of secret words include at least one of a word indicating a transmission destination included in a destination designation screen for scan image transmission of the image forming apparatus and a word indicating a transmission destination included in a destination designation screen for facsimile transmission of the image forming apparatus.

In the guidance server according to claim 6,
The guide server, wherein the plurality of secret words include a phrase indicating file information displayed on an information display screen related to a file stored in a box of the image forming apparatus.

In the guidance server according to claim 5,
The guidance server characterized in that, among the plurality of alternative voice data, alternative voice data corresponding to a secret word that can be displayed in the current operation mode of the image forming apparatus is generated preferentially.

In the guidance server according to claim 9,
A guide server characterized in that a current operation mode of the image forming apparatus is one of a plurality of modes including a scan mode, a facsimile transmission mode, and a box mode.

In the guidance server according to claim 1,
Storage means for storing audio data;
Further comprising
The voice generation means includes
When the first display image data is received by the image receiving means, generation of a plurality of alternative voice data which are voice data of alternative words corresponding to a plurality of secret words is started, and the generated alternative voice data is Storing in the storage means;
When the user voice data includes the secret word and the alternative voice data corresponding to the secret word is not stored in the storage unit, the alternative voice data is generated by a machine voice generation process. Generating the synthesized user voice data using the alternative voice data,
When the user voice data includes the secret word and the alternative voice data corresponding to the secret word is already stored in the storage unit, the alternative voice data stored in the storage unit is used. And generating the synthesized user voice data.

In the guidance server according to any one of claims 5 to 11,
The guidance server, wherein the voice generation means generates the plurality of alternative voice data according to a priority order based on a frequency of use of the plurality of secret words.

In the guidance server according to claim 2,
Storage means for storing the substitute voice data used for generating the first synthesized voice data;
Further comprising
When the voice recognition processing determines that the secret word is included in second voice data, which is partial voice data of the user voice data different from the first voice data, the voice generation means generates, using the alternative voice data stored in the storage means, second synthesized voice data in which the secret voice data in the second voice data is replaced with the alternative voice data,
The guidance server characterized in that the voice transmitting means transmits the second synthesized voice data to the supporter terminal.

In the guidance server according to claim 1,
The voice receiving means receives supporter voice data including voice data emitted by the supporter,
The voice recognition means determines, by voice recognition processing on the supporter voice data, whether or not one alternative word corresponding to one of one or more secret words is included in the supporter voice data,
When the one alternative word is included in the supporter voice data, the voice generation means generates synthesized supporter voice data in which second alternative voice data, which is the voice data of the one alternative word in the supporter voice data, is replaced with second secret voice data, which is the voice data of the secret word corresponding to the one alternative word,
The guidance server characterized in that the voice transmitting means transmits the synthesized supporter voice data to the image forming apparatus.

In the guidance server according to claim 14,
Storage means for storing audio data;
Further comprising
The voice generation means:
stores, in the storage means, the secret voice data extracted from the user voice data when generating the synthesized user voice data based on the user voice data; and
generates the synthesized supporter voice data by using the secret voice data already stored in the storage means as the second secret voice data.

In the guidance server according to any one of claims 2 to 4,
The guidance server characterized in that, when the user voice data contains a silent portion lasting a predetermined time or longer, the voice recognition means extracts, as the first voice data, partial voice data of the user voice data delimited so as to end at the time when the silent state has continued for the predetermined time.

The guidance server according to claim 16, wherein
The image receiving means also receives second display image data different from the first display image data from the image forming apparatus;
The guidance server characterized in that, when the second display image data is received by the image receiving means during the voice recognition processing of the user voice data, the voice recognition means extracts, as the first voice data, partial voice data of the user voice data delimited so as to end at the reception time of the second display image data.

In the guidance server according to claim 1,
The first display image is an image of an information display screen related to a file stored in a box of the image forming apparatus,
The secret word includes a word indicating at least one of a file name, an author, a date, and a heading of the file body of the file;
The image generating means generates the first composite image data in which the secret word is replaced with the alternative word; and
When the secret word is included in the user voice data, the voice generation means generates the synthesized user voice data in which the secret voice data is replaced with the alternative voice data.

The guide server according to claim 18, wherein
The secret word includes a word indicating a heading of the file body,
The guidance server characterized in that the image generating means generates the first composite image data in which the heading of the file body is replaced with the alternative word and the portion of the file body other than the heading is replaced with a read-avoidance image.

In the computer built in the guidance server in the operation guidance system,
a) receiving, from the image forming apparatus, first display image data that is data of a first display image displayed on an operation unit of the image forming apparatus that is an operation target of the user;
b) First composite image data obtained by replacing the secret word in the first display image with an alternative word corresponding to the secret word when a secret word is included in the first display image data. Generating one composite image data;
c) transmitting the first composite image data as display data on the supporter terminal to a supporter terminal used for guidance to the user by a supporter who is a person supporting the user;
d) receiving user audio data including audio data uttered by the user from the image forming apparatus;
e) determining whether or not the secret word is included in the user voice data by voice recognition processing on the user voice data;
f) When it is determined that the secret word is included in the user voice data, the secret voice data that is the voice data of the secret word in the user voice data is converted to the voice of the alternative word corresponding to the secret word. Generating synthesized user voice data that is data replaced with alternative voice data that is data;
g) transmitting the synthesized user voice data as voice output data at the supporter terminal to the supporter terminal;
A program for causing the computer to execute the above steps.

An operation guidance system,
An image forming apparatus to be operated by the user;
A supporter terminal used for operation guidance of the image forming apparatus to the user by a supporter who is a person supporting the user;
A guidance server that mediates between the image forming apparatus and the supporter terminal;
With
The guidance server is
Image receiving means for receiving, from the image forming apparatus, first display image data which is data of a first display image displayed on the operation unit of the image forming apparatus;
When a secret word is included in the first display image data, the first synthesized image data is obtained by replacing the secret word in the first display image with an alternative word corresponding to the secret word. Image generating means for generating composite image data;
Image transmitting means for transmitting the first composite image data as display data at the supporter terminal to the supporter terminal;
Sound receiving means for receiving user sound data including sound data emitted by the user from the image forming apparatus;
Voice recognition means for determining whether or not the secret word is included in the user voice data by voice recognition processing on the user voice data;
When it is determined that the secret word is included in the user voice data, the secret voice data that is the voice data of the secret word in the user voice data is the voice data of the alternative word corresponding to the secret word. Voice generating means for generating synthesized user voice data which is data replaced with certain alternative voice data;
Voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data at the supporter terminal;
An operation guidance system comprising:

An image forming apparatus in an operation guidance system,
Image acquisition means for acquiring first display image data which is data of a first display image displayed on the operation unit of the image forming apparatus which is a user's operation target;
When a secret word is included in the first display image data, the first synthesized image data is obtained by replacing the secret word in the first display image with an alternative word corresponding to the secret word. Image generating means for generating composite image data;
Image transmitting means for transmitting the first composite image data as display data on the supporter terminal to a supporter terminal used for guidance to the user by a supporter who is a person supporting the user;
Voice acquisition means for acquiring user voice data including voice data generated by the user;
Voice recognition means for determining whether or not the secret word is included in the user voice data by voice recognition processing on the user voice data;
When it is determined that the secret word is included in the user voice data, the secret voice data that is the voice data of the secret word in the user voice data is the voice data of the alternative word corresponding to the secret word. Voice generating means for generating synthesized user voice data which is data replaced with certain alternative voice data;
Voice transmitting means for transmitting the synthesized user voice data to the supporter terminal as voice output data at the supporter terminal;
An image forming apparatus comprising:

In the computer built in the image forming apparatus in the operation guidance system,
a) acquiring first display image data which is data of a first display image displayed on the operation unit of the image forming apparatus which is an operation target of the user;
b) First composite image data obtained by replacing the secret word in the first display image with an alternative word corresponding to the secret word when a secret word is included in the first display image data. Generating one composite image data;
c) transmitting the first composite image data as display data on the supporter terminal to a supporter terminal used for guidance to the user by a supporter who is a person supporting the user;
d) obtaining user voice data including voice data uttered by the user;
e) determining whether or not the secret word is included in the user voice data by voice recognition processing on the user voice data;
f) When it is determined that the secret word is included in the user voice data, the secret voice data that is the voice data of the secret word in the user voice data is converted to the voice of the alternative word corresponding to the secret word. Generating synthesized user voice data that is data replaced with alternative voice data that is data;
g) transmitting the synthesized user voice data as voice output data at the supporter terminal to the supporter terminal;
A program for causing the computer to execute the above steps.