JP4319334B2

JP4319334B2 - Audio / image processing equipment

Info

Publication number: JP4319334B2
Application number: JP2000208021A
Authority: JP
Inventors: 岩夫野崎; 喜也丸本
Original assignee: Noritsu Koki Co Ltd
Current assignee: Noritsu Koki Co Ltd
Priority date: 2000-07-10
Filing date: 2000-07-10
Publication date: 2009-08-26
Anticipated expiration: 2020-07-10
Also published as: JP2002027177A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声データを光学的に読み取り可能なようにコード化された音声コードイメージに変換するコード変換部と、音声付き画像シートを作成するために前記音声コードイメージと画像データに基づく画像イメージをプリントするプリント部を備えた音声・画像処理装置に関する。
【０００２】
【従来の技術】
近年、マルチメディア時代を迎えて、情報の伝達を視覚だけに頼るのではなく、聴覚も利用することが積極的に試みられており、音声付き画像シート、特に音声付き写真もそのような試みの１つであり、例えば、日本国特開平６−２３１４６６号公報、及び、日本国特開平７−１８１６０６号公報では、図や写真、文字に加えて音声を光学的に読取可能に変換したドットコード（音声コードイメージ）を同一の用紙上に印刷し、このドットコードを読み取る専用のスキャナーにより音声が聞こえるという、音声付き画像シートを開示している。このような音声付き画像シートは、特に発音を繰り返し勉強するための語学教材、動物の鳴き声を収録する写真図鑑、音の出る絵本、あるいは、結婚式、成人式、七五三などの記念行事を行事に付随する音声とともに記録する写真として適用されている。
【０００３】
【発明が解決しようとする課題】
また、最近では、適当な写真に、音声コード化されたメッセージを付与したものが、新しいメッセージカードとして注目されてきているが、このようなメッセージカードの作成をＤＰショップ等に依頼する場合、プリントしたい画像を収めた写真フィルムやデジタルカメラの記録メディアを提出するだけでなく、店頭でマイクを通じて音声メッセージを吹き込む必要がある。これは、メッセージの内容にかかわらず、一般の人にとって結構恥ずかしい行為であり、このためにメッセージカードの作成を躊躇する人が少なくない。店頭での音声メッセージの吹き込みを避けるため、予め家で音声メッセージを吹き込んだカセットテープやＭＤなどを持参してもよいが、確認のために再生するケースが多いし、簡単なメッセージのためにいちいち家で録音することは煩わしいものである。
上記実状に鑑み、本発明の課題は、音声付き画像シートを作成依頼する際の音声データの入力方法を改善することである。
【０００４】
【課題を解決するための手段】
上記課題を解決するため、音声データを光学的に読み取り可能なようにコード化された音声コードイメージに変換するコード変換部と、音声付き画像シートを作成するために前記音声コードイメージと画像データに基づく画像イメージをプリントするプリント部を備えた音声・画像処理装置において、本発明では、入力されたテキストデータを処理するテキスト入力処理部と、前記テキスト入力処理部で処理されたテキストデータに基づいて合成音声データを生成する音声合成部とが備えられ、前記コード変換部が前記音声合成部で生成された合成音声データを前記音声コードイメージのためのソース音声データとして使用して、前記テキストデータの音声を再生する音声コードイメージを生成することを特徴としている。
【０００５】
この構成では、音声付き画像シートを作成するために必要な音声コードイメージのソースデータとしてテキストデータの形態で入力されたものから音声合成技術を利用して合成音声データ化されるものを用いるので、顧客はメッセージ内容を肉声で吹き込む必要がない。テキストデータの入力としては、例えば、この音声・画像処理装置に接続されたキーボードを用いて直接メッセージ内容を打ち込んでもよいし、予めワープロ等を利用して作成したメッセージをフロッピー等の記録メディアに記録して、それを店に持ち込んでもよい。さらには、電子メールを介して店にメッセージ内容を送ることも可能であり、その際、作成すべき音声付き画像シートのための画像データを添付ファイルとして送るなら、音声付き画像シートの注文時には顧客が店に出向く必要がなくなる。
【０００８】
音声合成部の一例として、本発明の好適な実施形態では、テキスト解析用辞書を用いて入力テキストデータを解析することでその読みを同定するとともにさらにアクセントと韻律を設定して得られた音韻系列から合成音声エレメント辞書を用いて合成音声データを生成するテキスト音声合成部を備えている。この構成では、キーボードから入力された仮名漢字混じりテキストデータや記録メディアに保存されたテキスト文書や電子メールを通じて送られてきたテキスト文書を読み込むことで入力されたテキストデータに対してテキスト解析用辞書を用いて読みと文節のアクセントが与えられ、さらに合成音声エレメント辞書にアクセスしながらの韻律パラメータの編集工程を通じて音声のパワーと基本周波数を調整することで、ある程度の口調を設定することができる。従って、顧客の要望に応じて、女性口調や男性口調、あるいは怒り口調や喜び口調を選択して、最終的な合成音声データを作成することができる。この点に関する、より好ましい形態として、前記合成音声エレメント辞書に格納される合成音声エレメントを個人別で登録された肉声データに基づいて作製するならば、合成音声を顧客の肉声に類似した親しみのある音声とすることも可能となる。
【０００９】
上述したようなテキスト音声合成部は高度の技術を必要とし、装置的にも操作的にも大きな負担となるものであることから、これに代わる簡易的な音声合成技術として、本発明の別な実施形態では、入力テキストデータを予め登録された語彙やフレーズの肉声データを格納している登録音声エレメント辞書を用いて断片的に順次合成音声データに変換する音声編集合成部を備えているものがある。これは、語彙・フレーズの限定された肉声の断片から文音声を生成する編集合成と呼ばれる方式であり、合成音声データの生成は、テキストデータの断片を登録音声エレメント辞書を用いて音声データの断片で置き換えることで行われるので、高速処理可能でかつ装置コストも抑えることができる。
【００１０】
このような編集合成によって得られる肉声に比べて低品質の合成音声データをより親しみをもって聞くことができるように、本発明では、その登録音声エレメント辞書に、個人別で登録された肉声データを格納することが提案される。つまり、音声付き画像シートの顧客に対して予め、編集合成のために最低限必要とされる音声エレメントを顧客自身の肉声で登録しておく。音声付き画像シートの注文時には、音声メッセージのソースとしてのテキストデータと適当な画像データを提出すると、本人の登録音声エレメントを用いた編集合成で合成音声データが作成されるので、流暢に流れる音声でなくとも、本人の肉声断片が使われているだけに、親しみのある音声として再生されることになる。
【００１１】
さらに、本発明の好適実施形態として、音声合成部が合成音声データの声質を変形させる声質変形部を備えているならば、世の中に存在しないような音声データを作り出すことが可能であり、特に遊び感覚での音声付き画像シートの交換などの目的にかなったものとなる。このような音質変形は、例えば、音声データの周波数を線形変換することにより簡単に実施することができる。その際、音質変形のパラメータを顧客別に管理しておくと、顧客は独自の音声特徴をもった音声データ変形パラメータを自分専用として確保することができる。
【００１２】
キーボードを用いて直接メッセージ内容を打ち込んだりすることを嫌う顧客に対する方策として、本発明の好適な実施形態の１つでは、文字認識装置が追加的に備えられており、この文字認識装置によって出力されたテキストデータが音声コードイメージ変換に用いられる。ここで、文字認識装置は用紙に手書きされた文字をＯＣＲで読み取ってテキストデータ化したり、タッチパネル上で所定のペンで書かれた文字を読み取ってテキストデータ化する装置の総称であり、この構成により、音声付き画像シートを作成依頼する際の音声データの入力方法はさらに簡単になるとともに、その入力形態も多様化することになる。
【００１３】
以上の説明から明らかなように、本発明の重要な特徴は、テキストデータを音声化することにあるが、本発明で取り扱っているテキストデータは、印刷されたテキストとしての文字や数字・記号の集まり、印刷物等に対するスキャナによる読取データ、電子化されたテキストとしての文字や数字・記号の集まり、入力デバイスを通じて逐次入力されるキャラクターコード群などに代表されるように、広義の意味合いで解釈されるべきであり、コンピュータのメディア変換技術において何らかの形で文字情報として認識される全てのデータが含まれるものである。
本発明によるその他の特徴及び利点は、以下図面を用いた実施例の説明により明らかになるだろう。
【００１４】
【発明の実施の形態】
音声付き画像シートを作成するための、本発明による音声・画像処理装置の１つの実施形態が、図１の外観図及び図２の機能ブロック図によって示されている。この音声・画像処理装置の中核となるのが、汎用コンピュータ１であり、図２で示された音声付き画像シートの作成に要求される種々の機能をハードウエアとソフトウエアによって構築している。この音声・画像処理装置はＤＰショップなどの店頭に配置され、顧客の依頼による音声付き写真を作成するサービスを行うタイプのものである。
【００１５】
このコンピュータ１には、Ｉ／Ｏインタフェース部１０を介して種々の入力機器と出力機器が接続している。出力機器としては、最終的に音声付き画像シートとしての音声付き写真２を出力するプリント部として銀塩写真プリンタ３（銀塩写真フィルムのプリントなどに使用されているものが兼用される）、作業中の画像の確認等のためのモニタ４や入力された音声データのチェックのためのスピーカ５が挙げられる。入力機器としては、直接的に音声をコンピュータ１に入力するためのマイク６ａやカセットプレーヤ６ｂ、さらにデジタルカメラによる撮影画像の取り込みのためのカードリーダ７ａや銀塩フィルムからの撮影画像の取り込みのためのフィルムスキャナ７ｂが挙げられる。さらに、テキストデータをコンピュータに入力するための機器として、キーボード８ａ、手書き又は印刷された文字を読み取るフラットベットスキャナ８ｃ、インターネットを通じて送られてくるテキストデータを受信するための通信機器８ｄが挙げられる。
【００１６】
また、音声データや画像データの入出力のためによく用いられているフロッピドライブ８ｅやＭＯドライブ８ｆもコンピュータ１に内蔵されている。キーボード８ａは、マウス８ｂとともに図３で示された各機能に対しコマンドを与えるためにも用いられるし、通信機器８ｄは当然テキストデータだけでなく、画像データも受信することができる。
【００１７】
入力された画像データと音声データを用いて音声付き写真２を作成するしくみは後で詳しく説明するとして、銀塩写真プリンタ３から出力された音声付き写真２では、図３に示すように写真画像領域２ａの周辺に配置された音声コードイメージ領域２ｂに対して専用の読取スキャナ９０で走査すると、この読取スキャナ９０に内蔵されている音声再生回路の働きで音声コードイメージに対応する音声信号が出力され、例えばイヤフォン９１等で音を聞くことができる。
【００１８】
この音声・画像処理装置は、主な機能ユニットととして、図２から理解できるように、音声付き写真２における写真画像のソースとしての画像データを受け取る画像入力処理部２１、音声付き写真２における音声コードイメージのソースとしての音声データを外部から直接受け取る音声入力処理部２２、音声コードイメージに変換される音声データのソースとなるべきテキストデータを受け取るテキスト入力処理部２３、入力されたテキストデータに基づいて合成音声データを生成する音声合成部３０、音声データを光学的に読み取り可能なようにコード化された音声コードイメージに変換するコード変換部４０、画像データ格納部５１、音声コードイメージ格納部５２、そして適正に処理された画像データと音声コードイメージとから音声付き写真２のためのプリントデータを生成する画像音声合成処理部６０を備えている。
【００１９】
画像入力処理部２１は、画像編集部２１ａや画像選択部２１ｂを備えており、カードリーダ７ａ、フィルムスキャナ７ｂ、通信機器８ｄ、フロッピドライブ８ｅ、ＭＯドライブ８ｆなどから入力された画像データは必要に応じて画像選択部２１ｂによって選択され、選択された画像データに対して画像編集部２１ａが色調補正や解像度変換などの編集処理を行う。
【００２０】
音声入力処理部２２は、音声付き写真２に形成される音声コードイメージのソースとしての音声データが直接、顧客から与えられる場合に利用されるものであり、マイク６ａやカセットプレーヤ６ｂ、カードリーダ７ａ（デジタルボイスレコーダ用メモリカードの使用時）などから入力された音声データは必要に応じて、音声選択部２２ｂによって選択され、音声編集部２２ａによって編集処理が行われる。
【００２１】
テキスト入力処理部２３は、音声付き写真２に形成される音声コードイメージのソースとして顧客がテキストデータを与える場合に利用されるもので、顧客が持参したフロッピディスクに保存されたテキストファイルや電子メールの形で送付されたテキストデータをフロッピドライブ８ｅや通信機器８ｄを通じて取り込んだ後、テキスト編集部２３ａが必要なテキスト編集を施す。また、キーボード８ａを通じて、顧客又はオペレータが直接入力したテキストデータもこのテキスト編集部２３ａによって処理される。さらに、オプションとして、ＯＣＲ機能を持たせるために文字認識部２４を備えることも可能である。ＯＣＲ機能を持たせた場合、顧客が提示したメッセージ文書をフラットベットスキャナ８ｃで読み取らせた後、文字認識部２４によってテキストデータに変換する。つまり、フラットベットスキャナ８ｃと文字認識部２４が文字認識装置を構築している。
【００２２】
テキスト入力処理部２２によって必要な編集処理を施されたテキストデータを合成音声データに変換する音声合成部３０はテキスト音声合成部３１とテキスト解析用辞書３２と合成音声エレメント辞書３３を備えており、テキスト音声合成部３１はテキスト解析用辞書３２を用いて入力テキストデータを解析することでその読みを同定するとともにさらにアクセントと韻律を設定して得られた音韻系列から合成音声エレメント辞書３３を用いて合成音声データを生成する。なお、合成音声エレメント辞書３３のソースとしての音声としては女性の音声又は男性の音声のいずれでもよいが、両方備えて選択するようにすることも可能である。さらには、特定の人物の音声をソースとした数多くの合成音声エレメント辞書３３を用意して、任意に切り換えて利用する構成も可能である。
【００２３】
さらに、音声合成部３０には、上述のように作成された合成音声データの声質を変形させる声質変形部３４も付随しており、この声質変形部３４は入力した音声データに対して、アップ・ダウンサンプリングによる周波数の線形変換や時間軸調整によって、テープレコーダの早回しや遅回しと類似した変形を施して出力するものである。この音声変形部３４は、音声入力処理部２２から送られてくる音声データに対しても音声変形処理を施すことができる。
【００２４】
音声入力処理部２２から送られてきた肉声の音声データや音声合成部３０から送られてきた合成音声データを音声コードイメージに変換する音声コード変換部４０は、波形符号化、分析合成符号化など公知の符号化手法から適当に選ばれたもので構築された音声データ圧縮符号化部４１と、これにより符号化された音声コードデータを二次元のコードイメージに展開する音声コードイメージ生成部４２と、後ほど行われる画像データに基づく画像イメージと音声コードイメージとの音声付き写真におけるレイアウト編集の際に便利なように音声付き写真２に形成される音声コードイメージのサイズ（外形寸法）を算出するプリコードイメージ生成部４３とを備えている。
【００２５】
画像入力処理部２１で編集された画像データは画像イメージとして画像データ格納部５１に、コード変換部２１で変換された音声コードイメージは音声コードイメージ格納部５２に一時的に格納され、画像音声合成処理部６０によって所望のレイアウトでもってプリンタ３によってプリント出力されるようにプリントデータ化される。このため、画像音声合成処理部６０は、画像データ格納部５１に格納された画像イメージと音声コードイメージ格納部５２に格納された音声コードイメージのレイアウト処理を行う画像・音声コードイメージレイアウト編集部６１と、決定されたレイアウトで両イメージを合成してプリントデータを生成する画像・音声コードイメージ合成処理部６２を備えている。このレイアウト編集時には、プリコードイメージ生成部４３で算出された音声コードイメージのサイズに基づくダミーボックスエリアがモニタ４上に表示され、同じく表示されている画像イメージとの位置関係を見比べながらの正確なレイアウト作業を可能にしている。
【００２６】
上述した音声・画像処理装置による音声付き写真２の典型的な作成手順を図４のフローチャートを用いて説明する。ここでは音声付き写真２の注文が電子メールによってなされているとする。
電子メールが到着すると（＃１）、この電子メールの添付ファイルとしての画像データが画像入力処理部２１に入力される（＃１１）と、その画像データは画像編集部２１ａの働きで、モニタ４でその画像イメージを確認しながらオペレータの操作を通じて色調・階調変換、拡大縮小等の編集処理が行われる（＃１２）。入力された画像が複数存在する場合は画像選択部２１ｂによって選択された後この編集処理が行われる。編集処理された画像データは、一旦画像データ格納部５１に格納される（＃１３）。
【００２７】
一方、音声コードイメージのソースとしてのテキストデータを含む電子メールファイルは、テキスト入力処理部２３のテキスト編集部２３ａに送られ（＃１４）、そこで、その電子メールから音声付き写真２に音声コードイメージとして取り込まれるべきメッセージだけを含むテキストデータが切り出される（＃１５）。
【００２８】
漢字仮名混じりテキストとして音声合成部３０に送られてきたテキストデータは、テキスト音声合成部３１によってテキスト解析用辞書３２にアクセスしながら解析され（＃２１）、単語を同定しながら読み、アクセントが付与される（＃２２）。次いで、息継ぎ位置が設定されるとともに文全体のイントネーションが決定され、音素記号と韻律パラメータからなる音韻系列が作り出される（＃２３）。作り出された音韻系列に対して合成音声エレメント辞書３３にアクセスしながら順次合成音声エレメントを接続し、合成音声データを生成する（＃２４）。
【００２９】
この合成音声データに声質変形処理が要求されている場合（＃２５YES 分岐）、声質変形部３４によって周波数線形変換等が施され（＃２６）、要求されていない場合（＃２５NO分岐）、合成音声データはそのままコード変換部４０に送られる。
【００３０】
まず、合成音声データは音声データ圧縮符号化部４１に送られ、圧縮処理が行われ、続いて、音声コードイメージ生成部４２にて、光学的に読取り可能な音声コードイメージに変換される（＃３１）。さらにこの音声コードイメージのサイズ（外形寸法）がプリコードイメージ生成部４３によって算出され（＃３２）、音声コードイメージのデータとともにサイズデータもは音声コードイメージ格納部５２に一旦格納される（＃３３）。
【００３１】
画像データ格納部５１に記憶された画像データと、音声コードイメージ格納部５２に記憶された音声コードイメージは、画像音声合成処理部６０の画像・音声コードイメージレイアウト編集部６１にそれぞれ取り込まれて画像イメージと音声コードイメージのレイアウト編集処理がなされる（＃４０）。実際のレイアウト編集処理ではモニタ４の画面にレイアウト編集画面が表示され、カーソルの指示により画像イメージと音声コードイメージを擬似的に示すダミーボックスエリアのレイアウト編集が行われる。このレイアウト編集は予め選択されたテンプレートを用いて画像イメージと音声コードイメージを自動的に流し込む方法を採用することも可能である。その際、例えば、音声コードイメージの長さが印刷可能長さを越えると、これを２つに分離して２段構成にするなどの再編集が行われる。
【００３２】
画像・音声コードイメージ合成処理部６２は、画像・音声コードイメージレイアウト編集部６１からのレイアウト情報を受け取ると、画像データ格納部５１及び音声コードイメージ格納部５２にそれぞれリクエスト信号を送信し、対応画像データ及び音声コードイメージデータを受け取る。受け取った画像イメージのデータと音声コードイメージのデータはレイアウト情報に基づいて一体化され、プリントデータとして生成される（＃４１）。このプリントデータがプリンタ３に送信されることにより、画像イメージと音声コードイメージが印画紙に露光され、露光印画紙が現像処理されることにより図３で示されるような音声付き写真２が作成される（＃５０）。
【００３３】
〔別実施形態〕
図５で示された本発明の別実施形態の機能ブロック図では、図２で示された先の実施形態のものと比べて、音声合成部３０がテキスト音声合成部３１の代わりに音声編集合成部３５によって構成されている点で異なっている。
【００３４】
語彙・フレーズの限定された肉声の断片から文音声を生成する編集合成と呼ばれるこの方式で合成音声データを生成するためには、予め登録された語彙やフレーズの肉声データを格納している登録音声エレメント辞書３６が必要であり、音声編集合成部３５は、テキスト入力処理部２３から送られてきたテキストデータを断片化し、その断片を登録音声エレメント辞書を用いて音声データの断片で置き換えていく。
【００３５】
この実施形態では、その登録音声エレメント辞書３６に、個人別で登録された肉声データを格納することも可能である。つまり、音声付き画像シートの顧客に対して予め、編集合成のために最低限必要とされる音声エレメントを顧客自身の肉声で登録・格納しておき、音声付き画像シートの注文時には、本人の登録音声エレメントを用いた編集合成で合成音声データが作成される。登録されていない顧客に対しては、標準で用意されている音声エレメントが使用される。
【００３６】
また、この実施形態の音声・画像処理装置は、図６に示すような、証明写真装置やプリクラ装置のようなボックス形の外観を備えており、音声付き写真２を作成しようとする顧客は、料金を投入した後、モニタ４に表示される指示メッセージに従って、備え付けられているデジタルカメラで自分を撮影するとともに、音声メッセージ化したいテキストデータを備え付けられているタッチパネル式キーボード８ａを使って入力するか、又はマイク６ａを通じて肉声で入力する。また、プリント部３として昇華型の熱転写プリンタが採用されている。
【００３７】
この別実施形態の音声・画像処理装置による音声付き写真２の典型的な作成手順を図７のフローチャートを用いて説明する。ここでは音声付き写真２のための画像ソースはデジタルカメラの撮像画像データであり、その音声ソースは備え付けのキーボード８ａから直接入力されたテキストデータとする。
【００３８】
音声付き写真２の作成を希望する顧客は、指定された硬貨を硬貨投入口に入れることにより（＃１０１）モニタ４に表示されるメニュに従って、まず装置に備えられたデジタルカメラで証明写真装置やプリクラ装置と同様な手順で自分を撮影する（＃１１０）。このデジタルカメラはＩ／Ｏインタフェース１０と直接接続されているので、デジタルカメラによって取得された画像データは直ちに画像入力処理部２１に転送される（＃１１１）。画像入力処理部２１に転送された画像データは画像編集部２１ａの働きで、モニタ４でその画像イメージを確認しながらトリミングや拡大縮小等の編集処理を行うことができる（＃１１２）。編集処理された画像データは、一旦画像データ格納部５１に格納される（＃１１３）。
【００３９】
続いて、今回、音声コードイメージ化するためのソースデータとしてキーボード入力によるテキストデータを選択しているので、音声付き写真に組み込みたい音声メッセージを文としてキーボード８ａから入力する（＃１１４）。テキスト編集部２３ａはテキストエディタとしての機能を有するので、キーボード８ａを通じて入力されたデータから文章を作成し、最終的にこのテキストデータを編集合成に適したフォーマットに変換して音声合成部３０に送り出す（＃１１５）。
【００４０】
編集合成プロセスでは、まず、この顧客が予め音声登録しているかどうかをチェックする（＃１２１）。音声登録している場合、その顧客の登録音声エレメントファイルがロードされる（＃１２２）。この登録音声エレメントファイルのロードに関して種々の形態があるが、ここでは代表的な２つの形態を紹介する。
【００４１】
第１のものは、顧客が、予め音声エレメント登録装置によって、必要な語彙・フレーズを肉声で登録し、その登録された語彙・フレーズを編集合成に適したフォマットでファイル化することによって得られた音声エレメントファイルをメモリカードに記録しておく形態である。音声登録しているかどうかのチェック段階でカードリーダ７ａに該当メモリカードを挿入することにより、登録音声エレメントファイルが音声合成部３０の登録音声エレメント辞書３６にロードされる。第２のものは、予め音声エレメント登録装置によって作成された音声エレメントファイルを顧客ＩＤをキーとして登録音声エレメント辞書３６に格納しておく形態であり、音声登録しているかどうかのチェック段階で顧客ＩＤを入力することにより、この顧客の登録音声エレメントファイルが以後の編集合成作業における登録音声エレメント辞書３６として使用されるように設定される。登録音声エレメント辞書３６は、この音声・画像処理装置に内蔵されるのではなく、通信回線でつながったサーバ内に設けられることが望ましい。つまり、顧客ＩＤを入力すると、通信回線を通じて該当顧客の登録音声エレメントファイルが音声合成部３０の登録音声エレメント辞書３６にロードされる構成とするのである。
【００４２】
音声登録していない場合、登録音声エレメント辞書３６に格納されている標準音声エレメントファイルが以後の編集合成作業における登録音声エレメント辞書３６として使用されるように設定される。（＃１２３）。
【００４３】
いずれにしても、編集合成プロセスでは、まず処理すべきテキストデータで表されいるメッセージ文を語彙・フレーズに分解し（＃１２４）、それぞれに、登録音声エレメント辞書３６としての音声エレメントファイルから抽出された断片的な音声エレメントを割り当て、合成音声データを生成する（＃１２５）。
【００４４】
この合成音声データに声質変形処理が要求されている場合（＃２５YES 分岐）、声質変形部３４によって周波数線形変換等が施され（＃２６）、要求されていない場合（＃２５NO分岐）、合成音声データはそのままコード変換部４０に送られ、以下＃３１〜＃３３で前述したように合成音声データの音声コードイメージ化が行われ、生成された音声コードイメージは音声コードイメージ格納部５２に一旦格納される。
【００４５】
画像データ格納部５１に記憶された画像データと、音声コードイメージ格納部５２に記憶された音声コードイメージは、予め選択されたテンプレートを用いて画像・音声コードイメージレイアウト編集部６１によってレイアウト編集処理がなされる（＃４０）。
【００４６】
画像・音声コードイメージ合成処理部６２は、画像イメージのデータと音声コードイメージのデータをレイアウト情報に基づいて一体化し、プリントデータを生成する（＃４１）。このプリントデータがプリンタ３に送信されることにより、画像イメージと音声コードイメージが専用シートにプリントされ、図３で示されるような音声付き写真２として、装置前面に設けられたプリント取り出し口に排出される（＃５０）。
【００４７】
上述した実施の形態では、画像データと音声コードイメージは画像・音声合成処理部６０によって合成されていたが、画像・音声合成処理部６０を省略して、このプリンタ３によってプリント出力されていたが、画像データと音声コードイメージを別々のプリンタでプリント出力してもよい。その際、音声コードイメージのプリント出力にシールプリンタで、音声コードイメージを形成したシールを画像を形成したシート、例えば写真プリントに貼り付けるように構成するとよい。
【００４８】
さらに上述した全ての実施の形態では、入力されたテキストデータは、いったん音声合成部３０で合成音声データ化され、この合成音声データが音声コードイメージに変換されていたが、テキスト入力処理部２３で処理されたテキストデータを直接音声コードイメージに変換することも可能である。そのような音声・画像処理装置は、図８で示すように、音声合成部３０が省略された代わりに、コード変換部４０に、テキストデータを所定の要素に断片化して得られたテキストエレメントに順次対応する音声コードイメージを割り当てていくテキスト／音声コードイメージ置換部４４と、テキストエレメントに対応する音声コードイメージを登録した音声コードイメージ辞書４５を備えている。つまり、テキストを構成する語彙やフレーズに対応する音声コードイメージを当てはめながら順次つなぎ合わせていくことにより最終的な音声コードイメージを作り出すのである。
【図面の簡単な説明】
【図１】本発明による音声・画像処理装置の１つの実施形態を示す外観図
【図２】図１による音声・画像処理装置の機能ブロック図
【図３】音声・画像処理装置によって作成された音声付き写真から音声を再生する様子を示す説明図
【図４】図２に示された音声・画像処理装置を用いた音声付き写真の作成手順を示すフローチャート
【図５】本発明による音声・画像処理装置の別実施形態を示す機能ブロック図
【図６】図５による音声・画像処理装置の外観図
【図７】図５に示された音声・画像処理装置を用いた音声付き写真の作成手順を示すフローチャート
【図８】本発明による音声・画像処理装置のさらに別な実施形態を示す機能ブロック図
【符号の説明】
２音声付き画像シート（音声付き写真）
３プリント部（銀塩写真プリンタ、昇華型熱転写プリンタ）
２１画像入力部
２２音声入力部
２３テキスト入力処理部
２４文字認識部
３０音声合成部
３１テキスト音声合成部
３２テキスト解析用辞書
３３合成音声エレメント辞書
３４声質変形部
３５音声編集合成部
３６登録音声エレメント辞書
６０画像音声合成処理部
[0001]
[Technical field of the invention]
The present invention relates to an audio/image processing apparatus comprising a code conversion unit that converts audio data into an audio code image encoded so as to be optically readable, and a print unit that prints the audio code image and an image based on image data in order to create an image sheet with audio.
[0002]
[Prior art]
In recent years, with the arrival of the multimedia era, there have been active attempts to convey information not only visually but also through hearing. Image sheets with audio, and photos with audio in particular, are one such attempt. For example, Japanese Patent Application Laid-Open No. 6-231466 and Japanese Patent Application Laid-Open No. 7-181606 disclose an image sheet with audio in which, in addition to figures, photographs, and characters, a dot code (audio code image) obtained by converting audio into an optically readable form is printed on the same sheet, and the audio is played back by a dedicated scanner that reads the dot code. Such image sheets with audio are used as language teaching materials for repeated pronunciation practice, photographic field guides that record animal calls, picture books that produce sound, and photographs that record commemorative events such as weddings, coming-of-age ceremonies, and Shichi-Go-San together with the sounds of the event.
[0003]
[Problems to be solved by the invention]
Recently, cards made by adding an audio-coded message to a suitable photo have been attracting attention as a new kind of message card. When a customer asks a DP shop or the like to create such a message card, however, the customer must not only hand over the photographic film or digital camera recording medium holding the image to be printed but also record a voice message through a microphone at the store. Regardless of the content of the message, most people find this rather embarrassing, and quite a few hesitate to order a message card for this reason. To avoid recording at the store, the customer may bring a cassette tape or MD recorded at home in advance, but the recording is often played back at the store for confirmation anyway, and recording at home each time for a short message is a nuisance.
In view of the above situation, an object of the present invention is to improve an input method of audio data when requesting creation of an image sheet with audio.
[0004]
[Means for Solving the Problems]
To solve the above problem, in an audio/image processing apparatus comprising a code conversion unit that converts audio data into an audio code image encoded so as to be optically readable, and a print unit that prints the audio code image and an image based on image data in order to create an image sheet with audio, the present invention is characterized in that the apparatus includes a text input processing unit that processes input text data and a speech synthesis unit that generates synthesized speech data based on the text data processed by the text input processing unit, and in that the code conversion unit uses the synthesized speech data generated by the speech synthesis unit as the source audio data for the audio code image, thereby generating an audio code image that reproduces the text data as speech.
[0005]
In this configuration, the source data for the audio code image needed to create the image sheet with audio is synthesized speech data produced by speech synthesis from text data, so the customer does not need to record the message in his or her own voice. The text data may be entered, for example, by typing the message directly on a keyboard connected to the audio/image processing apparatus, or a message prepared in advance with a word processor or the like may be recorded on a recording medium such as a floppy disk and brought to the store. The message may also be sent to the store by e-mail. If the image data for the image sheet with audio is sent as an attached file at the same time, the customer does not need to visit the store to place the order.
[0008]
As one example of the speech synthesis unit, a preferred embodiment of the present invention includes a text-to-speech synthesis unit that analyzes the input text data using a text analysis dictionary to identify its reading, sets accent and prosody to obtain a phoneme sequence, and generates synthesized speech data from that phoneme sequence using a synthesized speech element dictionary. In this configuration, text data entered as mixed kana and kanji from the keyboard, or read from a text document saved on a recording medium or sent by e-mail, is given readings and phrase accents using the text analysis dictionary. The power and fundamental frequency of the speech are then adjusted in a prosody parameter editing step that accesses the synthesized speech element dictionary, which allows the tone of voice to be set to some extent. Accordingly, the final synthesized speech data can be created with a female or male tone, or an angry or joyful tone, selected according to the customer's request. More preferably, if the synthesized speech elements stored in the synthesized speech element dictionary are created from real voice data registered for each individual, the synthesized speech can be made a familiar voice resembling the customer's own.
[0009]
Because the text-to-speech synthesis unit described above requires advanced technology and imposes a heavy burden in terms of both equipment and operation, another embodiment of the present invention provides, as a simpler alternative, a speech editing synthesis unit that converts the input text data piece by piece into synthesized speech data using a registered speech element dictionary storing real voice data for pre-registered vocabulary and phrases. This method, called editing synthesis, generates sentence speech from real voice fragments of a limited set of vocabulary and phrases. Because the synthesized speech data is generated by replacing fragments of the text data with fragments of audio data from the registered speech element dictionary, processing is fast and the cost of the apparatus can be kept low.
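The fragment replacement at the heart of editing synthesis can be illustrated with a minimal sketch. The dictionary contents, the whitespace-based fragmentation, the inserted pause, and the names `REGISTERED_ELEMENTS`, `PAUSE`, and `edit_synthesize` are all assumptions made for illustration; the specification does not define how text is fragmented or how audio fragments are joined, and short integer lists stand in here for recorded waveforms.

```python
# Hypothetical registered speech element dictionary: phrase -> audio samples.
# Short integer lists stand in for recorded real-voice waveforms.
REGISTERED_ELEMENTS = {
    "happy": [1, 2],
    "birthday": [3, 4, 5],
    "mom": [6],
}
PAUSE = [0]  # assumed silent gap inserted between fragments


def edit_synthesize(text, elements=REGISTERED_ELEMENTS, pause=PAUSE):
    """Replace each text fragment with its registered audio fragment."""
    samples = []
    for word in text.lower().split():
        if word not in elements:
            # Editing synthesis only covers the registered vocabulary.
            raise KeyError(f"no registered speech element for {word!r}")
        if samples:
            samples.extend(pause)
        samples.extend(elements[word])
    return samples


speech = edit_synthesize("Happy birthday Mom")
```

Because each step is a dictionary lookup followed by concatenation, the cost grows only with the length of the message, which is consistent with the speed and cost advantage claimed for this method.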
[0010]
So that the synthesized speech data obtained by such editing synthesis, which is of lower quality than a real voice, can be heard with a greater sense of familiarity, the present invention proposes storing real voice data registered for each individual in the registered speech element dictionary. That is, the minimum speech elements needed for editing synthesis are registered in advance in the customer's own voice. When ordering an image sheet with audio, the customer submits text data as the source of the voice message together with suitable image data, and synthesized speech data is created by editing synthesis using the customer's registered speech elements. Even if the speech does not flow fluently, it is reproduced as a familiar voice because fragments of the customer's own voice are used.
[0011]
Furthermore, in a preferred embodiment of the present invention, if the speech synthesis unit includes a voice quality transformation unit that transforms the voice quality of the synthesized speech data, voices that do not exist in the real world can be created, which suits purposes such as exchanging image sheets with audio for fun. Such voice quality transformation can be performed simply, for example, by linearly converting the frequency of the audio data. If the transformation parameters are managed for each customer, a customer can keep a set of parameters with unique voice characteristics for his or her own exclusive use.
[0012]
As a measure for customers who dislike typing the message directly on a keyboard, one preferred embodiment of the present invention additionally includes a character recognition device, and the text data output by the character recognition device is used for conversion into the audio code image. Here, character recognition device is a general term for devices that convert characters handwritten on paper into text data by OCR or that convert characters written with a designated pen on a touch panel into text data. This configuration further simplifies the input of audio data when ordering an image sheet with audio and also diversifies the forms of input.
[0013]
As is clear from the above description, an important feature of the present invention is that text data is converted into speech. The text data handled in the present invention should be interpreted in a broad sense, as typified by collections of characters, numerals, and symbols as printed text, data read by a scanner from printed matter and the like, collections of characters, numerals, and symbols as digitized text, and groups of character codes entered sequentially through an input device. It includes all data that is recognized in some form as character information in computer media conversion technology.
Other features and advantages of the present invention will become apparent from the following description of embodiments with reference to the drawings.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
One embodiment of an audio/image processing apparatus according to the present invention for creating an image sheet with audio is shown in the external view of FIG. 1 and the functional block diagram of FIG. 2. The core of this audio/image processing apparatus is a general-purpose computer 1, in which the various functions shown in FIG. 2 that are required for creating an image sheet with audio are implemented by hardware and software. The apparatus is of a type installed at a store such as a DP shop to provide a service of creating photos with audio at the customer's request.
[0015]
Various input devices and output devices are connected to the computer 1 via an I/O interface unit 10. The output devices include a silver halide photographic printer 3 (a printer used for printing silver halide photographic film also serves this purpose) as the print unit that finally outputs a photo with audio 2 as the image sheet with audio, a monitor 4 for checking images during work, and a speaker 5 for checking the input audio data. The input devices include a microphone 6a and a cassette player 6b for inputting audio directly to the computer 1, a card reader 7a for importing images captured by a digital camera, and a film scanner 7b for importing images captured on silver halide film. The devices for inputting text data to the computer include a keyboard 8a, a flatbed scanner 8c that reads handwritten or printed characters, and a communication device 8d that receives text data sent over the Internet.
[0016]
A floppy drive 8e and an MO drive 8f, which are commonly used for the input and output of audio data and image data, are also built into the computer 1. The keyboard 8a is used together with a mouse 8b to issue commands to the functions shown in FIG. 2, and the communication device 8d can of course receive image data as well as text data.
[0017]
How the photo with audio 2 is created from the input image data and audio data is described in detail later. In the photo with audio 2 output from the silver halide photographic printer 3, as shown in FIG. 3, when the audio code image area 2b arranged around the photographic image area 2a is scanned with a dedicated reading scanner 90, the audio reproduction circuit built into the reading scanner 90 outputs an audio signal corresponding to the audio code image, and the sound can be heard through, for example, an earphone 91.
[0018]
As can be understood from FIG. 2, the main functional units of this audio/image processing apparatus are: an image input processing unit 21 that receives image data as the source of the photographic image in the photo with audio 2; an audio input processing unit 22 that directly receives, from outside, audio data as the source of the audio code image in the photo with audio 2; a text input processing unit 23 that receives text data to serve as the source of the audio data to be converted into the audio code image; a speech synthesis unit 30 that generates synthesized speech data based on the input text data; a code conversion unit 40 that converts audio data into an audio code image encoded so as to be optically readable; an image data storage unit 51; an audio code image storage unit 52; and an image/audio synthesis processing unit 60 that generates print data for the photo with audio 2 from the suitably processed image data and audio code image.
[0019]
The image input processing unit 21 includes an image editing unit 21a and an image selection unit 21b. Image data input from the card reader 7a, the film scanner 7b, the communication device 8d, the floppy drive 8e, the MO drive 8f, or the like is selected as necessary by the image selection unit 21b, and the image editing unit 21a performs editing such as color tone correction and resolution conversion on the selected image data.
[0020]
The audio input processing unit 22 is used when the audio data serving as the source of the audio code image formed on the photo with audio 2 is supplied directly by the customer. Audio data input from the microphone 6a, the cassette player 6b, the card reader 7a (when a memory card for a digital voice recorder is used), or the like is selected as necessary by an audio selection unit 22b and edited by an audio editing unit 22a.
[0021]
The text input processing unit 23 is used when the customer supplies text data as the source of the audio code image formed on the photo with audio 2. A text file stored on a floppy disk brought by the customer, or text data sent by e-mail, is taken in through the floppy drive 8e or the communication device 8d, after which a text editing unit 23a performs any necessary text editing. Text data entered directly by the customer or an operator through the keyboard 8a is also processed by the text editing unit 23a. As an option, a character recognition unit 24 can be provided to add an OCR function. In that case, a message document presented by the customer is read by the flatbed scanner 8c and then converted into text data by the character recognition unit 24. That is, the flatbed scanner 8c and the character recognition unit 24 together constitute a character recognition device.
[0022]
The speech synthesis unit 30, which converts the text data edited as necessary by the text input processing unit 23 into synthesized speech data, includes a text-to-speech synthesis unit 31, a text analysis dictionary 32, and a synthesized speech element dictionary 33. The text-to-speech synthesis unit 31 analyzes the input text data using the text analysis dictionary 32 to identify its reading, sets accent and prosody to obtain a phoneme sequence, and generates synthesized speech data from that phoneme sequence using the synthesized speech element dictionary 33. The source voice for the synthesized speech element dictionary 33 may be either a female voice or a male voice, and both may be provided so that one can be selected. It is also possible to prepare a number of synthesized speech element dictionaries 33, each based on the voice of a specific person, and to switch among them as desired.
[0023]
The speech synthesis unit 30 is also accompanied by a voice quality transformation unit 34 that transforms the voice quality of the synthesized speech data created as described above. The voice quality transformation unit 34 applies to the input audio data a transformation similar to fast or slow playback on a tape recorder, by linear frequency conversion through up-sampling or down-sampling and by time axis adjustment, and outputs the result. The voice quality transformation unit 34 can also apply this transformation to audio data sent from the audio input processing unit 22.
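The tape-recorder-like effect of up-sampling or down-sampling can be sketched as follows. This is only an illustrative sketch: the function name `resample_linear`, the use of linear interpolation, and the single `factor` parameter are assumptions, since the specification does not state how the resampling is carried out.

```python
def resample_linear(samples, factor):
    """Resample a list of audio samples by linear interpolation.

    With factor > 1 the result is shorter; played back at the original
    sampling rate it sounds faster and higher, like fast tape playback.
    With factor < 1 the result is longer and sounds slower and lower.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out = []
    last = len(samples) - 1
    pos = 0.0
    while pos <= last:
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        # Interpolate between the two neighbouring input samples.
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
        pos += factor
    return out


faster = resample_linear([0, 10, 20, 30, 40], 2)
slower = resample_linear([0, 10, 20, 30, 40], 0.5)
```

Under this sketch, the per-customer transformation parameter mentioned in paragraph [0011] could be as simple as a stored value of `factor`.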
[0024]
The code conversion unit 40, which converts real voice audio data sent from the audio input processing unit 22 and synthesized speech data sent from the speech synthesis unit 30 into an audio code image, includes: an audio data compression encoding unit 41 built on a method suitably chosen from known encoding methods such as waveform coding and analysis-synthesis coding; an audio code image generation unit 42 that expands the audio code data encoded by that unit into a two-dimensional code image; and a precode image generation unit 43 that calculates the size (outer dimensions) of the audio code image to be formed on the photo with audio 2, for convenience in the later layout editing of the image based on the image data and the audio code image on the photo with audio.
[0025]
The image data edited by the image input processing unit 21 is temporarily stored as an image in the image data storage unit 51, and the audio code image produced by the code conversion unit 40 is temporarily stored in the audio code image storage unit 52. The image/audio synthesis processing unit 60 then converts them into print data so that they are printed by the printer 3 in the desired layout. For this purpose, the image/audio synthesis processing unit 60 includes an image/audio code image layout editing unit 61 that lays out the image stored in the image data storage unit 51 and the audio code image stored in the audio code image storage unit 52, and an image/audio code image synthesis processing unit 62 that combines the two in the determined layout to generate print data. During layout editing, a dummy box area based on the size of the audio code image calculated by the precode image generation unit 43 is displayed on the monitor 4, so that the layout can be worked out accurately while comparing its position with that of the image displayed alongside it.
[0026]
A typical procedure for creating the photo with audio 2 with the audio/image processing apparatus described above is explained with reference to the flowchart of FIG. 4. Here it is assumed that the order for the photo with audio 2 has been placed by e-mail.
When the e-mail arrives (#1), the image data attached to the e-mail is input to the image input processing unit 21 (#11). Through the image editing unit 21a, the operator then performs editing such as color tone and gradation conversion and enlargement or reduction while checking the image on the monitor 4 (#12). When more than one image has been input, this editing is performed after the image selection unit 21b has selected an image. The edited image data is temporarily stored in the image data storage unit 51 (#13).
[0027]
Meanwhile, the e-mail file containing the text data that serves as the source of the voice code image is sent to the text editing unit 23a of the text input processing unit 23 (#14), where the text data containing only the message to be incorporated is extracted (#15).
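The separation of an e-mail order into image data and message text (#1, #11, #14, #15) can be illustrated with a minimal sketch. It assumes the order arrives as a standard MIME message with a plain-text body and image attachments; the function names and the sample order are hypothetical, and the parsing relies only on Python's standard `email` package.

```python
import email
from email import policy
from email.message import EmailMessage


def split_order_email(raw_bytes):
    """Return (message_text, [image_bytes, ...]) from a raw e-mail order."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # The plain-text body is treated as the message to be voiced (assumption).
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    # Every image attachment is treated as candidate image data.
    images = [
        part.get_content()
        for part in msg.iter_attachments()
        if part.get_content_maintype() == "image"
    ]
    return text, images


def build_sample_order():
    """Build a small sample order (hypothetical content, not a real JPEG)."""
    msg = EmailMessage()
    msg["Subject"] = "Photo with sound order"
    msg.set_content("Happy birthday!\n")
    msg.add_attachment(b"\xff\xd8fake-jpeg", maintype="image",
                       subtype="jpeg", filename="photo.jpg")
    return msg.as_bytes()
```

Calling `split_order_email(build_sample_order())` yields the message text and a list holding the attached image bytes, which correspond to the inputs of the text input processing unit 23 and the image input processing unit 21 respectively.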
[0028]
The text data, sent to the speech synthesis unit 30 as text containing kanji characters, is analyzed by the text-to-speech synthesis unit 31 with reference to the text analysis dictionary 32 (#21); the words are identified, and their readings and accents are assigned (#22). Next, the breath-pause positions are set and the intonation of the whole sentence is determined, producing a phoneme sequence made up of phoneme symbols and prosodic parameters (#23). Synthesized speech elements are then retrieved from the synthesized speech element dictionary 33 and concatenated in order according to the phoneme sequence to generate the synthesized speech data (#24).
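The flow of steps #21 to #24 can be sketched in toy form as follows. This is only an illustration under strong assumptions: the two dictionaries, their contents, whitespace word splitting, and the single "gain" value standing in for the prosodic parameters are all hypothetical, and a practical synthesizer would use pitch and duration as well.

```python
# Hypothetical text analysis dictionary: word -> (phoneme symbols, accent index)
TEXT_ANALYSIS_DICT = {
    "good": (["g", "u", "d"], 1),
    "day": (["d", "ei"], 1),
}

# Hypothetical synthesized speech element dictionary: phoneme -> sample values
ELEMENT_DICT = {
    "g": [1, 2], "u": [3, 4, 5], "d": [6], "ei": [7, 8, 9], "_": [0, 0],
}


def analyze(text):
    """Steps #21-#22: identify words and look up reading and accent."""
    return [(w, *TEXT_ANALYSIS_DICT[w]) for w in text.lower().split()]


def to_phoneme_sequence(words):
    """Step #23: insert pauses and attach a prosodic parameter to each phoneme."""
    seq = []
    for i, (_, phonemes, accent) in enumerate(words):
        if i > 0:
            seq.append({"phoneme": "_", "gain": 0.0})  # pause between words
        for j, p in enumerate(phonemes):
            seq.append({"phoneme": p, "gain": 1.5 if j == accent else 1.0})
    return seq


def synthesize(seq):
    """Step #24: concatenate speech elements, scaled by the prosodic gain."""
    samples = []
    for item in seq:
        samples.extend(s * item["gain"] for s in ELEMENT_DICT[item["phoneme"]])
    return samples
```

For example, `synthesize(to_phoneme_sequence(analyze("Good day")))` returns a flat list of sample values in which the accented phonemes are emphasized and a short pause separates the two words.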
[0029]
If voice quality modification is required for this synthesized speech data (#25, YES branch), the voice quality transformation unit 34 applies a linear frequency transformation or the like (#26); if it is not required (#25, NO branch), the synthesized speech data is sent to the code conversion unit 40 as is.
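The text does not specify how the linear frequency transformation is carried out. As one possible reading, the sketch below scales every frequency component by a constant factor by resampling the waveform with linear interpolation. The function names are hypothetical, and this simple method also changes the duration of the speech, which a practical voice quality transformation would normally compensate for.

```python
def scale_frequency(samples, factor):
    """Resample so that every frequency is multiplied by `factor` (assumed method)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    n_out = int((len(samples) - 1) / factor) + 1
    out = []
    for i in range(n_out):
        pos = i * factor                       # read position in the input
        lo = min(int(pos), len(samples) - 2)   # left neighbor index
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


def maybe_transform(samples, factor=None):
    """Step #25 branch: transform only when a factor is requested (#26)."""
    return scale_frequency(samples, factor) if factor else list(samples)
```

A factor above 1 raises the pitch and shortens the data, a factor below 1 lowers the pitch and lengthens it, and omitting the factor passes the data through unchanged, mirroring the NO branch of #25.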
[0030]
First, the synthesized speech data is sent to the speech data compression encoding unit 41 and compressed, and is then converted into an optically readable voice code image by the voice code image generation unit 42 (#31). The precode image generation unit 43 then calculates the size (outer dimensions) of the voice code image (#32), and the size data is temporarily stored together with the voice code image data in the voice code image storage unit 52 (#33).
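Steps #31 and #32 can be illustrated with a minimal sketch. The actual dot code format is not described in this section, so the layout used here (bits written column by column into a strip of fixed height) is purely hypothetical, as are the constants `ROWS` and `DOT_PITCH_MM` and all function names. Compression uses zlib from the Python standard library as a stand-in for the unspecified speech compression.

```python
import zlib

ROWS = 16           # hypothetical number of dots per column of the code strip
DOT_PITCH_MM = 0.1  # hypothetical dot spacing in millimeters


def compress(speech_bytes):
    """Stand-in for the speech data compression encoding unit 41."""
    return zlib.compress(speech_bytes, 9)


def to_code_image(payload, rows=ROWS):
    """Lay the payload bits into columns of `rows` dots (1 = dot, 0 = blank)."""
    bits = [(byte >> (7 - k)) & 1 for byte in payload for k in range(8)]
    bits += [0] * (-len(bits) % rows)  # pad the last column
    n_cols = len(bits) // rows
    return [[bits[c * rows + r] for c in range(n_cols)] for r in range(rows)]


def outer_size_mm(image, pitch=DOT_PITCH_MM):
    """Step #32: outer dimensions (width, height) of the code image."""
    return len(image[0]) * pitch, len(image) * pitch


def from_code_image(image, n_bytes):
    """Inverse of to_code_image, used here only to check the round trip."""
    rows, cols = len(image), len(image[0])
    bits = [image[r][c] for c in range(cols) for r in range(rows)]
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for k in range(8):
            value = (value << 1) | bits[i * 8 + k]
        out.append(value)
    return bytes(out)
```

The size returned by `outer_size_mm` is the kind of value that would be stored alongside the code image in #33 and later used to draw the dummy box during layout editing.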
[0031]
The image data stored in the image data storage unit 51 and the voice code image stored in the voice code image storage unit 52 are each read into the image/voice code image layout editing unit 61 of the image/speech synthesis processing unit 60, where the layout of the image and the voice code image is edited (#40). In practice, a layout editing screen is displayed on the monitor 4, and dummy box areas standing in for the image and the voice code image are positioned according to cursor instructions. Alternatively, the layout can be produced by automatically flowing the image and the voice code image into a preselected template. If, for example, the voice code image is longer than the printable length, it is re-edited by splitting it into two parts arranged in two rows.
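The re-editing of an overlong voice code image can be sketched as follows. The sketch assumes the code image is held as a list of rows and may be cut between any two columns, which the text does not confirm; it also generalizes the two-row case to any number of rows. Both function names are hypothetical.

```python
def split_into_stages(code_image, printable_cols):
    """Split a code image into stacked stages no wider than printable_cols."""
    if printable_cols <= 0:
        raise ValueError("printable_cols must be positive")
    total = len(code_image[0]) if code_image else 0
    if total <= printable_cols:
        return [code_image]
    n_stages = -(-total // printable_cols)  # ceiling division
    per_stage = -(-total // n_stages)       # balance the stage widths
    return [
        [row[start:start + per_stage] for row in code_image]
        for start in range(0, total, per_stage)
    ]


def dummy_box(stages):
    """Width and height of the dummy box that encloses all stacked stages."""
    width = max(len(stage[0]) for stage in stages)
    height = sum(len(stage) for stage in stages)
    return width, height
```

A code image 10 columns wide with a printable width of 6 columns is split into two stages of 5 columns each, and the dummy box shown on the layout screen grows in height accordingly.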
[0032]
On receiving the layout information from the image/voice code image layout editing unit 61, the image/voice code image synthesis processing unit 62 sends a request signal to the image data storage unit 51 and to the voice code image storage unit 52 and receives the image data and the voice code image data. The received image data and voice code image data are combined based on the layout information to generate print data (#41). When this print data is transmitted to the printer 3, the image and the voice code image are exposed onto photographic paper, and the exposed paper is developed to produce the photograph with sound 2 shown in the drawings (#50).
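The integration of the two images according to the layout information (#41) can be pictured with a small sketch. It assumes both images are plain grids of pixel values and that the layout information reduces to a left/top offset per image; the function name and the data layout are hypothetical.

```python
def compose(canvas_size, placements, background=255):
    """Paste each (pixels, left, top) placement onto a blank canvas."""
    width, height = canvas_size
    canvas = [[background] * width for _ in range(height)]
    for pixels, left, top in placements:
        for dy, row in enumerate(pixels):
            for dx, value in enumerate(row):
                x, y = left + dx, top + dy
                # Pixels falling outside the printable area are dropped.
                if 0 <= x < width and 0 <= y < height:
                    canvas[y][x] = value
    return canvas
```

For instance, placing a photo at the top left and a code strip along the bottom edge produces a single grid that plays the role of the print data sent to the printer 3.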
[0033]
[Another embodiment]
In the functional block diagram of another embodiment of the present invention shown in FIG. 5, the speech synthesis unit 30 differs from that of the previous embodiment in that it comprises a speech editing/synthesizing unit 35 in place of the text-to-speech synthesis unit 31.
[0034]
This method, called editing synthesis, generates sentence speech from a limited set of vocabulary and phrase fragments. To generate synthesized speech data with it, a registered speech element dictionary 36 that stores pre-registered speech data for vocabulary items and phrases is required. The speech editing/synthesizing unit 35 splits the text data sent from the text input processing unit 23 into fragments and replaces each fragment with the corresponding fragment of speech data from the registered speech element dictionary 36.
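Editing synthesis as described here can be sketched with a greedy longest-match lookup. This is an assumption, since the text does not say how the fragments are chosen; splitting on whitespace is also a simplification that would not apply directly to Japanese text. The dictionary contents and function names are hypothetical.

```python
def fragment(text, dictionary):
    """Split text into the longest registered vocabulary items or phrases."""
    words = text.lower().split()
    fragments, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in dictionary:
                fragments.append(phrase)
                i = j
                break
        else:
            raise KeyError(f"no registered speech element for {words[i]!r}")
    return fragments


def edit_synthesize(text, dictionary):
    """Replace each text fragment with its speech data and concatenate."""
    samples = []
    for phrase in fragment(text, dictionary):
        samples.extend(dictionary[phrase])
    return samples
```

With a dictionary that registers both "happy" and "happy birthday", the phrase entry is preferred, so a recorded phrase is reused whole rather than being reassembled from single words.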
[0035]
In this embodiment, the registered speech element dictionary 36 can store real voice data registered for each individual. That is, a customer of the image sheet with sound registers in advance, in his or her own voice, the speech elements needed for editing synthesis; when that customer orders an image sheet with sound, the synthesized speech data is created by editing synthesis using his or her registered speech elements. For customers who have not registered, standard speech elements are used.
[0036]
The sound/image processing apparatus of this embodiment has a box-shaped exterior like that of an ID photo machine or a photo booth, as shown in the drawings. After inserting the fee, the customer follows the instruction messages displayed on the monitor 4, takes a picture with the built-in digital camera, and then either enters the text data to be converted into a voice message using the built-in touch-panel keyboard 8a or inputs speech through the microphone 6a. A sublimation thermal transfer printer is used as the printing unit 3.
[0037]
A typical procedure for creating the photograph with sound 2 using the sound/image processing apparatus of this other embodiment will be described with reference to the accompanying flowchart. Here, the image source for the photograph with sound 2 is image data captured by the digital camera, and the sound source is text data entered directly from the built-in keyboard 8a.
[0038]
A customer who wishes to create a photograph with sound 2 inserts the designated coins into the coin slot (#101) and, following the menu displayed on the monitor 4, first photographs the subject in the same way as with an ID photo machine or a photo booth (#110). Because the digital camera is connected directly to the I/O interface 10, the captured image data is transferred immediately to the image input processing unit 21 (#111). Using the image editing unit 21a, the transferred image data can be edited, for example trimmed or enlarged/reduced, while the image is checked on the monitor 4 (#112). The edited image data is temporarily stored in the image data storage unit 51 (#113).
[0039]
Next, since keyboard-entered text data has been selected here as the source data for the voice code image, the voice message to be incorporated into the photograph with sound is typed in as sentences on the keyboard 8a (#114). The text editing unit 23a functions as a text editor: it builds the text from the keyboard input and finally converts the text data into a format suitable for editing synthesis before sending it to the speech synthesis unit 30 (#115).
[0040]
In the editing synthesis process, it is first checked whether the customer has registered his or her voice in advance (#121). If so, the customer's registered speech element file is loaded (#122). The registered speech element file can be loaded in various ways; two typical ones are introduced here.
[0041]
In the first mode, the customer registers the necessary vocabulary and phrases in his or her own voice in advance using a speech element registration device, and the resulting speech element file, in a format suitable for editing synthesis, is recorded on a memory card. When the customer inserts the memory card into the card reader 7a at the stage where voice registration is checked, the registered speech element file is loaded into the registered speech element dictionary 36 of the speech synthesis unit 30. In the second mode, the speech element file created in advance with the speech element registration device is stored in the registered speech element dictionary 36 with the customer ID as a key. When the customer ID is entered, the apparatus is set so that the customer's registered speech element file is used as the registered speech element dictionary 36 in the subsequent editing synthesis. In this mode, the registered speech element files are preferably held not in the sound/image processing apparatus itself but on a server connected via a communication line; when the customer ID is entered, the corresponding customer's registered speech element file is loaded into the registered speech element dictionary 36 of the speech synthesis unit 30 over the communication line.
[0042]
If the voice is not registered, the standard speech element file stored in the registered speech element dictionary 36 is set to be used as the registered speech element dictionary 36 in the subsequent editing synthesis (#123).
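The branch in steps #121 to #123 amounts to a keyed lookup with a fallback. A minimal sketch, assuming the registered files are held in a mapping keyed by customer ID (the second mode above); the function and parameter names are hypothetical.

```python
def select_element_dictionary(customer_id, registered_files, standard_file):
    """Pick the customer's speech element file if registered, else the standard one."""
    if customer_id is not None and customer_id in registered_files:
        return registered_files[customer_id]  # step #122
    return standard_file                      # step #123
```

The same function covers the case where no customer ID is entered at all, which is treated here like an unregistered customer.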
[0043]
In either case, the editing synthesis process first decomposes the message represented by the text data into vocabulary items and phrases (#124), then assigns to each the corresponding speech element extracted from the speech element file serving as the registered speech element dictionary 36, and generates the synthesized speech data (#125).
[0044]
If voice quality modification is required for this synthesized speech data (#25, YES branch), the voice quality transformation unit 34 applies a linear frequency transformation or the like (#26); if it is not required (#25, NO branch), the synthesized speech data is sent to the code conversion unit 40 as is. The synthesized speech data is then converted into a voice code image as described above for steps #31 to #33, and the generated voice code image is temporarily stored in the voice code image storage unit 52.
[0045]
The image data stored in the image data storage unit 51 and the voice code image stored in the voice code image storage unit 52 are laid out by the image/voice code image layout editing unit 61 using a template selected in advance (#40).
[0046]
The image/voice code image synthesis processing unit 62 combines the image data and the voice code image data based on the layout information and generates print data (#41). When this print data is transmitted to the printer 3, the image and the voice code image are printed on a dedicated sheet, which is discharged as the photograph with sound 2 shown in the drawings (#50).
[0047]
In the embodiments described above, the image data and the voice code image are combined by the image/speech synthesis processing unit 60 and printed out by the printer 3. However, the image/speech synthesis processing unit 60 may be omitted, and the image data and the voice code image may be printed by separate printers. In that case, it is preferable to print the voice code image with a sticker printer and paste the resulting sticker onto the sheet on which the image is formed, for example a photographic print.
[0048]
Furthermore, in all the embodiments described above, the input text data is first converted into synthesized speech data by the speech synthesis unit 30, and this synthesized speech data is then converted into a voice code image. It is also possible, however, to convert the processed text data directly into a voice code image. As shown in FIG. 8, such a sound/image processing apparatus omits the speech synthesis unit 30 and instead provides, in the code conversion unit 40, a text/voice code image replacement unit 44, which splits the text data into predetermined elements and sequentially assigns the corresponding voice code image to each, and a voice code image dictionary 45, in which the voice code images corresponding to the text elements are registered. In other words, the final voice code image is created by assigning voice code images to the vocabulary items and phrases that make up the text and joining them together in order.
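The direct replacement of text elements by pre-registered voice code images can be sketched as below. It assumes each registered code image is a grid of equal height and that joining means placing the grids side by side; the text does not state either point. The function name, the whitespace splitting, and the dictionary contents are hypothetical.

```python
def text_to_code_image(text, code_image_dict):
    """Join the registered code image of each text element side by side."""
    rows = None
    for element in text.lower().split():
        piece = code_image_dict[element]
        if rows is None:
            rows = [list(r) for r in piece]
        else:
            if len(piece) != len(rows):
                raise ValueError("code image fragments differ in height")
            for row, extra in zip(rows, piece):
                row.extend(extra)
    return rows or []
```

Because the dictionary is only read and never modified, the same registered fragments can be reused for any number of messages.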
[Brief description of the drawings]
FIG. 1 is an external view showing one embodiment of an audio / image processing apparatus according to the present invention.
FIG. 2 is a functional block diagram of the audio / image processing apparatus according to FIG. 1;
FIG. 3 is an explanatory diagram showing a state in which sound is reproduced from a sound-added photo created by the sound / image processing apparatus.
FIG. 4 is a flowchart showing a procedure for creating a photograph with sound using the sound / image processing apparatus shown in FIG. 2;
FIG. 5 is a functional block diagram showing another embodiment of the sound / image processing apparatus according to the present invention.
FIG. 6 is an external view of the sound / image processing apparatus according to FIG. 5;
FIG. 7 is a flowchart showing a procedure for creating a photograph with sound using the sound / image processing apparatus shown in FIG. 5;
FIG. 8 is a functional block diagram showing still another embodiment of the sound / image processing apparatus according to the present invention.
[Explanation of symbols]
2 Image sheet with sound (photograph with sound)
3 Printing unit (silver halide photo printer, sublimation thermal transfer printer)
21 Image input processing unit
22 Voice input unit
23 Text input processing unit
24 Character recognition unit
30 Speech synthesis unit
31 Text-to-speech synthesis unit
32 Text analysis dictionary
33 Synthesized speech element dictionary
34 Voice quality transformation unit
35 Speech editing/synthesizing unit
36 Registered speech element dictionary
60 Image/speech synthesis processing unit

Claims

1. An audio/image processing apparatus provided with a code conversion unit for converting audio data into an audio code image encoded so as to be optically readable, and a printing unit for printing the audio code image and an image based on image data in order to create an image sheet with audio, the apparatus being characterized by comprising:
a text input processing unit for processing input text data; and
a speech synthesis unit for generating synthesized speech data based on the text data processed by the text input processing unit,
wherein the code conversion unit uses the synthesized speech data generated by the speech synthesis unit as source audio data for the audio code image, and thereby generates an audio code image that reproduces the text data as audio.

2. The audio/image processing apparatus according to claim 1, wherein the speech synthesis unit includes a text-to-speech synthesis unit that analyzes the input text data using a text analysis dictionary to identify its reading, and generates synthesized speech data, using a synthesized speech element dictionary, from a phoneme sequence obtained by setting accents and prosody.

3. The audio/image processing apparatus according to claim 2, wherein the synthesized speech elements stored in the synthesized speech element dictionary are created based on real voice data registered for each individual.

4. The audio/image processing apparatus according to claim 1, wherein the speech synthesis unit includes a speech editing/synthesizing unit that sequentially converts the input text data into synthesized speech data using a registered speech element dictionary storing pre-registered real voice data for vocabulary items and phrases.

5. The audio/image processing apparatus according to claim 4, wherein the registered speech element dictionary stores real voice data registered for each individual.

6. The audio/image processing apparatus according to claim 1, wherein the speech synthesis unit includes a voice quality transformation unit that transforms the voice quality of the synthesized speech data.

7. The audio/image processing apparatus according to any one of claims 1 to 6, further comprising a character recognition device, wherein text data output by the character recognition device is used for conversion into the audio code image.