JP7072390B2

JP7072390B2 - Sign language translator and program

Info

Publication number: JP7072390B2
Application number: JP2018007445A
Authority: JP
Inventors: 翼内田; 太郎宮▲崎▼
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2018-01-19
Filing date: 2018-01-19
Publication date: 2022-05-20
Anticipated expiration: 2038-01-19
Also published as: JP2019124901A

Description

The present invention relates to a sign language translation device and a program.

In generating sign language CG (computer graphics) animation, techniques for real-time translation from a spoken language such as Japanese into sign language are used. One such technique is machine translation based on a corpus of parallel Japanese and sign language data (see, for example, Patent Documents 1 and 2).

At present, to generate a sign language CG animation, the translation technique described above is first used to output a sign language word string as the translation result. Then, in the common approach, the sign language motion data corresponding to each word in the string is read, the motion data is combined sentence by sentence, and the result is played back on a CG model.

Patent Document 1: Japanese Unexamined Patent Publication No. 2013-186673
Patent Document 2: Japanese Unexamined Patent Publication No. 2014-21180

Because utterance length differs between spoken language and sign language even for words that express the same vocabulary, it is difficult to make the length of sign language generated as animation match the length of the original speech exactly. Moreover, sign language, a visual language that involves complex body movements, often has longer utterances than spoken language, which is an auditory language. Consequently, when transcribed speech is translated into sign language in real time, a delay arises between the original speech and the presentation of the sign language, and this delay accumulates as utterances follow one another. For example, when a sign language CG animation translated in real time from program audio is added to a television program, a delay arises between the main program and the sign language that is supposed to supplement the audio information, which places a heavy burden on viewers.

The present invention has been made in view of these circumstances and provides a sign language translation device and a program capable of sign language translation with reduced delay relative to the speech.

One aspect of the present invention is a sign language translation device comprising: a sign language translation unit that converts text of utterance content into sign language translation information in which pieces of motion data, each indicating frame by frame the sign language movement that represents a sign language word, are arranged in sequence; an utterance length acquisition unit that acquires a voice utterance length, which is the length of the speech of the utterance content, and a sign language utterance length, which is the total playback time of the motion data included in the sign language translation information; and a data editing unit that deletes some frames of the motion data included in the sign language translation information so that the sign language utterance length approaches the voice utterance length.

One aspect of the present invention is the sign language translation device described above, wherein the data editing unit deletes frames from each of the plurality of pieces of motion data included in the sign language translation information.

One aspect of the present invention is the sign language translation device described above, wherein the data editing unit acquires the importance of the sign language word represented by each of the plurality of pieces of motion data included in the sign language translation information, and deletes from each piece of motion data a proportion of frames based on the importance of the corresponding sign language word.

One aspect of the present invention is the sign language translation device described above, wherein the data editing unit acquires the importance of the sign language word represented by each of the plurality of pieces of motion data included in the sign language translation information and, based on the acquired importance, deletes frames from the plurality of pieces of motion data in units of sign language words.

One aspect of the present invention is a program for causing a computer to function as any of the sign language translation devices described above.

According to the present invention, sign language translation with reduced delay relative to the speech can be performed.

A functional block diagram showing a configuration example of a sign language translation device according to an embodiment of the present invention.
A flowchart showing an example of the processing procedure of the sign language translation device according to the embodiment.
A diagram showing a specific example of sign language translation according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
Real-time sign language translation for computer graphics (CG) animation is a technology with the potential for wide use in various fields. Typical sign language translation for CG animation simply translates all of the input, which is spoken language such as Japanese converted to text, into sign language. In the present embodiment, the term sign language refers to data representing the movement (motion) of sign language. It may be, for example, motion data that represents the position of each body part in each video frame as three-dimensional coordinates or the like, or video data of a person actually signing. The following description takes as an example the case where motion data that can be converted into sign language CG is used as the sign language. Hereinafter, each frame constituting the motion data is referred to as a "motion frame".

When sign language is generated by simply lining up the source motion data, its length is the sum of the playback times of all the pieces of motion data. In conventional sign language translation, a difference arises between the utterance time of the spoken language and the utterance time of the sign language (the time over which the motion is shown on video). In practical use, therefore, the sign language translation result lags the spoken language at every utterance, and the lag accumulates as time passes.

To address this problem, the sign language translation device of the present embodiment takes the temporal utterance length of the spoken language into account and thereby achieves flexible translation from spoken language into sign language. The first method increases the playback speed by thinning out motion frames from the motion data string that represents the sign language translation. The second method deletes the motion data of low-importance sign language words from the motion data string, yielding a free translation that omits details; that is, motion frames are deleted in units of sign language words. The two methods can also be combined. With these methods, the sign language translation device of the present embodiment automatically edits the motion data, which is the result of translating Japanese text into sign language, into an optimal form according to the utterance length of the input spoken language, and minimizes the delay relative to the speech. The device can thus present sign language video corresponding to program audio with the delay suppressed, relieving the burden on viewers.

FIG. 1 is a functional block diagram showing a configuration example of a sign language translation device 100 according to an embodiment of the present invention; only the functional blocks relevant to the present embodiment are shown. The sign language translation device 100 comprises a voice recognition result effective unit 20, a sign language translation unit 1, an utterance length comparison unit 2, and a data editing unit 3.

The voice recognition result effective unit 20 acquires a voice recognition result from a voice recognition device (not shown) provided inside or outside the sign language translation device 100. When use of the voice recognition result is enabled, the voice recognition result effective unit 20 inputs the Japanese text obtained as the voice recognition result to the sign language translation unit 1, and inputs the voice utterance length, which is the utterance time of the speech calculated during voice recognition, to the utterance length comparison unit 2.

The sign language translation unit 1 includes a Japanese-sign language translation unit 11 and a sign language motion data string generation unit 12. The sign language translation unit 1 translates the input Japanese text into sign language. Here, sign language translation means generating a sign language motion data string that represents the utterance content of the text. A sign language motion data string is a sequence of one or more pieces of sign language motion data, each representing one sign language word. The motion data used for CG animation is obtained by motion-capturing the movements of a real person, including the fingers and facial expressions, and saving them as motion data in a format such as BVH (Biovision Hierarchy). Each piece of sign language motion data records the sign language movement for one sign language word. Arranging the sign language motion data in time series and connecting the sign language words produces the motion data string of a sign language video that represents a sign language sentence.

The Japanese-sign language translation unit 11 translates the input Japanese text into a sign language word string. The resulting sign language word string is data in which the data numbers of sign language motion data are arranged in time series. For example, the Japanese-sign language translation unit 11 converts the utterance content 「日本の山田選手が…」 ("Japan's player Yamada ...") indicated by the Japanese text into the sign language word string "Japan", "mountain" (山), "rice field" (田), "player", and so on. The translation result is the data numbers that identify the sign language motion data representing each of these sign language words, arranged in the order in which the words appear. The Japanese-sign language translation unit 11 converts Japanese text into a sign language word string using any conventional technique. For example, the machine translation techniques of Patent Documents 1 and 2 can be used, but the technique is not limited to these. The sign language motion data string generation unit 12 generates sign language translation information by associating the sign language word string output from the Japanese-sign language translation unit 11 with the sign language motion data identified by each of the data numbers constituting the string, and outputs the sign language translation information to the utterance length comparison unit 2.
The sign language motion data string generation unit 12 reads the sign language motion data corresponding to each data number from a storage unit (not shown) included in the sign language translation unit 1 or from an external storage device.
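The relationship between the sign language word string (a list of data numbers) and the sign language translation information (the same list with motion data attached) can be sketched as follows. This is only an illustration under stated assumptions: the class `SignMotionData`, the store `MOTION_STORE`, the function `build_translation_info`, and all data numbers and frame counts are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SignMotionData:
    data_number: int   # identifies the motion data of one sign language word
    word: str          # gloss of the sign language word
    frame_count: int   # number of motion frames (the pose data itself is omitted)


# Hypothetical storage unit; data numbers and frame counts are made up.
MOTION_STORE: Dict[int, SignMotionData] = {
    1: SignMotionData(1, "Japan", 90),
    2: SignMotionData(2, "mountain", 70),
    3: SignMotionData(3, "rice field", 60),
    4: SignMotionData(4, "player", 100),
}


def build_translation_info(word_string: List[int],
                           store: Dict[int, SignMotionData]) -> List[SignMotionData]:
    """Attach motion data to each data number, preserving the word order."""
    return [store[number] for number in word_string]


info = build_translation_info([1, 2, 3, 4], MOTION_STORE)
print([m.word for m in info])
```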

The utterance length comparison unit 2 includes a voice utterance length calculation unit 21, a sign language utterance length calculation unit 22, and a voice/sign language utterance length comparison unit 23. The utterance length comparison unit 2 operates as an utterance length acquisition unit that acquires the voice utterance length, which is the length of the speech of the utterance content, and the sign language utterance length, which is the total playback time of the sign language motion data included in the sign language translation information.

The voice utterance length calculation unit 21 calculates a predicted voice utterance length. From the input Japanese text, it calculates a prediction of the voice utterance length, that is, the total utterance time that would result if the utterance content were synthesized as speech. This calculation can be realized using existing speech synthesis technology or the like. The voice utterance length calculation unit 21 outputs the predicted voice utterance length to the voice/sign language utterance length comparison unit 23.

The sign language utterance length calculation unit 22 receives the sign language translation information generated by the sign language translation unit 1. It calculates the sign language utterance length, which is the predicted total utterance length when the sign language word string of the translation result is rendered as CG animation. Specifically, the sign language utterance length calculation unit 22 adds up the frame lengths of the sign language motion data of all the sign language words included in the sign language translation information and multiplies the sum by the display time of one frame to obtain the predicted sign language utterance length.
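The calculation described above amounts to a sum followed by one multiplication. A minimal sketch, assuming the frame rate is given in frames per second; the function name `sign_utterance_length` is hypothetical:

```python
from typing import Sequence


def sign_utterance_length(frame_counts: Sequence[int], fps: float) -> float:
    """Predicted sign language utterance length in seconds."""
    total_frames = sum(frame_counts)
    # Dividing by fps is equivalent to multiplying by the
    # display time of one frame (1 / fps seconds).
    return total_frames / fps


# Two pieces of motion data of 150 frames each at 60 fps.
print(sign_utterance_length([150, 150], 60.0))
```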

The voice/sign language utterance length comparison unit 23 compares the voice utterance length input from the voice utterance length calculation unit 21 with the sign language utterance length input from the sign language utterance length calculation unit 22. When the device is combined with existing voice recognition technology, the total utterance length measured when the voice recognition device performed recognition may be retained and used in place of the voice utterance length input from the voice utterance length calculation unit 21. In that case, the voice/sign language utterance length comparison unit 23 receives the voice utterance length from the voice recognition result effective unit 20. The voice/sign language utterance length comparison unit 23 calculates the utterance length difference, which is the difference of the sign language utterance length relative to the voice utterance length. When the voice utterance length is longer, the voice/sign language utterance length comparison unit 23 outputs the sign language motion data string of the sign language translation information as is to the outside of the sign language translation device 100, bypassing the data editing unit 3. When the sign language utterance length is longer, it outputs the sign language translation information, the voice utterance length, and the utterance length difference to the data editing unit 3.
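The comparison can be sketched as below. The function `utterance_length_difference` and its parameter names are hypothetical; the sketch assumes that a measured length from the voice recognition device, when available, simply replaces the predicted one, and that editing is requested only when the sign language is strictly longer.

```python
from typing import Optional, Tuple


def utterance_length_difference(predicted_voice_len: float,
                                sign_len: float,
                                recognized_voice_len: Optional[float] = None
                                ) -> Tuple[float, bool]:
    """Return (utterance length difference, whether editing is needed)."""
    # Prefer the length measured during voice recognition when it exists.
    voice_len = (recognized_voice_len
                 if recognized_voice_len is not None
                 else predicted_voice_len)
    diff = sign_len - voice_len   # positive when the sign language is longer
    return diff, diff > 0


print(utterance_length_difference(3.0, 5.0))
```

When the second element is `False`, the motion data string would be output unchanged.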

The data editing unit 3 includes a mode selection unit 31, a thinning mode processing unit 32, a summary mode processing unit 33, and a weighted thinning mode processing unit 34. Based on a comparison between the utterance length difference input from the voice/sign language utterance length comparison unit 23 and a threshold, or based on a setting made in advance by the user, the mode selection unit 31 outputs the sign language translation information to one of the thinning mode processing unit 32, the summary mode processing unit 33, and the weighted thinning mode processing unit 34. The threshold may be set by the user, or a predetermined value may be used.
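A possible shape for the mode selection is sketched below. The names `Mode` and `select_mode` are hypothetical. Falling back to summary mode when the difference reaches the threshold is an assumption made only for this illustration; the passage above does not state which mode is chosen in that case.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    THINNING = "thinning"
    SUMMARY = "summary"
    WEIGHTED_THINNING = "weighted_thinning"


def select_mode(diff_seconds: float,
                threshold_seconds: float,
                preset: Optional[Mode] = None) -> Mode:
    """Choose an editing mode from a user preset or a threshold comparison."""
    if preset is not None:
        return preset            # a user setting takes precedence
    if diff_seconds < threshold_seconds:
        return Mode.THINNING     # small difference: thin out frames
    return Mode.SUMMARY          # assumption: large difference, drop whole words


print(select_mode(1.0, 2.0))
```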

The thinning mode processing unit 32 deletes, from each piece of sign language motion data included in the sign language translation information, all motion frames at a fixed frame interval determined according to the utterance length difference input from the voice/sign language utterance length comparison unit 23. This shortens the motion of each sign language word and, as a result, the overall length. The thinning mode processing unit 32 includes a frame length reduction rate calculation unit 321 and a frame length conversion processing unit 322. The frame length reduction rate calculation unit 321 calculates the reduction rate, which is the proportion of motion frames to be deleted from the sign language motion data string included in the sign language translation information. The frame length conversion processing unit 322 deletes motion frames from each piece of sign language motion data included in the sign language translation information according to the reduction rate calculated by the frame length reduction rate calculation unit 321.

The summary mode processing unit 33 deletes sign language motion data, in units of sign language words, from the sign language motion data string included in the sign language translation information according to the importance of the sign language words. This shortens the overall length of the sign language translation result. The summary mode processing unit 33 includes a sign language word importance calculation unit 331 and a sign language word omission processing unit 332. The sign language word importance calculation unit 331 calculates the importance of the sign language word represented by each piece of sign language motion data included in the sign language translation information. For example, it performs morphological analysis on the sign language word string and determines the importance of each sign language word based on its part of speech. The sign language word omission processing unit 332 deletes some of the sign language motion data from the sign language motion data string included in the sign language translation information based on the importance calculated by the sign language word importance calculation unit 331.
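Word-level omission driven by part of speech might look like the following sketch. Everything specific here is an assumption for illustration: the weight table `POS_IMPORTANCE`, the function `summarize`, and the stopping rule (drop the least important words until the total fits a target frame count) are not taken from the patent, which only says that importance is derived from the part of speech.

```python
from typing import List, Tuple

# Hypothetical part-of-speech weights; higher means more important.
POS_IMPORTANCE = {"noun": 3, "verb": 3, "adjective": 2, "adverb": 1}

Word = Tuple[str, str, int]  # (gloss, part of speech, frame count)


def summarize(words: List[Word], target_frames: int) -> List[Word]:
    """Drop the least important words until the total fits the target."""
    total = sum(frames for _, _, frames in words)
    # Indices sorted by ascending importance; sorted() is stable,
    # so earlier words come first among equally important ones.
    order = sorted(range(len(words)),
                   key=lambda i: POS_IMPORTANCE.get(words[i][1], 1))
    dropped = set()
    for i in order:
        if total <= target_frames:
            break
        dropped.add(i)
        total -= words[i][2]
    # Keep the surviving words in their original order.
    return [w for i, w in enumerate(words) if i not in dropped]


sentence = [("Japan", "noun", 90), ("very", "adverb", 60),
            ("strong", "adjective", 80), ("player", "noun", 100)]
print([gloss for gloss, _, _ in summarize(sentence, 280)])
```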

The weighted thinning mode processing unit 34 deletes motion frames from the sign language motion data according to the importance of the sign language words. The weighted thinning mode processing unit 34 includes a sign language word importance calculation unit 341, a frame length reduction rate calculation unit 342, and a frame length conversion processing unit 343. Like the sign language word importance calculation unit 331, the sign language word importance calculation unit 341 calculates the importance of the sign language word represented by each piece of sign language motion data included in the sign language translation information. The frame length reduction rate calculation unit 342 calculates a reduction rate for each piece of sign language motion data according to the importance of its sign language word. The lowest reduction rate may be 0. The frame length conversion processing unit 343 deletes motion frames from each piece of sign language motion data included in the sign language translation information according to the reduction rate calculated by the frame length reduction rate calculation unit 342.
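One way to turn importance into per-word reduction rates is sketched below. The allocation rule is an assumption chosen for illustration, not the patent's formula: each word's rate is proportional to how far its importance falls below the maximum, so the most important words get a rate of 0, and the rates are scaled so that the deleted frames add up to the requested number. The name `weighted_reduction_rates` is hypothetical.

```python
from typing import List, Sequence


def weighted_reduction_rates(frame_counts: Sequence[int],
                             importances: Sequence[int],
                             frames_to_delete: int) -> List[float]:
    """Per-word reduction rates; less important words lose more frames."""
    max_imp = max(importances)
    # Weighted frame total: the most important words contribute nothing.
    total_weight = sum((max_imp - imp) * n
                       for imp, n in zip(importances, frame_counts))
    if total_weight == 0:
        # All words equally important: fall back to one uniform rate.
        uniform = min(1.0, frames_to_delete / sum(frame_counts))
        return [uniform] * len(frame_counts)
    return [min(1.0, frames_to_delete * (max_imp - imp) / total_weight)
            for imp in importances]


# A 100-frame important word and a 100-frame minor word, 50 frames to remove.
print(weighted_reduction_rates([100, 100], [3, 1], 50))
```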

FIG. 2 is a flowchart showing an example of the processing procedure of the sign language translation device 100 of the present embodiment. When voice recognition is not used (step S105: NO), the sign language translation device 100 reads Japanese text (step S110). The Japanese-sign language translation unit 11 translates the read Japanese text into a sign language word string in which the data numbers of sign language motion data are arranged in time series (step S115). The voice utterance length calculation unit 21 of the utterance length comparison unit 2 calculates, from the Japanese text read in step S110, a prediction of the voice utterance length that would result if the utterance content were synthesized as speech (step S120).

The sign language motion data string generation unit 12 associates the sign language word string output from the Japanese-sign language translation unit 11 with the sign language motion data identified by the data number of each sign language word, and outputs the result as sign language translation information in a standardized format such as a text file or array data (step S125). The sign language utterance length calculation unit 22 calculates a prediction of the sign language utterance length when the sign language word string is rendered as CG animation, based on the total of the frame lengths of the sign language motion data included in the sign language translation information (step S130).

On the other hand, when voice recognition is used (step S105: YES), the voice recognition device inside or outside the sign language translation device 100 performs voice recognition (step S135). When the voice recognition result is set to be valid (step S140: YES), the voice recognition result effective unit 20 acquires the voice recognition result from the voice recognition device and outputs it to the sign language translation unit 1. The voice recognition result effective unit 20 also acquires from the voice recognition device the total utterance length of the recognized speech and outputs it to the voice/sign language utterance length comparison unit 23.

The Japanese-sign language translation unit 11 of the sign language translation unit 1 reads the Japanese text indicated by the voice recognition result from the voice recognition result effective unit 20 (step S145). The Japanese-sign language translation unit 11 performs the same processing as in step S115 and translates the Japanese text of the voice recognition result into a sign language word string (step S150). After step S150, the sign language translation device 100 performs steps S125 and S130. When the voice recognition result is not set to be valid in the voice recognition result effective unit 20 (step S140: NO), the sign language translation device 100 performs the processing from step S110.

After step S130, the voice/sign language utterance length comparison unit 23 compares the voice utterance length with the sign language utterance length (step S155). When voice recognition is not enabled, this voice utterance length is the one calculated by the voice utterance length calculation unit 21; when voice recognition is enabled, it is the total utterance length of the recognized speech that the voice recognition result effective unit 20 acquired from the voice recognition device. When the voice/sign language utterance length comparison unit 23 determines that the sign language utterance length is less than or equal to the voice utterance length (step S155: voice ≥ sign language), it outputs the sign language motion data string of the sign language translation information as is to the outside of the sign language translation device 100.

When the voice/sign language utterance length comparison unit 23 determines that the sign language utterance length is longer than the voice utterance length, it outputs to the mode selection unit 31 the sign language translation information, the utterance length difference obtained by subtracting the voice utterance length from the sign language utterance length, and the voice utterance length. The mode selection unit 31 determines the mode (step S160). When the utterance length difference is smaller than the threshold, the mode selection unit 31 selects the thinning mode (step S160: thinning mode) and outputs the sign language translation information, the utterance length difference, and the voice utterance length to the thinning mode processing unit 32. The frame length reduction rate calculation unit 321 of the thinning mode processing unit 32 calculates the reduction rate, which is the proportion of motion frames to be deleted from the sign language motion data string so that the sign language utterance length approaches the voice utterance length (step S165). The frame length reduction rate calculation unit 321 raises the reduction rate as the utterance length difference increases, so that the interval between deleted motion frames becomes shorter. An example of determining the frame deletion interval is given below, although the method is not limited to this example.

Suppose that the sign language translation result consists of sign language motion data A and sign language motion data B, each composed of 150 motion frames. Suppose also that the sign language motion data is at 60 fps (frames per second) and that the utterance length difference is +2.0 seconds. In this case, the number of frames corresponding to the utterance length difference is (+2.0 seconds) × (60 fps) = 120 frames. In other words, deleting 120 motion frames from the entire sign language motion data string included in the sign language translation information makes the voice utterance length and the sign language utterance length equal.

Because the sign language translation result consists of two sign language motion data items, 120/2 = 60 frames should be deleted from each of them. Simply dividing the 60 frames to delete by the 150 frames of one sign language motion data item gives 60/150 = 1/2.5, which lies between a reduction rate of 1/2 (delete 1 frame out of every 2) and a reduction rate of 1/3 (delete 1 frame out of every 3). Here the higher reduction rate, 1/2, is selected and the frame deletion interval is set to "1". Deleting frames from 150 frames at a frame deletion interval of "1" removes 150 × (reduction rate 1/2) = 75 frames. Strictly speaking, this does not delete exactly 60 frames, but it does bring the sign language utterance length closer to the voice utterance length.
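The arithmetic of this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the reduction rate is restricted to the form 1/n (delete one frame in every n), the higher of the two neighboring rates is chosen as in the example, and the function name `frame_deletion_plan` and its return layout are hypothetical and do not appear in the specification.

```python
def frame_deletion_plan(diff_seconds, fps, frames_per_motion, num_motions):
    """Return (n, interval, deleted) for a reduction rate of 1/n, or None.

    Assumption: the rate has the form 1/n and n is rounded down, which
    selects the higher of the two neighboring reduction rates.
    """
    # Frames corresponding to the utterance length difference: 2.0 s * 60 fps = 120.
    excess_frames = round(diff_seconds * fps)
    # Share of the excess assigned to each motion data item: 120 / 2 = 60.
    per_motion = excess_frames / num_motions
    if per_motion <= 0:
        return None  # the sign language is not longer, so nothing is deleted
    # 150 / 60 = 2.5 lies between n = 2 and n = 3; rounding down gives rate 1/2.
    # n is clamped to 2 because n = 1 would delete every frame.
    n = max(2, int(frames_per_motion // per_motion))
    interval = n - 1                   # frames kept between two deleted frames
    deleted = frames_per_motion // n   # frames actually removed per item
    return n, interval, deleted


print(frame_deletion_plan(2.0, 60, 150, 2))  # (2, 1, 75)
```

With the figures from the text, the sketch reproduces the reduction rate of 1/2, the frame deletion interval of 1, and the 75 deleted frames per motion data item.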

The frame length conversion processing unit 322 deletes motion frames from each of the sign language motion data A and B included in the sign language translation information according to the frame deletion interval calculated by the frame length reduction rate calculation unit 321, and outputs the result to the outside of the sign language translation device 100 (step S170). Alternatively, the frame length conversion processing unit 322 may delete motion frames at equal intervals from the full sequence of motion frames obtained by lining up all the sign language motion data included in the sign language translation information, so that the sign language utterance length approaches the voice utterance length.
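A minimal sketch of the frame deletion in step S170 is shown below. It assumes a reduction rate of 1/n, under which every n-th frame of a motion data item is dropped; the helper name `thin_frames` and the use of plain integer lists as stand-ins for motion frames are illustrative choices, not part of the specification.

```python
def thin_frames(frames, n):
    """Delete every n-th frame (reduction rate 1/n) and return the rest."""
    if n < 2:
        raise ValueError("n must be at least 2; n = 1 would delete every frame")
    # Frames are counted from 1 so that frames n, 2n, 3n, ... are removed.
    return [f for i, f in enumerate(frames, start=1) if i % n != 0]


motion_a = list(range(150))  # stand-in for the 150 frames of motion data A
motion_b = list(range(150))  # stand-in for the 150 frames of motion data B
thinned = [thin_frames(m, 2) for m in (motion_a, motion_b)]
print([len(m) for m in thinned])  # [75, 75]
```

Played back at the unchanged frame rate, the thinned data takes half as long, which is how frame deletion shortens the sign language utterance length.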

When the utterance length difference is equal to or greater than the threshold, the mode selection unit 31 selects the summary mode (step S160: summary mode). The mode selection unit 31 outputs the sign language translation information and the voice utterance length to the summary mode processing unit 33. The sign language word importance calculation unit 331 of the summary mode processing unit 33 calculates the importance of each word represented by the sign language motion data included in the sign language translation information (step S175). The importance can be calculated, for example, with a general morphological analysis technique as described below, although the method is not limited to this.

The sign language word importance calculation unit 331 applies a general morphological analysis technique, sentence by sentence, to the word string represented by the sign language motion data included in the sign language translation information, dividing it into words and assigning a part of speech to each word. The sign language word importance calculation unit 331 then obtains the importance of each word based on preset data that associates parts of speech with importance and preset data that relates words to importance. For the importance per part of speech, nouns, proper nouns, verbs, and the like are given high importance, while particles, adjectives, and adverbs are given low importance, for example. The importance of a word may also be determined by the words before and after it, or by the parts of speech of those words. When the importance of a word differs from the importance of the part of speech into which it is classified, the importance of the word takes precedence. This allows the important words to be tuned to the translation target. For example, when the audio of a sports program is translated into sign language, even if nouns have high priority under the general rule, the priority of the noun "player" can be lowered when it is preceded by a proper noun representing a name. The sign language word importance calculation unit 331 may instead perform the morphological analysis on the Japanese text. In that case, the importance of a word in the Japanese text is used as the importance of the corresponding sign language word.
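The lookup order described above (context rule, then word-level importance, then part-of-speech importance) might be sketched as follows. The sketch assumes that morphological analysis has already produced (word, part of speech) pairs; the tables `POS_IMPORTANCE` and `WORD_IMPORTANCE`, the numeric scores, and the function names are all hypothetical, since the specification gives no concrete data format.

```python
# Hypothetical part-of-speech scores (higher means more important).
POS_IMPORTANCE = {"noun": 4, "proper_noun": 5, "verb": 4,
                  "particle": 1, "adjective": 2, "adverb": 2}
# Hypothetical word-level scores; these take precedence over the POS score.
WORD_IMPORTANCE = {}


def context_rule(tokens, i):
    """Return an override score for token i based on its neighbors, or None."""
    word, _ = tokens[i]
    # Example from the text: "player" directly after a proper noun is lowered.
    if word == "player" and i > 0 and tokens[i - 1][1] == "proper_noun":
        return 2
    return None


def word_importance(tokens):
    """tokens: list of (word, part_of_speech) pairs for one sentence."""
    scores = []
    for i, (word, pos) in enumerate(tokens):
        score = context_rule(tokens, i)
        if score is None:
            # Word-level importance wins over part-of-speech importance.
            score = WORD_IMPORTANCE.get(word, POS_IMPORTANCE.get(pos, 1))
        scores.append(score)
    return scores


tokens = [("Japan", "noun"), ("mountain", "proper_noun"),
          ("field", "proper_noun"), ("player", "noun")]
print(word_importance(tokens))  # [4, 5, 5, 2]
```

Keeping the word-level table separate from the part-of-speech table mirrors the precedence rule in the text and lets the scores be tuned per program genre without touching the general rule.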

So that the sign language utterance length approaches the voice utterance length, the sign language word omission processing unit 332 deletes one or more sign language motion data items from the sign language motion data string indicated by the sign language translation information, starting with the sign language words to which the sign language word importance calculation unit 331 assigned the lowest priority (step S180). As in the processing of the thinning mode processing unit 32, the sign language word omission processing unit 332 determines the number of sign language motion data items to delete so that more items are deleted as the utterance length difference output by the voice/sign language utterance length comparison unit 23 increases. In doing so, the sign language word omission processing unit 332 ensures that the sign language motion data string remaining after deletion does not fall to or below a preset minimum number of words. Setting in advance a threshold for the minimum number of words that are never deleted prevents a meaningless translation result from being output. The sign language word omission processing unit 332 outputs the result of deleting the low-priority sign language motion data from the sign language motion data string indicated by the sign language translation information to the outside of the sign language translation device 100.
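The word omission of step S180 could look roughly like the sketch below. It assumes each sign language word is given as a (label, importance, frame count) tuple, that ties in importance are broken by position in the sentence, and that the number of remaining words must stay strictly above the preset minimum; the function name `summarize`, the tuple layout, and the frame counts are illustrative assumptions.

```python
def summarize(words, target_frames, min_words):
    """Drop lowest-importance words until the target length is reached.

    words: list of (label, importance, frame_count) in signing order.
    min_words: preset minimum; the number of remaining words is kept
    strictly above this value (one reading of the text, an assumption).
    """
    total = sum(frames for _, _, frames in words)
    removed = set()
    # Visit candidates from lowest importance; sorted() is stable, so ties
    # are broken by position in the sentence.
    for i in sorted(range(len(words)), key=lambda i: words[i][1]):
        remaining = len(words) - len(removed)
        if total <= target_frames or remaining - 1 <= min_words:
            break
        removed.add(i)
        total -= words[i][2]
    # Preserve the original signing order of the surviving words.
    return [w for i, w in enumerate(words) if i not in removed]


words = [("Japan", 4, 150), ("mountain", 5, 150),
         ("field", 5, 150), ("player", 2, 150)]
kept = summarize(words, target_frames=300, min_words=1)
print([label for label, _, _ in kept])  # ['mountain', 'field']
```

The minimum word count acts as a hard stop, so a very short voice utterance length cannot reduce the output to an empty or meaningless word string.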

The mode selection unit 31 may also, according to a setting entered in advance, select the weighted thinning mode instead of the thinning mode (step S160: weighted thinning mode). The mode selection unit 31 outputs the sign language translation information and the voice utterance length to the weighted thinning mode processing unit 34. The sign language word importance calculation unit 341 of the weighted thinning mode processing unit 34 calculates the importance of each word represented by the sign language motion data included in the sign language translation information, using the same processing as the sign language word importance calculation unit 331 (step S185). The frame length reduction rate calculation unit 342 calculates a different number of frames to delete for each sign language motion data item according to the importance of the sign language word calculated by the sign language word importance calculation unit 341 (step S190). The frame length reduction rate calculation unit 342 sets a lower frame length reduction rate for sign language motion data that represents a more important sign language word. This allows the utterance length to be changed even more flexibly. The frame length conversion processing unit 343 deletes motion frames from the sign language motion data string according to the reduction rate calculated for each sign language motion data item by the frame length reduction rate calculation unit 342, and outputs the result to the outside of the sign language translation device 100 (step S195).
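One possible way to turn importance into per-item deletion counts (steps S190 and S195) is sketched below. The specification does not give a formula, so the inverse-importance weighting used here, the function names `weighted_deletions` and `drop_evenly`, and the sample numbers are all assumptions made for illustration.

```python
def weighted_deletions(importances, frame_counts, excess_frames):
    """Split excess_frames across items, deleting more from less important ones."""
    top = max(importances)
    # Lower importance gives a larger weight, hence a higher reduction rate.
    weights = [(top + 1 - imp) * n for imp, n in zip(importances, frame_counts)]
    total = sum(weights)
    # Integer division may leave a few frames undeleted; at least one frame
    # of every item is always kept.
    return [min(n - 1, excess_frames * w // total)
            for w, n in zip(weights, frame_counts)]


def drop_evenly(frames, k):
    """Remove k frames at evenly spaced positions."""
    if k <= 0:
        return list(frames)
    drop = {int((j + 0.5) * len(frames) / k) for j in range(k)}
    return [f for i, f in enumerate(frames) if i not in drop]


importances = [4, 5, 5, 2]          # hypothetical importance per word
frame_counts = [150, 150, 150, 150]  # hypothetical frames per word
plan = weighted_deletions(importances, frame_counts, excess_frames=120)
print(plan)  # [30, 15, 15, 60]
print(len(drop_evenly(list(range(150)), plan[0])))  # 120
```

In this sketch the least important word loses four times as many frames as the most important ones, while the total number of deleted frames still matches the 120-frame excess.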

The mode selection unit 31 can also select the hybrid mode according to the value of the utterance length difference. For example, the mode selection unit 31 uses a first threshold and a second threshold larger than the first threshold; it selects the summary mode or the weighted thinning mode when the utterance length difference is less than the first threshold, and selects the summary mode when the difference is equal to or greater than the first threshold and less than the second threshold. The mode selection unit 31 selects the hybrid mode when the utterance length difference is equal to or greater than the second threshold. Alternatively, the mode selection unit 31 may switch between the summary mode and the hybrid mode according to a setting entered in advance.

When the mode selection unit 31 selects the hybrid mode (step S160: hybrid mode), the summary mode processing unit 33 first performs the same processing as steps S175 to S180 and coarsely reduces the sign language utterance length in units of sign language motion data (steps S200 and S205). The summary mode processing unit 33 inputs the sign language translation information, updated with the reduced sign language motion data string, to the thinning mode processing unit 32. The thinning mode processing unit 32 performs the same processing as steps S165 to S170 using the updated sign language translation information (steps S210 and S215). Here, however, the thinning mode processing unit 32 uses as the utterance length difference the difference between the voice utterance length and the sign language utterance length of the sign language translation information output by the summary mode processing unit 33. This makes it possible to fine-tune the utterance length frame by frame.
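The two-stage flow of the hybrid mode can be sketched as below. The text does not say how many words the summary stage removes before thinning takes over, so the `max_deleted_words` parameter is a hypothetical knob; the function name `hybrid_edit`, the (label, importance, frames) layout, and the 1/n thinning rule are likewise assumptions.

```python
def hybrid_edit(words, voice_frames, max_deleted_words=1):
    """words: list of (label, importance, frames). Returns (label, frames) pairs."""
    # Stage 1 (summary): drop up to max_deleted_words lowest-importance words.
    kept = list(words)
    for _ in range(max_deleted_words):
        if sum(len(w[2]) for w in kept) <= voice_frames or len(kept) <= 1:
            break
        kept.remove(min(kept, key=lambda w: w[1]))
    # Stage 2 (thinning): recompute the difference against the shortened string.
    total = sum(len(w[2]) for w in kept)
    excess = total - voice_frames
    if excess <= 0:
        return [(label, frames) for label, _, frames in kept]
    # Reduction rate 1/n, rounded toward the higher rate; n = 1 is excluded
    # because it would delete every frame.
    n = max(2, total // excess)
    return [(label, [f for i, f in enumerate(frames, 1) if i % n != 0])
            for label, _, frames in kept]


words = [("Japan", 4, list(range(150))), ("mountain", 5, list(range(150))),
         ("field", 5, list(range(150))), ("player", 2, list(range(150)))]
result = hybrid_edit(words, voice_frames=300)
print([(label, len(frames)) for label, frames in result])
# [('Japan', 100), ('mountain', 100), ('field', 100)]
```

Because the second stage measures the difference after the word deletion, the thinning rate is gentler (1/3 here) than it would be if thinning alone had to absorb the whole 300-frame excess.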

After the processing of step S170, step S180, step S195, or step S215, the sign language translation unit 1 determines whether the next input has been detected (step S220). When the sign language translation unit 1 detects the next input (step S220: YES), the sign language translation device 100 repeats the processing from step S105. When the sign language translation unit 1 does not detect the next input (step S220: NO), the sign language translation device 100 ends the processing of FIG. 2. The sign language motion data string output from the sign language translation device 100 is converted by CG graphics processing into a CG animation that visualizes the sign language motion. The sign language translation unit 1 may perform the processing of steps S105 to S215 for the utterance content of each sentence, or for several sentences together.

FIG. 3 shows a specific example of sign language translation. FIG. 3(a) shows an example of a conventional sign language translation technique. Compared with the utterance time of program audio or of synthetic speech generated from text, the utterance time of sign language that translates the same content as it stands tends to lag behind. For example, suppose that speech uttered at a fairly fast pace, "Japan's player Yamada ...", is translated into sign language in real time and presented as a CG animation. Even if the delays of the sign language translation and of the CG animation synthesis are ignored, the sign language translation result is a concatenation of sign language motion data that signs "Japan", "mountain", "field", "player", ... in that order (the name Yamada is written with the characters for "mountain" and "field"), steadily and at the unmodified average speed at which the motion was captured. As a result, the delay relative to the speech being translated gradually increases. The present embodiment solves this problem by shortening the utterance length of the sign language translation with the thinning mode shown in FIG. 3(b), the summary mode shown in FIG. 3(c), and so on.

When the Japanese text "Japan's player Yamada ..." to be translated into sign language is input to the sign language translation device 100, the sign language translation unit 1 generates a sign language motion data string representing the sign language word string "Japan", "mountain", "field", "player", .... Next, the utterance length comparison unit 2 calculates predictions of the voice utterance length obtained when the Japanese text is spoken and of the sign language utterance length of the sign language word string. The utterance length comparison unit 2 compares these utterance lengths, calculates the utterance length difference representing the delay of the sign language relative to the speech, and outputs it to the data editing unit 3. If, for example, the threshold for the utterance length difference is set to 10 seconds, the mode selection unit 31 of the data editing unit 3 selects the thinning mode when the utterance length difference is less than 10 seconds and the summary mode when it is 10 seconds or more. A fixed mode can also be selected by having the user set the mode in the mode selection unit 31 in advance.
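The threshold-based mode selection described here can be written as a small function. The sketch assumes the single 10-second threshold from this example, treats a non-positive difference as requiring no editing, and uses hypothetical names (`select_mode`, `fixed_mode`) and mode strings.

```python
def select_mode(diff_seconds, threshold=10.0, fixed_mode=None):
    """Pick an editing mode from the utterance length difference in seconds."""
    if diff_seconds <= 0:
        return None  # sign language is not longer than the speech (assumption)
    if fixed_mode is not None:
        return fixed_mode  # mode preset by the user takes precedence
    # Below the threshold: thin frames; at or above it: omit words.
    return "thinning" if diff_seconds < threshold else "summary"


print(select_mode(2.0))    # thinning
print(select_mode(10.0))   # summary
print(select_mode(12.0, fixed_mode="weighted_thinning"))  # weighted_thinning
```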

When the thinning mode is selected, the data editing unit 3 deletes motion frames at equal intervals, based on the value of the utterance length difference, from the sign language motion data corresponding to each of the sign language words "Japan", "mountain", "field", "player", ..., as shown in FIG. 3(b). Deleting motion frames raises the playback speed of the motion of each sign language word, which shortens the sign language utterance length.

When the summary mode is selected, the data editing unit 3 calculates the importance of each sign language word included in one utterance, as shown in FIG. 3(c). In the figure, the importance of "Japan" is 4, that of "mountain" is 5, that of "field" is 5, and that of "player" is 2. To minimize the utterance length difference, the data editing unit 3 deletes the sign language motion data of the less important sign language words "Japan" and "player" first, and uses as the sign language translation information a sign language motion data string consisting only of the highly important "mountain" and "field". Omitting detail and producing a free sign language translation in this way shortens the sign language utterance length.

When using the hybrid mode, which combines the two modes described above, the data editing unit 3 deletes in the summary mode only "player", the least important word in the sign language word string "Japan", "mountain", "field", "player", .... For the remaining sign language word string "Japan", "mountain", "field", the data editing unit 3 shortens the frame length in the thinning mode.

According to the embodiment described above, the sign language translation device includes a sign language translation unit, an utterance length acquisition unit, and a data editing unit. The utterance length acquisition unit corresponds to the utterance length comparison unit 2 of the embodiment. The sign language translation unit converts the text of the utterance content into sign language translation information in which motion data are arranged, each motion data item indicating frame by frame the sign language movement that represents a sign language word. The utterance length acquisition unit acquires the voice utterance length, which is the length of the speech of the utterance content, and the sign language utterance length, which is the sum of the playback times of the motion data included in the sign language translation information. The data editing unit deletes some frames of the motion data included in the sign language translation information so that the sign language utterance length approaches the voice utterance length. For example, the data editing unit deletes frames evenly from each of the plurality of motion data included in the sign language translation information. Alternatively, the data editing unit acquires the importance of the sign language word represented by each of the plurality of motion data included in the sign language translation information, and deletes from each of the plurality of motion data a proportion of frames based on the importance of the corresponding sign language word. As a further alternative, the data editing unit acquires the importance of the sign language word represented by each of the plurality of motion data included in the sign language translation information and, based on the acquired importance, deletes frames in units of sign language words from the plurality of motion data included in the sign language translation information.

By automatically editing the sign language word string and each sign language motion data item into an optimal form according to the utterance length of the input spoken language, the sign language translation device of the present embodiment minimizes the delay relative to the speech during translation and achieves automatic translation closer to the work of a human sign language interpreter.

The sign language translation device 100 described above contains a computer system. The steps of the operation of the sign language translation device 100 are stored in the form of a program on a computer-readable recording medium, and the processing described above is performed when the computer system reads and executes this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

When a WWW system is used, the "computer system" also includes the environment for providing (or displaying) web pages.

The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as media that hold the program for a certain period, such as the volatile memory inside the computer system that acts as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

1 ... Sign language translation unit
2 ... Utterance length comparison unit
3 ... Data editing unit
11 ... Japanese-sign language translation unit
12 ... Sign language motion data string generation unit
20 ... Voice recognition result effective unit
21 ... Voice utterance length calculation unit
22 ... Sign language utterance length calculation unit
23 ... Voice/sign language utterance length comparison unit
31 ... Mode selection unit
32 ... Thinning mode processing unit
33 ... Summary mode processing unit
34 ... Weighted thinning mode processing unit
100 ... Sign language translation device
321, 342 ... Frame length reduction rate calculation unit
322, 343 ... Frame length conversion processing unit
331, 341 ... Sign language word importance calculation unit
332 ... Sign language word omission processing unit

Claims

A sign language translation device comprising:
a sign language translation unit that converts the text of utterance content into sign language translation information in which motion data are arranged, each motion data item indicating frame by frame the sign language movement that represents a sign language word;
an utterance length acquisition unit that acquires a voice utterance length, which is the length of the speech of the utterance content, and a sign language utterance length, which is the sum of the playback times of the motion data included in the sign language translation information; and
a data editing unit that deletes some frames of the motion data included in the sign language translation information so that the sign language utterance length approaches the voice utterance length,
wherein the data editing unit acquires the importance of the sign language word represented by each of the plurality of motion data included in the sign language translation information, and deletes from each of the plurality of motion data a proportion of frames based on the importance of the corresponding sign language word.

A sign language translation device comprising:
a sign language translation unit that converts the text of utterance content into sign language translation information in which motion data are arranged, each motion data item indicating frame by frame the sign language movement that represents a sign language word;
an utterance length acquisition unit that acquires a voice utterance length, which is the length of the speech of the utterance content, and a sign language utterance length, which is the sum of the playback times of the motion data included in the sign language translation information; and
a data editing unit that deletes some frames of the motion data included in the sign language translation information so that the sign language utterance length approaches the voice utterance length,
wherein the data editing unit acquires the importance of the sign language word represented by each of the plurality of motion data included in the sign language translation information and, based on the acquired importance, deletes frames in units of sign language words from the plurality of motion data included in the sign language translation information.

The sign language translation device according to claim 2, wherein the data editing unit deletes frames from each of the plurality of motion data included in the sign language translation information.

A program for causing a computer to function as the sign language translation device according to any one of claims 1 to 3.