JP6044490B2

JP6044490B2 - Information processing apparatus, speech speed data generation method, and program

Info

Publication number: JP6044490B2
Application number: JP2013179785A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-08-30
Filing date: 2013-08-30
Publication date: 2016-12-14
Anticipated expiration: 2033-08-30
Also published as: JP2015049311A

Description

The present invention relates to an information processing apparatus, a speech speed data generation method, and a program that generate speech speed data representing the utterance time of audio output in accordance with video.

Conventionally, for content that includes video, such as movies and television programs, synthesized speech generated by speech synthesis has been output in time with the video.
As a device that outputs synthesized speech in time with video, a speech speed adjustment device (that is, an information processing apparatus) has been proposed that determines an expansion/contraction rate for the speech so that the utterance time length of the synthesized speech matches the program broadcast time, and converts the speech speed of the synthesized speech based on that rate (see Patent Document 1).

Patent Document 1: JP 2012-078755 A

When the device described in Patent Document 1 converts the speech speed, the entire synthesized speech is expanded or contracted, so the utterance time of every word in the sentence uttered by the synthesized speech is expanded or contracted as well.
When this shortens the utterance time, the words in the sentence may become hard to hear. A person who hears words whose utterance time has been shortened therefore has difficulty understanding the content of the utterance as a whole.

In other words, the conventional technique cannot adjust the speech speed of synthesized speech so that the content of the utterance is easy to understand.
An object of the present invention is therefore to make it possible to adjust the speech speed of synthesized speech so that the content of the utterance is easy to understand.

To achieve the above object, the present invention is an information processing apparatus comprising text acquisition means, analysis means, familiarity acquisition means, speech speed determination means, identification information acquisition means, history acquisition means, history analysis means, and update means.

In the present invention, the text acquisition means acquires text data representing a character string of information to be output as audio in accordance with video, and the analysis means analyzes the text data acquired by the text acquisition means and identifies each word included in the character string represented by the text data.

The familiarity acquisition means then acquires, from a familiarity database, the familiarity corresponding to each word identified by the analysis means. The familiarity database here is a database that stores familiarity information, and the familiarity information is information in which each word is associated in advance with a familiarity representing the degree to which that word is recognized.

Further, the speech speed determination means generates speech speed data in which the utterance time of each word is adjusted so that the lower the familiarity acquired for a word by the familiarity acquisition means, the larger the proportion of the utterance time of the entire information represented by the text data that the word's utterance time occupies. The speech speed data here is data representing the utterance time of the synthesized speech output by speech synthesis, and specifically the utterance time of each phoneme constituting the character string of the information represented by the text data.
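The relationship between familiarity and utterance-time share can be sketched as follows. This is only an illustration under stated assumptions: the text above does not give a formula, so the linear weighting, the parameter `alpha`, and the familiarity scale with an upper bound of `max_familiarity` are all hypothetical.

```python
def familiarity_weights(familiarities, max_familiarity=7.0, alpha=0.5):
    """Return one stretch weight per word.

    Hypothetical linear rule: a word at max_familiarity keeps weight 1.0,
    and the weight grows as familiarity drops.
    """
    return [
        1.0 + alpha * (max_familiarity - f) / max_familiarity
        for f in familiarities
    ]


def utterance_shares(base_durations, familiarities):
    """Return each word's share of the total utterance time after weighting."""
    weights = familiarity_weights(familiarities)
    stretched = [d * w for d, w in zip(base_durations, weights)]
    total = sum(stretched)
    return [s / total for s in stretched]


# Three words with equal base durations but different familiarity.
shares = utterance_shares([0.3, 0.3, 0.3], [6.5, 2.0, 5.0])
```

With equal base durations, the second word (familiarity 2.0) ends up with the largest share and the first word (familiarity 6.5) with the smallest.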

In the present invention, the identification information acquisition means also acquires user identification information that identifies a user. The history acquisition means then acquires the viewing information of the user corresponding to the user identification information acquired by the identification information acquisition means, from a use history in which each piece of user identification information is associated with viewing information representing the videos that the corresponding user has viewed in the past.

Further, in the present invention, the history analysis means acquires and analyzes the text data corresponding to each video represented by the viewing information acquired by the history acquisition means, and identifies each word included in the character string represented by each piece of text data. The update means updates the familiarity associated with each identified word in the familiarity information stored in the familiarity database so that the degree of recognition of that word becomes higher.
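The chain from user identification to familiarity update described above might look like the sketch below. It is not the patent's implementation: the dictionary layouts, the fixed `step`, the default familiarity of 1.0 for unseen words, and the whitespace tokenizer (a stand-in for real morphological analysis) are all assumptions made for illustration.

```python
def update_from_history(user_id, use_history, subtitles_by_content,
                        familiarity, tokenize=str.split,
                        step=0.5, max_familiarity=7.0):
    """Raise the familiarity of every word found in the user's viewed content.

    use_history:          {user_id: [content_id, ...]}   (hypothetical layout)
    subtitles_by_content: {content_id: [subtitle, ...]}  (hypothetical layout)
    familiarity:          {word: familiarity value}
    """
    # History acquisition and history analysis: collect the distinct words.
    seen_words = set()
    for content_id in use_history.get(user_id, []):
        for subtitle in subtitles_by_content.get(content_id, []):
            seen_words.update(tokenize(subtitle))

    # Update: raise each seen word once, capped at the assumed maximum.
    updated = dict(familiarity)
    for word in seen_words:
        updated[word] = min(max_familiarity, updated.get(word, 1.0) + step)
    return updated


history = {"u1": ["c1"]}
subtitles = {"c1": ["quantum cat", "cat sleeps"], "c2": ["dog"]}
result = update_from_history("u1", history, subtitles, {"cat": 3.0, "dog": 2.0})
```

Words from content the user has not viewed (here "dog") keep their original familiarity.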

That is, when the audio output in accordance with the video contains a word with a low degree of recognition (that is, low familiarity) and the time spent uttering that word is short, a person who hears the audio may be unable to recognize the content of the information the audio represents.

The information processing apparatus of the present invention therefore generates speech speed data in which the utterance time of each word is adjusted so that the lower the familiarity of a word, the larger the proportion of the total utterance time of the information that the word's utterance time occupies.

If the output speed of the synthesized speech is determined based on such speech speed data, the proportion of the total utterance time of the information that is spent uttering low-familiarity words can be made larger in that synthesized speech.

As a result, a person listening to the synthesized speech can more easily catch even low-familiarity words and can recognize the entire content of the information represented by the utterance.
Moreover, the information processing apparatus of the present invention updates the familiarity information by analyzing the text data corresponding to videos that the user has viewed in the past.

Each word in the audio of a video that the user has viewed in the past is likely to be recognized by the user.
The information processing apparatus of the present invention can therefore use familiarity information that matches each user's recognition of words, and can generate speech speed data better suited to the user.

In other words, the information processing apparatus of the present invention can adjust the speech speed of synthesized speech so that the content of the utterance is easy to understand.
The utterance time referred to here represents the time required for utterance, and includes the concept of speed (speech speed).

The information processing apparatus of the present invention may include word specifying means that specifies, from among the words identified by the analysis means, important words, that is, words belonging to an important part of speech defined in advance as a part of speech of high importance.

In this case, the speech speed determination means of the present invention may generate the speech speed data so that the utterance time of the vowels included in the important words specified by the word specifying means becomes longer.
With this information processing apparatus, speech speed data can be generated so that the utterance time of important words in Japanese becomes longer.

Synthesized speech whose speech speed has been adjusted based on the speech speed data generated by the information processing apparatus of the present invention makes the important words easier to hear and the content of the utterance easier to understand.

Further, the word specifying means of the present invention may treat at least one of nouns and verbs as the important part of speech and specify the words belonging to each important part of speech as important words.
In information output as audio, nouns and verbs carry great weight.

For this reason, in the present invention, at least one of nouns and verbs may be treated as the important part of speech, and the words belonging to each important part of speech may be specified as important words.
With such an information processing apparatus, speech speed data can be generated so that the utterance time of at least one of nouns and verbs becomes longer.

Synthesized speech whose speech speed has been adjusted based on the speech speed data generated by the information processing apparatus of the present invention makes at least one of nouns and verbs easier to hear.
The update means of the present invention may also update the familiarity associated with a word in the familiarity information so that, as the number of times the word identified by the history analysis means has appeared increases, the familiarity at the timing at which the word appears becomes higher.

With such an information processing apparatus, the more often a word appears over the course of the entire video, the higher its familiarity can be made, and speech speed data suited to that video can be generated.
In the present invention, speech synthesis means may, based on the speech speed data generated by the speech speed determination means, synthesize and output speech so that the utterance time of each phoneme constituting each word equals the utterance time represented by the speech speed data.

With such an information processing apparatus, synthesized speech whose speech speed has been adjusted so that the content of the utterance is easy to understand can be output.
Each piece of text data of the present invention may include a required utterance time defined in advance as the length of time that can be spent uttering the character string represented by that text data.

In this case, the speech speed determination means of the present invention may generate, as the speech speed data, data normalized so that the utterance time of the entire information represented by the text data is kept at the required utterance time.
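The normalization described above can be illustrated with a uniform rescaling, which keeps the relative proportions between phonemes while forcing the total to equal the required utterance time. The function name and the choice of uniform scaling are assumptions for illustration; the text only requires that the total be kept at the required utterance time.

```python
def normalize_durations(durations, required_time):
    """Scale durations so that they sum to required_time (both in seconds)."""
    total = sum(durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")
    scale = required_time / total
    return [d * scale for d in durations]


# Durations after familiarity weighting sum to 1.0 s; the subtitle allows 2.0 s.
normalized = normalize_durations([0.2, 0.6, 0.2], required_time=2.0)
```

Because every duration is multiplied by the same factor, a word that was given a larger share by the familiarity weighting keeps that larger share after normalization.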

With such an information processing apparatus, the length of time required to utter the content of the information does not change, so the utterance can be produced at appropriate timing as the video progresses.
The present invention may also be implemented as a speech speed data generation method for generating speech speed data.

The speech speed data generation method of the present invention includes a text acquisition step of acquiring text data, an analysis step of identifying each word included in the character string represented by the acquired text data, a familiarity acquisition step of acquiring the familiarity corresponding to each identified word, and a speech speed determination step of generating speech speed data in which the utterance time of each word is adjusted so that the lower the acquired familiarity of a word, the larger the proportion of the utterance time of the entire information that the word's utterance time occupies. The method further includes an identification information acquisition step of acquiring user identification information, a history acquisition step of acquiring the viewing information of the user corresponding to the acquired user identification information, a history analysis step of acquiring and analyzing the text data corresponding to each video represented by the acquired viewing information and identifying each word included in the character string represented by each piece of text data, and an update step of updating the familiarity associated with each identified word in the familiarity information stored in the familiarity database so that the degree of recognition of that word becomes higher.

Such a speech speed data generation method provides the same effects as the information processing apparatus of the present invention.
The present invention may also be implemented as a program executed by a computer.

The program of the present invention causes a computer to execute a text acquisition procedure of acquiring text data, an analysis procedure of identifying each word included in the character string represented by the text data, a familiarity acquisition procedure of acquiring the familiarity corresponding to each identified word, and a speech speed determination procedure of generating speech speed data in which the utterance time of each word is adjusted so that the lower the acquired familiarity of a word, the larger the proportion of the utterance time of the entire information that the word's utterance time occupies.

The program of the present invention further causes the computer to execute an identification information acquisition procedure of acquiring user identification information, a history acquisition procedure of acquiring the viewing information of the user corresponding to the acquired user identification information, a history analysis procedure of identifying each word included in the character string represented by the text data corresponding to each video represented by the acquired viewing information, and an update procedure of updating the familiarity associated with each identified word in the familiarity information stored in the familiarity database so that the degree of recognition of that word becomes higher.

For example, when the present invention is implemented as a program, it can be used by loading it into a computer from a recording medium and starting it as needed, or by having the computer obtain it over a communication line and start it as needed. Causing the computer to execute each procedure makes that computer function as the information processing apparatus of the present invention.

The recording medium referred to here includes computer-readable electronic media such as a DVD-ROM, a CD-ROM, and a hard disk.

FIG. 1 is a block diagram showing the schematic configuration of an information processing apparatus to which the present invention is applied and of its surroundings.
FIG. 2 is an explanatory diagram illustrating the structure of the text data.
FIG. 3 is a flowchart showing the procedure of the speech speed data generation process.
FIG. 4 is an explanatory diagram illustrating information generated in the course of the speech speed data generation process.
FIG. 5 is a flowchart showing the procedure of the familiarity update process.

Embodiments of the present invention will be described below with reference to the drawings.
<Content viewing system>
The content viewing system 1 shown in FIG. 1 is a system with which a user views content prepared in advance, and includes an information processing server 10 and at least one information processing apparatus 30.
<Information processing server>
The information processing server 10 is a server that stores various data, and includes a communication unit 12, a control unit 14, and a storage unit 22.

The various data stored in the information processing server 10 include at least content data CD containing the video and audio to be output, sound source data SV containing at least audio feature quantities of audio input in advance, user history data HD on the history of content viewed by users of the content viewing system 1, and word familiarity data DD in which each word is associated with a familiarity representing the degree to which the word is recognized.

The communication unit 12 lets the information processing server 10 communicate with external devices via a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.
The control unit 14 is built around a well-known computer having at least a ROM 16, a RAM 18, and a CPU 20, and controls the communication unit 12 and the storage unit 22.

The ROM 16 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 18 temporarily stores processing programs and data. The CPU 20 executes various processes according to the processing programs stored in the ROM 16 and the RAM 18.

The storage unit 22 is a readable and writable non-volatile storage device, such as a hard disk drive or flash memory. The storage unit 22 stores the content data CD, the sound source data SV, the user history data HD, and the word familiarity data DD.

Of these, the content data CD is data prepared in advance for each piece of content.
Content here means a production in which at least images (video) and audio are output along a time axis. Movies and television programs are examples of such productions.

The content data CD includes video data IM, speech audio data SD, and dialogue text data TD. The symbol "m" in FIG. 1 identifies each piece of content data CD.

The video data IM is data consisting of the multiple images that make up the video (moving image) output in the content.
The speech audio data SD is audio data output in accordance with the video represented by the video data IM, for example the dialogue lines and narration spoken along with the video. The speech audio data SD in this embodiment may be prepared for each dialogue line or narration in the video, or for each unit section defined in advance along the time axis of the video.

The dialogue text data TD is text data representing the content of the audio output in accordance with the video represented by the video data IM. As shown in FIG. 2, the dialogue text data TD includes casting information, caption information, and timing information.

Of these, the caption information is the caption (text) output in accordance with the video. The caption expresses the content of dialogue lines, narration, and the like as a character string. The caption language in this embodiment is Japanese.

The casting information identifies the person who should read each caption aloud, and is defined for each caption. It may be information that identifies the person directly, or information that represents characteristics of the person such as gender and age.

The timing information defines, for each caption, a start timing at which output of the caption represented by the caption information begins and an end timing at which that output ends. The start timing and end timing are associated with the progress of time in the video data IM.

The timing information further includes a required utterance time, defined as the length of time that can be spent reading aloud the entire character string represented by the caption information included in the dialogue text data TD.
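The fields described above can be pictured as a small record type. The sketch below is only one possible layout: the class name, field names, and the use of seconds as the time unit are assumptions, and the actual format shown in FIG. 2 may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogueText:
    """One caption entry of the dialogue text data (hypothetical layout)."""
    cast: str                       # casting information: who reads the caption
    caption: str                    # caption information: the text to be spoken
    start: float                    # start timing on the video time axis, seconds
    end: float                      # end timing on the video time axis, seconds
    required_utterance_time: float  # time available for reading the caption aloud

    def __post_init__(self):
        # Basic consistency checks on the timing information.
        if self.end < self.start:
            raise ValueError("end timing must not precede start timing")
        if self.required_utterance_time <= 0:
            raise ValueError("required utterance time must be positive")


line = DialogueText(
    cast="narrator",
    caption="こんにちは",
    start=1.0,
    end=3.5,
    required_utterance_time=2.5,
)
```

The required utterance time is stored as its own field rather than derived from the start and end timings, because the text defines it separately.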

The dialogue text data TD in this embodiment is prepared for each caption output in accordance with the video.
The sound source data SV is data in which speech parameters and tag data are associated for each sound source. A speech parameter is at least one feature quantity representing the waveform of a sound uttered by a person. These feature quantities are the speech features used for so-called formant synthesis, and are prepared for each speaker and for each phoneme. The speech parameters include at least the fundamental frequency F0, the mel-frequency cepstral coefficients (MFCC), the phoneme length, and the power of each phoneme in the uttered speech, together with their time differences.

The tag data represents the nature of the sound represented by the speech parameters, and includes at least speaker feature data representing characteristics of the speaker, such as the speaker's gender and age.

The tag data may further include expression data representing the speaker's expression when the speech was uttered. The expression data represents expression as a concept that includes at least emotions, moods, scenes, and situations, and may include the information needed to estimate the speaker's expression.

The sound source data SV, which associates these speech parameters with the tag data, may for example be generated and registered in the storage unit 22 by having a well-known karaoke device execute a predefined process when a song is sung on that device.

The symbol "n" in FIG. 1 identifies each piece of sound source data SV.
The user history data HD is data representing the history of content viewed by users of the content viewing system 1. The symbol "L" in FIG. 1 identifies each piece of user history data HD.

The user history data HD associates, for each user, a user ID that identifies the user with the content IDs that identify each piece of content the user has viewed.

In the user history data HD, each time a piece of content is browsed, the content may be regarded as viewed and its content ID associated with the user ID. The content ID need not be associated with the user ID only when the content is browsed; it may instead be associated when the content is purchased.
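A minimal sketch of how such a user history could be kept is shown below. The class and method names are hypothetical, and skipping duplicate content IDs is an assumption made here for simplicity; the text does not say whether repeated views are recorded more than once.

```python
from collections import defaultdict


class UserHistory:
    """Maps each user ID to the content IDs that user has viewed or purchased."""

    def __init__(self):
        self._viewed = defaultdict(list)

    def record(self, user_id, content_id):
        """Associate content_id with user_id, for example on browse or purchase."""
        if content_id not in self._viewed[user_id]:
            self._viewed[user_id].append(content_id)

    def viewed(self, user_id):
        """Return the content IDs recorded for user_id, in recording order."""
        return list(self._viewed.get(user_id, []))


history = UserHistory()
history.record("u1", "c1")   # content browsed
history.record("u1", "c2")   # content purchased
history.record("u1", "c1")   # repeat view, ignored in this sketch
```

The same `record` call serves both triggers mentioned in the text, since only the moment at which it is invoked differs between browsing and purchasing.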

The word familiarity data DD is data in which each word is associated in advance with a familiarity representing the degree to which the word is recognized. The higher the degree of recognition, the larger the familiarity value. The word familiarity data DD is an example of the familiarity information recited in the claims.

The word familiarity data DD in this embodiment may store the degree of recognition of each word for each user. In this embodiment, the storage unit 22 in which the word familiarity data DD is stored functions as the familiarity database.
<Information processing apparatus>
The information processing apparatus 30 includes a communication unit 31, an input reception unit 32, a display unit 33, a sound input unit 34, a sound output unit 35, a storage unit 36, and a control unit 40.

The information processing apparatus 30 in this embodiment may be, for example, a well-known portable terminal or a well-known information processing apparatus such as a personal computer. Portable terminals include well-known electronic book readers and portable information terminals such as mobile phones and tablets.

The communication unit 31 exchanges information with external devices via the communication network. The input reception unit 32 receives information entered through an input device (not shown). The display unit 33 displays images based on signals from the control unit 40.

The sound input unit 34 is a device, for example a microphone, that converts sound into an electric signal and inputs the signal to the control unit 40. The sound output unit 35 is a well-known device that outputs sound and includes, for example, a PCM sound source and a speaker. The storage unit 36 is a readable and writable non-volatile storage device that stores various processing programs and data.

The control unit 40 is built around a well-known computer having at least a ROM 41, a RAM 42, and a CPU 43. The ROM 41 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 42 temporarily stores processing programs and data. The CPU 43 executes various processes according to the processing programs stored in the ROM 41 and the RAM 42.

That is, based on the content data CD corresponding to the designated content, the information processing apparatus 30 displays the video of the designated content on the display unit 33 and outputs audio from the sound output unit 35 in step with the time axis of the video. The designated content is the content specified by information received by the input receiving unit 32.

When outputting the audio of the designated content, the information processing apparatus 30 synthesizes the Japanese subtitles (text) represented by the dialogue text data CD into speech, using the sound source data SV stored in the information processing server 10, and outputs the resulting synthesized speech. In other words, the information processing apparatus 30 of the present embodiment is configured to be able to dub voices.

The ROM 41 of the information processing apparatus 30 stores a processing program that causes the control unit 40 to execute speech speed data generation processing, which generates speech speed data representing the utterance time of the synthesized speech output by speech synthesis.
<Speech speed data generation processing>
The speech speed data generation processing executed by the control unit 40 of the information processing apparatus 30 starts when an activation command is input.

As shown in FIG. 3, when the speech speed data generation processing starts, the control unit 40 first acquires the Japanese dialogue text data CD of the designated content (S110). The control unit 40 then performs morphological analysis on the text represented by the dialogue text data CD acquired in S110 and derives morpheme information (S120). A well-known technique (for example, "MeCab") may be used for the morphological analysis in S120.

The morpheme information includes the morpheme mo(k), the morpheme phoneme count ph_nu(k), the phonemes ph(k, j), and the part-of-speech flag pa(k).
Of these, the morpheme mo(k) is each morpheme mo contained in the text represented by the dialogue text data CD. The symbol "k" is an index number that identifies each morpheme mo in the text, and is assigned in order along the time axis of the dialogue text data CD.

The phonemes ph(k, j) are the individual phonemes that make up each morpheme mo(k). The symbol "j" is an index number that identifies each phoneme contained in a morpheme mo(k), and is assigned along the time axis of the text. The morpheme phoneme count ph_nu(k) is the number of phonemes ph that make up each morpheme mo(k).

Further, the part-of-speech flag pa(k) indicates whether the part of speech of each morpheme mo(k) (word) is a noun or a verb. The flag pa(k) is set to "1" if the part of speech is a noun or a verb, and to "0" if it is neither a noun nor a verb.

For example, when the text represented by the dialogue text data CD is 「明日は晴れですね」 ("Tomorrow will be sunny, won't it"), morphological analysis of the text yields morpheme information that includes the morphemes mo(k) shown in FIG. 4 (明日/は/晴れ/ですね in the figure) and the phonemes ph(k, j) (asu/wa/hare/desne in the figure).
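As an illustration only, the morpheme information for this example sentence could be held in a structure like the one below. The class name `Morpheme`, the treatment of each romanized letter as one phoneme, and the part-of-speech flags assigned to each morpheme are assumptions made for this sketch. The segmentation and romanization follow the text above, and in practice the values would come from a morphological analyzer rather than being written by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    """Hypothetical container for the morpheme information of one mo(k)."""
    mo: str        # surface form of the morpheme mo(k)
    ph: List[str]  # phonemes ph(k, j), in utterance order
    pa: int        # part-of-speech flag pa(k): 1 for noun or verb, otherwise 0

    @property
    def ph_nu(self) -> int:
        # Morpheme phoneme count ph_nu(k): number of phonemes in this morpheme.
        return len(self.ph)


# Hand-written analysis of the example sentence (assumed flags, one letter per phoneme).
morphemes = [
    Morpheme("明日", list("asu"), 1),     # noun
    Morpheme("は", list("wa"), 0),        # particle
    Morpheme("晴れ", list("hare"), 1),    # noun
    Morpheme("ですね", list("desne"), 0),  # auxiliary + particle
]

for k, m in enumerate(morphemes):
    print(k, m.mo, m.ph, m.ph_nu, m.pa)
```

Keeping the phoneme count as a derived property avoids storing a value that could drift out of step with the phoneme list.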

Further, in the speech speed data generation processing, the control unit 40 acquires, from the storage unit 22 of the information processing server 10, the familiarity corresponding to each morpheme (word) mo(k) contained in the morpheme information derived in S120 (S130).

Next, the control unit 40 determines whether each phoneme ph(k, j) is a vowel and sets the vowel flag vw(k, j) accordingly (S140). Specifically, as shown in FIG. 4, in S140 the vowel flag vw(k, j) is set to "1" if the phoneme ph(k, j) of a morpheme mo(k) is a vowel, and to "0" if the phoneme ph(k, j) is a consonant.
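The vowel-flag step in S140 can be sketched as follows. This is a minimal illustration that assumes phonemes are given as single romanized letters and that the vowel set is the five Japanese vowels; the function name `vowel_flags` is hypothetical.

```python
# Assumed vowel inventory for romanized Japanese phonemes.
VOWELS = frozenset({"a", "i", "u", "e", "o"})


def vowel_flags(phonemes):
    """Return vw(k, j) for one morpheme: 1 for a vowel, 0 for a consonant."""
    return [1 if p in VOWELS else 0 for p in phonemes]


flags = vowel_flags(list("hare"))
print(flags)
```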

Further, the control unit 40 sets an initial value for the phoneme length ratio Ph_lr(k, j) (S150). The phoneme length ratio Ph_lr(k, j) is the proportion of the time needed to read out the entire text represented by the dialogue text data CD (the utterance time length) that is taken up by reading out each phoneme ph(k, j).

Specifically, in S150 of the present embodiment, the initial value of the phoneme length ratio Ph_lr(k, j) is set to "1" if the phoneme ph(k, j) is a vowel, and to the "specified value p" if the phoneme ph(k, j) is a consonant. The specified value p is a predefined value that is greater than "0" and less than "1".
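A minimal sketch of the initialization in S150 is shown below. The concrete value `p=0.5` is an arbitrary placeholder, since the text only requires 0 < p < 1, and the function name and single-letter phoneme representation are assumptions.

```python
def initial_ratios(phonemes, p=0.5):
    """Initial Ph_lr(k, j): 1.0 for vowels, the specified value p for consonants."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be greater than 0 and less than 1")
    vowels = {"a", "i", "u", "e", "o"}  # assumed vowel inventory
    return [1.0 if ph in vowels else p for ph in phonemes]


print(initial_ratios(list("wa"), p=0.4))
```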

Next, based on the part-of-speech flags contained in the morpheme information, the control unit 40 identifies the important words among the morphemes mo(k) (words) derived in S120 (S160). An important word is a word whose part of speech is one of the important parts of speech, which are predefined as parts of speech of high importance. In the present embodiment, the important parts of speech are verbs and nouns.

The control unit 40 then updates the phoneme length ratio Ph_lr(k, j) of every phoneme ph(k, j) that is a vowel within each morpheme mo(k) identified as an important word in S160 (S170). The update in S170 is performed according to equation (1), and lengthens only the phoneme length ratios Ph_lr(k, j) of the vowels contained in important words. In equation (1), α is a predefined constant.

That is, in S170 of the present embodiment, the time length for uttering a phoneme ph(k, j) whose part-of-speech flag pa(k) is "1" and whose vowel flag vw(k, j) is "1" is multiplied by "1 + α/100".
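The update in S170 can be sketched as below, reading the statement above as Ph_lr(k, j) ← Ph_lr(k, j) × (1 + α/100) for vowels of important words. The function name and the per-morpheme calling convention are assumptions, and the value of α used in the example is arbitrary.

```python
def lengthen_important_vowels(ratios, vw, pa, alpha):
    """Scale Ph_lr(k, j) by (1 + alpha/100) where pa(k) == 1 and vw(k, j) == 1.

    ratios: phoneme length ratios of one morpheme
    vw:     vowel flags of the same phonemes
    pa:     part-of-speech flag of the morpheme (1 for noun or verb)
    alpha:  predefined constant, interpreted as a percentage
    """
    factor = 1.0 + alpha / 100.0
    return [r * factor if (pa == 1 and v == 1) else r
            for r, v in zip(ratios, vw)]


# "hare" (noun): only the two vowels are lengthened.
print(lengthen_important_vowels([0.5, 1.0, 0.5, 1.0], [0, 1, 0, 1], pa=1, alpha=20))
```

Consonants and all phonemes of non-important words are returned unchanged, so the relative weight of important-word vowels rises once the ratios are later normalized.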

Further, the control unit 40 first acquires the familiarity of each morpheme mo(k) from the information processing server 10 and calculates the normalized familiarity nr_fa(k) from the acquired values (S180). The normalized familiarity nr_fa(k) is the familiarity of each morpheme mo(k) normalized so that, across the morphemes mo(k), the mean of the familiarity is "1" and the variance is "1".
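The text does not give the formula used for this normalization. One way to obtain a mean of 1 and a variance of 1 is to standardize the values and shift them by 1, as in the sketch below. The use of the population variance, the handling of the zero-variance case, and the function name are all assumptions.

```python
from statistics import mean, pstdev


def normalize_familiarity(values):
    """Rescale familiarity values so that their mean is 1 and population variance is 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All words equally familiar: variance cannot be rescaled, so return the mean target.
        return [1.0 for _ in values]
    return [(v - mu) / sigma + 1.0 for v in values]


nr_fa = normalize_familiarity([2.0, 4.0, 6.0])
print(nr_fa)
```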

In S180, the control unit 40 further calculates the scaling factor β(k) according to equation (2) and, according to equation (3), corrects the phoneme length ratio Ph_lr(k, j) of the vowels contained in each morpheme.

That is, through S180, the phoneme length ratio Ph_lr(k, j) of the vowels in a word with low familiarity is corrected so that the time needed to read that word accounts for a larger share of the time needed to read the entire information.

Next, the control unit 40 derives the phoneme time length Ph_le(k, j) of each phoneme ph(k, j) so that the utterance time of the entire text represented by the dialogue text data CD is kept at the required utterance time (S190).

Specifically, in S190 of the present embodiment, the phoneme time length Ph_le(k, j) of each phoneme ph(k, j) is derived according to equation (4).

The denominator in equation (4) is the sum of the phoneme length ratios Ph_lr(k, j) of all phonemes ph(k, j) contained in the dialogue text data CD. The symbol "tol" in equation (4) is the required utterance time, and the symbol "N" is the number of phonemes ph contained in the dialogue text data CD.
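Equation (4) itself is not reproduced in this text. From the description of its denominator and of tol, a form consistent with the surrounding explanation would be the following. This is a reconstruction under that assumption, with the sum running over all N phonemes of the text, and not a verbatim copy of the original equation.

```latex
\[
  \mathit{Ph\_le}(k, j)
  \;=\;
  \mathit{tol} \times
  \frac{\mathit{Ph\_lr}(k, j)}
       {\displaystyle\sum_{(k', j')} \mathit{Ph\_lr}(k', j')}
\]
```

With this form, summing Ph_le(k, j) over all phonemes gives exactly tol, which matches the statement that the total reading time is kept at the required utterance time.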

That is, the phoneme time lengths Ph_le(k, j) are normalized so that the total time taken to read out the subtitles represented by the dialogue text data CD is kept at the required utterance time of that dialogue text data CD.
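The normalization described here can be sketched as follows, assuming each phoneme's time length is its share of the summed ratios multiplied by the required utterance time. The function name and the flat list representation of the ratios are assumptions.

```python
def phoneme_durations(ratios, tol):
    """Distribute the required utterance time tol over phonemes in proportion to Ph_lr."""
    total = sum(ratios)
    if total <= 0:
        raise ValueError("the sum of phoneme length ratios must be positive")
    return [tol * r / total for r in ratios]


# Four phonemes sharing a required utterance time of 2.0 seconds.
durations = phoneme_durations([1.0, 0.5, 1.5, 1.0], tol=2.0)
print(durations, sum(durations))
```

Because every ratio is divided by the same total, lengthening some phonemes automatically shortens the others, and the overall duration stays fixed.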

Next, the control unit 40 generates speech speed data in which the phoneme time lengths Ph_le(k, j) derived in S190 are defined as data representing the timing at which each phoneme ph(k, j) of each morpheme mo(k) is read out (S200).
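One way to turn phoneme time lengths into read-out timings is to accumulate them into start times, as sketched below. The text does not specify the layout of the speech speed data, so the function name, the offset parameter, and the list-of-start-times representation are assumptions.

```python
from itertools import accumulate


def phoneme_start_times(durations, start=0.0):
    """Return the start time of each phoneme, given its duration and an initial offset."""
    # accumulate(..., initial=start) yields start, start+d0, start+d0+d1, ...
    boundaries = list(accumulate(durations, initial=start))
    return boundaries[:-1]  # drop the final boundary, which is the end of the last phoneme


print(phoneme_start_times([0.5, 0.25, 0.75]))
```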

Further, based on each item of casting information contained in the dialogue text data CD acquired in S110, the control unit 40 acquires the sound source data SV that best matches that casting information (S210).

The control unit 40 then uses the sound source data SV acquired in S210 to synthesize speech for the content of the subtitle information contained in the dialogue text data CD acquired in S110 (S220). In S220 of the present embodiment, the read-out timing (speed) of each phoneme in the text represented by the subtitle information is determined based on the speech speed data generated in S200.

In S220 of the present embodiment, the control unit 40 also outputs a control signal to the sound output unit 35 so that the synthesized speech generated by the speech synthesis is output from the sound output unit 35.
The speech speed data generation processing then ends. The processing is started again in time with the output of the next video data IM along the time axis, and the next dialogue text data TD along the time axis of that video data IM is acquired (S110). S120 to S220 are then executed.

In short, the speech speed data generation processing of the present embodiment acquires the dialogue text data TD of the designated content and performs morphological analysis on it. Then, based on the word familiarity data stored in the information processing server 10, it determines the familiarity of each morpheme (word) identified by the morphological analysis.

Furthermore, the speech speed data generation processing generates the speech speed data so that the lower a word's familiarity, the larger the share of the total reading time that is devoted to reading that word.
<Familiarity update processing>
Next, the familiarity update processing executed by the control unit 14 of the information processing server 10 is described.

The familiarity update processing is started in time with the start of the speech speed data generation processing.
As shown in FIG. 5, when the familiarity update processing starts, the control unit 14 first acquires the user ID entered via the input receiving unit 32 of the information processing apparatus 30 (S310).

続いて、親密度更新処理では、制御部１４は、利用者履歴データＨＤにおいて、Ｓ３１０にて取得された利用者ＩＤと対応付けられている全てのコンテンツＩＤを取得する（Ｓ３２０）。 Subsequently, in the familiarity update process, the control unit 14 acquires all content IDs associated with the user ID acquired in S310 in the user history data HD (S320).

Further, the control unit 14 acquires all of the Japanese dialogue text data TD corresponding to each of the content IDs acquired in S320 (S330).

Next, the control unit 14 performs morphological analysis on the text represented by each item of dialogue text data TD acquired in S330 and derives morpheme information (S340). A well-known technique (for example, "MeCab") may be used for the morphological analysis in S340. The morpheme information here includes at least the morphemes mo(k) (words).

The control unit 14 then updates the word familiarity data DD based on the morphemes mo(k) derived in S340 (S350). Specifically, in S350 of the present embodiment, the number of appearances is counted for each distinct morpheme mo, and the word familiarity data DD is updated so that morphemes mo (words) with more appearances receive a higher familiarity.

The familiarity may be updated by adding, to the familiarity before the update, the number of appearances multiplied by a predefined coefficient. The update may also be restricted to morphemes mo whose part of speech is an independent (content) word, with dependent (function) words excluded.
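The additive update described here can be sketched as follows. The coefficient value, the starting familiarity of 0.0 for words not yet in the data, and the function name are assumptions, and a dictionary stands in for the word familiarity data DD.

```python
from collections import Counter


def update_familiarity(familiarity, words, coeff=0.1):
    """Add (number of appearances x coeff) to each word's familiarity.

    familiarity: mapping from word to its familiarity before the update
    words:       morphemes (words) extracted from the dialogue text of viewed content
    coeff:       predefined coefficient (placeholder value)
    """
    updated = dict(familiarity)  # leave the caller's mapping untouched
    for word, count in Counter(words).items():
        updated[word] = updated.get(word, 0.0) + coeff * count
    return updated


result = update_familiarity({"晴れ": 3.0}, ["晴れ", "明日", "晴れ"], coeff=0.5)
print(result)
```

To exclude function words, the list passed as `words` would be filtered by part of speech before the call.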

The familiarity update processing then ends.
In short, in the familiarity update processing of the present embodiment, the control unit 14 updates the word familiarity data DD stored in the storage unit 22 so that morphemes mo (words) that appear more often across the content viewed by the user receive a higher familiarity.
[Effects of the embodiment]
As described above, the speech speed data generation processing of the present embodiment generates the speech speed data so that the lower a word's familiarity, the larger the share of the total reading time occupied by that word's reading time.

This is because, if the time spent reading out a word with a low degree of recognition (that is, low familiarity) is short, a person listening to the audio output along with the video may be unable to grasp the information that the audio conveys.

That is, if the start timing of each phoneme in the synthesized speech is determined based on the speech speed data generated by the speech speed data generation processing of the present embodiment, the share of the total reading time devoted to words with low familiarity can be increased in that synthesized speech.

この結果、親密度が低い単語であっても、合成音声を聴いた人物が聴き取りやすくなり、その人物は、発声によって表される情報の内容全体を認識することができる。
換言すれば、情報処理装置３０においては、合成音声において、発声の内容を理解しやすくなるように、読み上げ速度（即ち、話速）を調整できる。 As a result, even if the word has a low familiarity, it is easy for a person who has listened to the synthesized speech to hear, and the person can recognize the entire content of the information represented by the utterance.
In other words, the information processing apparatus 30 can adjust the reading speed (that is, speaking speed) so that the content of the utterance can be easily understood in the synthesized speech.

ところで、通常、日本語の音声にて表される情報では、名詞及び動詞が大きな重みを有する。このため、本実施形態の話速データ生成処理では、名詞及び動詞を重要品詞とし、重要品詞それぞれに対応する重要単語に対する読み上げ時間が長くなるように話速データを生成している。 By the way, normally, in information expressed in Japanese speech, nouns and verbs have large weights. For this reason, in the speech speed data generation processing according to the present embodiment, the noun and the verb are important parts of speech, and the speech speed data is generated so that the reading time for the important words corresponding to each of the important parts of speech becomes long.

With synthesized speech whose speed has been adjusted based on speech speed data generated in this way, words of the important parts of speech become easier to hear and the content of the utterance becomes easier to understand.
In addition, the speech speed data generation processing of the present embodiment generates, as the speech speed data, data normalized so that the time needed to read out the entire information represented by one item of dialogue text data CD is kept at the required utterance time.

このため、話速データ生成処理によれば、字幕を読み上げる時間長が予め規定された時間長から変更されることを防止でき、映像の進行に合わせた適切なタイミングで字幕の読み上げを実現できる。 For this reason, according to the speech speed data generation process, it is possible to prevent the time length for reading out the subtitle from being changed from a predetermined time length, and to read out the subtitle at an appropriate timing according to the progress of the video.

In the present embodiment, the familiarity update processing updates the familiarity of each word so that it rises with the number of times the word has appeared in the subtitles of content the user has viewed.

このような親密度更新処理によれば、利用者が視聴したコンテンツにて登場する回数が多いほど、親密度を高くできる。この結果、コンテンツ視聴システム１によれば、利用者ごとの認識度を反映した単語親密度データを生成でき、利用者の知識に応じた話速データを生成できる。
［その他の実施形態］
以上、本発明の実施形態について説明したが、本発明は上記実施形態に限定されるものではなく、本発明の要旨を逸脱しない範囲において、様々な態様にて実施することが可能である。 According to such a familiarity update process, the familiarity can be increased as the number of appearances in the content viewed by the user increases. As a result, according to the content viewing system 1, word familiarity data reflecting the recognition degree for each user can be generated, and speech speed data according to the user's knowledge can be generated.
[Other Embodiments]
As mentioned above, although embodiment of this invention was described, this invention is not limited to the said embodiment, In the range which does not deviate from the summary of this invention, it is possible to implement in various aspects.

For example, in the speech speed data generation processing of the above embodiment, both nouns and verbs are treated as important parts of speech, but the important parts of speech may be at least one of nouns and verbs.
Also, in the above embodiment, the control unit 40 of the information processing apparatus 30 executes the speech speed data generation processing, but the device that executes it is not limited to the information processing apparatus 30 and may be the information processing server 10.

In that case, when performing speech synthesis to read out subtitles based on the dialogue text data TD, the information processing apparatus 30 may acquire the speech speed data from the information processing server 10 and determine the speech speed from it.

Also, in the above embodiment, the information processing server 10 executes the familiarity update processing, but the device that executes it is not limited to the information processing server 10 and may be the information processing apparatus 30.

A form in which part of the configuration of the above embodiment is omitted, to the extent that the problem can still be solved, is also an embodiment of the present invention. A form obtained by appropriately combining the above embodiment and the modifications is also an embodiment of the present invention. Furthermore, every form conceivable without departing from the essence of the invention as specified by the wording of the claims is an embodiment of the present invention.
[Correspondence between the embodiment and the claims]
Finally, the relationship between the description of the above embodiment and the description of the claims is explained.

The function obtained by executing S110 of the speech speed data generation processing corresponds to the text acquisition means recited in the claims, and the function obtained by executing S120 corresponds to the analysis means. The function obtained by executing S130 of the speech speed data generation processing corresponds to the familiarity acquisition means recited in the claims, and the function obtained by executing S140 to S200 corresponds to the speech speed determining means.

The function obtained by executing S310 of the familiarity update processing corresponds to the identification information acquisition means recited in the claims, and the function obtained by executing S320 corresponds to the history acquisition means. The function obtained by executing S330 and S340 of the familiarity update processing corresponds to the history analysis means recited in the claims, and the function obtained by executing S350 corresponds to the updating means.

Further, the function obtained by executing S160 of the speech speed data generation processing corresponds to the word specifying means recited in the claims, and the function obtained by executing S210 and S220 corresponds to the speech synthesis means.

DESCRIPTION OF SYMBOLS: 1: content viewing system; 10: information processing server; 12: communication unit; 14: control unit; 16: ROM; 18: RAM; 20: CPU; 22: storage unit; 22: storage device; 30: information processing apparatus; 31: communication unit; 32: input receiving unit; 33: display unit; 34: sound input unit; 35: sound output unit; 36: storage unit; 40: control unit; 41: ROM; 42: RAM; 43: CPU

Claims

An information processing apparatus comprising:
text acquisition means for acquiring text data representing a character string of information to be output as audio in accordance with video;
analysis means for analyzing the text data acquired by the text acquisition means and identifying each word included in the character string represented by the text data;
familiarity acquisition means for acquiring the familiarity corresponding to each word identified by the analysis means from a familiarity database storing familiarity information in which each word is associated in advance with a familiarity representing the degree of recognition of that word;
speech speed determining means for generating speech speed data, which is data used for speech synthesis that represents the utterance time of the synthesized speech and represents the utterance time of each phoneme constituting the character string of the information represented by the text data, the utterance time of each word being adjusted so that the lower the familiarity acquired by the familiarity acquisition means for a word, the larger the proportion of the utterance time of the entire information represented by the text data that is occupied by the utterance time of that word;
identification information acquisition means for acquiring user identification information that identifies a user;
history acquisition means for acquiring, from a usage history in which each item of user identification information is associated with viewing information representing videos previously viewed by the corresponding user, the viewing information of the user corresponding to the user identification information acquired by the identification information acquisition means;
history analysis means for acquiring and analyzing the text data corresponding to each video represented by the viewing information acquired by the history acquisition means, and identifying each word included in the character string represented by each item of text data; and
updating means for updating the familiarity associated with each word identified by the history analysis means, in the familiarity information stored in the familiarity database, so that the degree of recognition of that word increases.

The information processing apparatus according to claim 1, further comprising word specifying means for identifying, among the words identified by the analysis means, important words, which are words corresponding to important parts of speech predefined as parts of speech of high importance,
wherein the speech speed determining means generates the speech speed data so that the utterance time of the vowels included in the important words identified by the word specifying means becomes longer.

The information processing apparatus according to claim 2, wherein the word specifying means treats at least one of nouns and verbs as the important part of speech and identifies the words corresponding to each important part of speech as the important words.

The information processing apparatus according to any one of claims 1 to 3, wherein the updating means updates the familiarity associated with each word in the familiarity information so that the familiarity increases as the number of appearances of the word identified by the history analysis means increases.

The information processing apparatus according to claim 1, further comprising speech synthesis means for synthesizing and outputting speech, based on the speech speed data generated by the speech speed determining means, such that the utterance time of each phoneme constituting each word equals the utterance time represented by the speech speed data.

Each of the text data includes a required utterance time defined in advance as a time length that can be applied to the utterance of the character string represented by the text data,
The speech speed determining means is
The data normalized so that the utterance time of the entire information represented by the text data is maintained at the required utterance time is generated as the speech speed data. The information processing apparatus according to any one of claims.

A text acquisition process for acquiring text data representing a character string of information output by sound according to a video,
Analyzing the text data acquired in the text acquisition process, identifying each word included in the character string represented by the text data; and
A parent that acquires a familiarity corresponding to each word specified in the analysis process from a familiarity database that stores familiarity information in which each word and a familiarity representing the recognition degree of each word are associated in advance. Density acquisition process,
Used for speech synthesis, a data representative of the utterance time of synthesized speech, and data representing the utterance time of each phoneme constituting the string of information represented by the text data and speech speed data, the intimacy degree obtaining The utterance time of the word is set so that the word representing the intimacy acquired in the process has a longer proportion of the utterance time of the word in the utterance time of the entire information represented by the text data. A speech speed determination process for generating the adjusted speech speed data;
An identification information acquisition process for acquiring user identification information for identifying a user;
A history acquisition process of acquiring, from a usage history in which each piece of user identification information is associated with viewing information representing the videos viewed in the past by the corresponding user, the viewing information of the user corresponding to the user identification information acquired in the identification information acquisition process;
A history analysis process of acquiring and analyzing the text data corresponding to each video represented by the viewing information acquired in the history acquisition process, and identifying each word included in the character string represented by each text data; and
An update process of updating the intimacy associated with each word in the intimacy information stored in the intimacy database so that the degree of recognition of each word identified in the history analysis process is increased. A speech speed data generation method characterized by comprising the above processes.
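
As a non-limiting illustration of the speech speed determination process, the following Python sketch assigns each word a share of the total utterance time that grows as its intimacy falls, then divides each word's time evenly among its phonemes. The 1.0 to 7.0 intimacy scale, the `stretch` parameter, the linear weighting formula, and the even split across phonemes are all assumptions introduced for this example; the claims only require that a word with lower intimacy occupies a longer proportion of the overall utterance time.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    phonemes: list   # phoneme symbols of the word (assumed non-empty)
    intimacy: float  # hypothetical scale: 1.0 (unfamiliar) to 7.0 (familiar)


def determine_speech_speed(words, required_time, stretch=1.0, lo=1.0, hi=7.0):
    """Return [(word text, [(phoneme, utterance time), ...]), ...].

    A word's weight is its phoneme count multiplied by a factor that rises
    linearly from 1.0 (highest intimacy) to 1.0 + stretch (lowest intimacy).
    The weights are then normalized so that the word times sum to required_time.
    """
    weights = []
    for w in words:
        unfamiliarity = (hi - w.intimacy) / (hi - lo)  # 0.0 .. 1.0
        weights.append(len(w.phonemes) * (1.0 + stretch * unfamiliarity))
    total = sum(weights)

    speech_speed_data = []
    for w, weight in zip(words, weights):
        word_time = required_time * weight / total
        per_phoneme = word_time / len(w.phonemes)  # assumed: even split
        speech_speed_data.append((w.text, [(p, per_phoneme) for p in w.phonemes]))
    return speech_speed_data


# Example: two words of equal length; the unfamiliar one receives twice the time.
data = determine_speech_speed(
    [Word("sora", ["so", "ra"], 7.0), Word("seiun", ["sei", "un"], 1.0)],
    required_time=3.0,
)
```

With these inputs the familiar word receives 1.0 second and the unfamiliar word 2.0 seconds, so the total remains at the required 3.0 seconds.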

A text acquisition procedure for acquiring text data representing a character string of information output by sound according to a video,
An analysis procedure of analyzing the text data acquired in the text acquisition procedure and identifying each word included in the character string represented by the text data;
An intimacy acquisition procedure of acquiring the intimacy corresponding to each word identified in the analysis procedure from an intimacy database that stores intimacy information in which each word is associated in advance with an intimacy representing the degree of recognition of that word;
A speech speed determination procedure of generating speech speed data, which is data used for speech synthesis that represents the utterance time of the synthesized speech, namely the utterance time of each phoneme constituting the character string of the information represented by the text data, the speech speed data being adjusted so that the lower the intimacy acquired in the intimacy acquisition procedure for a word, the longer the proportion of that word's utterance time in the utterance time of the entire information represented by the text data;
An identification information acquisition procedure for acquiring user identification information for identifying a user;
A history acquisition procedure of acquiring, from a usage history in which each piece of user identification information is associated with viewing information representing the videos viewed in the past by the corresponding user, the viewing information of the user corresponding to the user identification information acquired in the identification information acquisition procedure;
A history analysis procedure for acquiring and analyzing text data corresponding to each video represented by the viewing information acquired in the history acquisition procedure, and identifying each word included in a character string represented by each text data;
An update procedure of updating the intimacy associated with each word in the intimacy information stored in the intimacy database so that the degree of recognition of each word identified in the history analysis procedure is increased. A program characterized by causing a computer to execute the above procedures.
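
As a non-limiting illustration of the history acquisition, history analysis, and update procedures, the following Python sketch raises the stored intimacy of every word that appears in the text data of videos a given user has viewed. The dictionary-based data layout, the whitespace tokenizer, the fixed `step` increment, the `default` starting value, and the upper bound `hi` are assumptions made for this example; the claims state only that the intimacy is updated so that the degree of recognition increases.

```python
def update_intimacy(intimacy_db, usage_history, video_texts, user_id,
                    step=0.1, hi=7.0, default=4.0):
    """Increase the intimacy of words found in the user's viewing history.

    intimacy_db:   dict mapping word -> intimacy (updated in place)
    usage_history: dict mapping user identification information -> list of video IDs
    video_texts:   dict mapping video ID -> text data for that video
    All three structures are hypothetical stand-ins for the databases
    named in the claims.
    """
    # History acquisition: viewing information for the identified user.
    for video_id in usage_history.get(user_id, []):
        text = video_texts.get(video_id, "")
        # History analysis: a whitespace split stands in for word
        # identification; each word counts once per video.
        for word in set(text.split()):
            current = intimacy_db.get(word, default)
            # Update: raise the intimacy, capped at the top of the scale.
            intimacy_db[word] = min(hi, current + step)
    return intimacy_db


# Example: "nebula" appears in two viewed videos and is raised twice.
db = {"nebula": 2.0, "star": 6.95}
history = {"u1": ["v1", "v2"]}
texts = {"v1": "nebula star", "v2": "nebula"}
update_intimacy(db, history, texts, "u1")
```

Capping the value keeps the intimacy within the assumed scale, and counting a word once per video prevents a single long text from dominating the update.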