JP2009210942A

JP2009210942A - Voice reproduction system, voice reproduction method, and program

Info

Publication number: JP2009210942A
Application number: JP2008055577A
Authority: JP
Inventors: Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-03-05
Filing date: 2008-03-05
Publication date: 2009-09-17

Abstract

PROBLEM TO BE SOLVED: To perform voice reproduction processing suited to the content of an utterance, such as the difficulty of the words it contains, and to the quality of the vocalization, such as ambiguous articulation.

SOLUTION: Speech recognition means 101 recognizes the input speech, and recognition feature amount extraction means 102 extracts feature amounts from the result (for example, the recognition result character string, which relates to the utterance content, and the reliability, which relates to the quality of the vocalization). Based on the resulting recognition feature amounts, voice reproduction processing estimation means 103 determines the voice reproduction processing to be executed by referring to a voice reproduction processing estimation model, which associates predetermined recognition feature amounts with voice reproduction processing (for example, speech speed conversion or repeated reproduction of a specified section). Voice reproduction processing means 104 then performs the determined processing.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a voice reproduction system, a voice reproduction method, and a voice reproduction program. More specifically, it relates to a voice reproduction system, a voice reproduction method, and a program that allow the reproduction speed of speech to be changed and a section for repeated reproduction to be set.

Voice reproduction systems are in general use in consumer audio equipment and the like. In recent years they have also been used in so-called "transcription work", in which a person listens to human speech and manually writes out its content as text, for example when preparing the minutes of public meetings such as those of local governments or when creating broadcast subtitles.

In particular, transcription work can be supported by making speech easier to hear, for example by changing the reproduction speed of utterances or by repeatedly reproducing parts that are hard to hear. Examples of conventional voice reproduction systems proposed for this purpose are described in Patent Document 1 and Patent Document 2 below.

Patent Document 1 describes a technique for reproducing speech so that it is easy to hear by providing a voiced section expansion unit. The unit detects phrase sections bounded by silent sections and, for each detected phrase, converts the speech speed over a fixed period, starting from the start point of the first voiced section in the phrase section, at a predetermined expansion ratio and according to a predetermined decreasing function. After the fixed period has elapsed, the speech speed is the one given by the decreasing function at the end of that period. The technique rests on the empirical rule that the beginning of speech is relatively hard to hear, either because it is usually spoken quickly or because the listener cannot predict which words will come first. Slowing down the beginning of speech therefore makes it easier to hear.
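As a non-limiting illustration of the idea attributed to Patent Document 1, the expansion ratio can be viewed as a function of the time elapsed since the start of the first voiced section. The sketch below assumes a linear decreasing function; the function name, the parameter names, and the default values are hypothetical, and the function actually used in the cited document is not given here.

```python
def expansion_ratio(t: float, initial_ratio: float = 1.5,
                    final_ratio: float = 1.0, duration: float = 1.0) -> float:
    """Time-stretch ratio at t seconds after the first voiced section starts.

    Assumption: the ratio decreases linearly from initial_ratio to
    final_ratio over `duration` seconds and then stays at final_ratio.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if t >= duration:
        # After the fixed period, keep the value reached at its end.
        return final_ratio
    return initial_ratio + (final_ratio - initial_ratio) * (t / duration)
```

A ratio above 1.0 stretches the speech (slower reproduction), so under these assumptions the start of each phrase is slowed most and the slowdown fades over the fixed period.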

Patent Document 2 describes a technique for reproducing speech so that it is easy to hear by providing a speech speed calculation unit, which calculates the speech speed of each piece of partial speech data (a part of the speech data), and a speech speed conversion unit, which converts each piece of partial speech data from its calculated speech speed to a predetermined desired speech speed. Speech whose utterance speed varies from part to part is thereby reproduced at a uniform speech speed overall, which makes it easier to hear. Furthermore, Patent Document 3 extracts the pitch frequency information of a whispered voice and converts the whisper into normal speech, and Patent Document 4 changes the speech speed based on the intonation of the original speech.
JP 2003-223200 A
JP 2006-317768 A
JP 2006-119647 A
JP H06-289895 A

However, conventional voice reproduction systems only perform voice reproduction processing based on fixed heuristics, such as using a simple feature like whether a part is the beginning of an utterance, or making the speech speed uniform over the whole utterance, regardless of what is being said or how well it is articulated. They therefore cannot process multiple feature amounts in combination, and cannot perform voice reproduction processing suited to the content of the utterance (for example, the difficulty of the words it contains) or to the quality of the vocalization (for example, ambiguous articulation).

Accordingly, an object of the present invention is to provide a voice reproduction system, a voice reproduction method, and a voice reproduction program that can perform appropriate voice reproduction processing based on the content of an utterance and the quality of its vocalization.

To solve the above problem, the voice reproduction system of the present invention comprises: speech recognition means for recognizing input speech; recognition feature amount extraction means for extracting a recognition feature amount from the speech recognized by the speech recognition means; voice reproduction processing estimation means for selecting a predetermined voice reproduction process according to the recognition feature amount extracted by the recognition feature amount extraction means; and voice reproduction processing means for performing the voice reproduction process selected by the voice reproduction processing estimation means.

The recognition feature amount extraction means extracts linguistic information about the utterance content as a feature amount.

The recognition feature amount extraction means extracts acoustic information about the ambiguity of pronunciation as a feature amount.

The recognition feature amount extraction means extracts a posterior probability as a feature amount.

The recognition feature amount estimation means combines a plurality of feature amounts to estimate the voice reproduction process.

The system further comprises: user operation acquisition means for acquiring user operations; user operation history storage means for accumulating a history of the user operations obtained by the user operation acquisition means; recognition feature amount history storage means for accumulating a history of the recognition feature amounts extracted by the recognition feature amount extraction means; voice reproduction processing estimation model learning means for generating a voice reproduction processing estimation model from the relationship between the user operation history stored in the user operation history storage means and the recognition feature amounts stored in the recognition feature amount history storage means; and voice reproduction processing estimation model storage means for storing the voice reproduction processing estimation model generated by the learning means. The voice reproduction processing estimation means estimates the voice reproduction process with reference to the voice reproduction processing estimation model stored in the voice reproduction processing estimation model storage means.

The voice reproduction method of the present invention comprises: a speech recognition step of recognizing input speech; a recognition feature amount extraction step of extracting a recognition feature amount from the speech recognized in the speech recognition step; a voice reproduction processing estimation step of selecting a predetermined voice reproduction process according to the recognition feature amount extracted in the recognition feature amount extraction step; and a voice reproduction processing step of performing the voice reproduction process selected in the voice reproduction processing estimation step.

The recognition feature amount extraction step extracts linguistic information about the utterance content as a feature amount.

The recognition feature amount extraction step extracts acoustic information about the ambiguity of pronunciation as a feature amount.

The recognition feature amount extraction step extracts a posterior probability as a feature amount.

The recognition feature amount estimation step combines a plurality of feature amounts to estimate the voice reproduction process.

The method further comprises: a user operation acquisition step of acquiring user operations; a user operation history accumulation step of accumulating a history of the user operations obtained in the user operation acquisition step; a recognition feature amount history accumulation step of accumulating a history of the recognition feature amounts extracted in the recognition feature amount extraction step; a voice reproduction processing estimation model learning step of generating a voice reproduction processing estimation model from the relationship between the user operation history stored in the user operation history accumulation step and the recognition feature amounts stored in the recognition feature amount history accumulation step; and a voice reproduction processing estimation model storage step of storing the voice reproduction processing estimation model generated in the learning step. The voice reproduction processing estimation step estimates the voice reproduction process with reference to the voice reproduction processing estimation model stored in the voice reproduction processing estimation model storage step.

The program of the present invention causes a computer to execute: a process of recognizing input speech; a process of extracting a recognition feature amount from the speech recognized by the recognizing process; a process of selecting a predetermined voice reproduction process according to the recognition feature amount extracted by the extracting process; and a process of performing the voice reproduction process selected by the selecting process.

As the process of extracting a recognition feature amount, the program causes the computer to execute a process of extracting linguistic information about the utterance content as a feature amount.

As the process of extracting a recognition feature amount, the program causes the computer to execute a process of extracting acoustic information about the ambiguity of pronunciation as a feature amount.

As the process of extracting a recognition feature amount, the program causes the computer to execute a process of extracting a posterior probability as a feature amount.

As the process of selecting a voice reproduction process, the program causes the computer to execute a process of combining a plurality of feature amounts to estimate the voice reproduction process.

The program further causes the computer to execute: a process of acquiring user operations; a process of accumulating a history of the user operations so acquired; a process of accumulating a history of the recognition feature amounts extracted by the extracting process; a process of generating a voice reproduction processing estimation model from the relationship between the stored user operation history and the stored recognition feature amounts; and a process of storing the generated voice reproduction processing estimation model. The process of selecting a voice reproduction process estimates the voice reproduction process with reference to the stored voice reproduction processing estimation model.

According to the present invention, appropriate voice reproduction processing can be performed based on the content of an utterance and the quality of its vocalization. The invention is applicable to voice reproduction apparatuses for transcription support and also to audio equipment such as televisions and radios.

As shown in FIG. 1, an embodiment of the present invention comprises speech recognition means 101, recognition feature amount extraction means 102, voice reproduction processing estimation means 103, and voice reproduction processing means 104.

These means operate as follows.
The speech recognition means 101 recognizes the speech to be reproduced.
The recognition feature amount extraction means 102 extracts, from the speech recognition result obtained from the speech recognition means 101, feature amounts that are useful for estimating the reproduction processing. Usable feature amounts include those representing the spoken content, such as notation information, reading information, part-of-speech information, language likelihood, and tf-idf values calculated using other parts of the recognition result or external resources, and those indicating the quality of the vocalization, such as the posterior probability (reliability) of each predetermined unit (a word, phrase, sentence, or the like) and the acoustic likelihood. These can be used individually or in combination.
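As a non-limiting illustration of one of the feature amounts named above, the sketch below computes a tf-idf value for a word from other recognized utterances. It assumes the common definition tf × log(N / df); the function name and the sample word lists are hypothetical, and this description does not specify which tf-idf variant is used.

```python
import math
from typing import List, Sequence


def tf_idf(term: str, document: Sequence[str],
           corpus: List[Sequence[str]]) -> float:
    """Return tf * idf for `term` in `document`, with idf = log(N / df)."""
    if not document:
        return 0.0
    # Term frequency: share of the words in this document equal to `term`.
    tf = sum(1 for word in document if word == term) / len(document)
    # Document frequency: number of corpus documents containing `term`.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical recognized word strings, one list per utterance.
utterances = [
    ["the", "budget", "committee", "met"],
    ["the", "committee", "adjourned"],
    ["the", "mayor", "spoke"],
]
```

Under this definition a word that appears in every utterance scores zero, while a word confined to one utterance scores higher, which is one way to single out content-bearing words.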

They can also be used in combination with feature amounts that can be obtained without speech recognition, such as the speech rate of each predetermined unit (the number of morae per unit time) and the S/N ratio. The voice reproduction processing estimation means 103 determines, for the input feature amounts, the reproduction process predetermined for those feature amounts. For example, a specific word (such as a proper noun) or reading is reproduced slowly, or a part whose reliability is at or below a certain threshold is reproduced slowly or repeatedly. In this way a preferable voice reproduction process is selected for each feature amount or combination of feature amounts. The voice reproduction processing means 104 applies the voice reproduction process determined by the voice reproduction processing estimation means 103 to the speech to be reproduced.
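As a non-limiting illustration of the two feature amounts mentioned above, the sketch below computes a speech rate in morae per second from a kana reading and an S/N ratio in decibels from signal and noise power. The function names are hypothetical, and the mora count is a simplification that treats every kana as one mora except the small kana that merge with the preceding kana.

```python
import math

# Small kana that merge with the preceding kana into a single mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(reading: str) -> int:
    """Count morae in a kana reading (simplified rule, see lead-in)."""
    return sum(1 for ch in reading
               if ch not in SMALL_KANA and not ch.isspace())


def speech_rate(reading: str, duration_sec: float) -> float:
    """Morae per second for a unit with the given reading and duration."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return count_morae(reading) / duration_sec


def snr_db(signal_power: float, noise_power: float) -> float:
    """S/N ratio in decibels: 10 * log10(signal power / noise power)."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)
```

For example, the reading "とうきょう" counts as four morae, so a 0.5-second utterance of it has a speech rate of eight morae per second.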

Examples of voice reproduction processing include the following.
1) Speech speed conversion (reproducing slowly or quickly),
2) Section repeat reproduction (designating a section that is hard to hear and reproducing it repeatedly),
3) Noise suppression (suppressing non-speech noise; the speech portion may be distorted),
4) Volume conversion (making the sound louder or quieter),
5) Frequency conversion (raising or lowering the pitch of the sound), and so on.
The speech processed in this way can of course simply be reproduced as is. In addition, output by visual means is possible, such as highlighting on a display the parts to which voice reproduction processing has been applied.
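As a non-limiting illustration, two of the operations listed above (volume conversion and section repeat reproduction) can be sketched on a list of samples normalized to the range -1.0 to 1.0. The function names and the sample representation are assumptions made for this sketch. Speech speed conversion that preserves pitch requires a time-stretching algorithm and is not shown.

```python
from typing import List


def change_volume(samples: List[float], gain: float) -> List[float]:
    """Scale each sample by `gain`, clipping to the range [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def repeat_section(samples: List[float], start: int, end: int,
                   times: int = 2) -> List[float]:
    """Return the samples with the section [start, end) played `times` times."""
    if not (0 <= start <= end <= len(samples)):
        raise ValueError("invalid section bounds")
    if times < 1:
        raise ValueError("times must be at least 1")
    section = samples[start:end]
    return samples[:start] + section * times + samples[end:]
```

Both functions return new lists and leave the input untouched, so several operations can be chained on the same source speech.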

Next, an example of the overall operation of this embodiment is described in detail with reference to the flowchart of FIG. 2.
First, speech recognition is performed on the input speech using the speech recognition means 101 (step A1). As a result, the following can be obtained for the input speech, for example: the distinction between speech and non-speech sections, the recognition result word string for each speech section, the part of speech and reading of each word, the acoustic likelihood, language likelihood, and posterior probability, and a word lattice containing lower-ranked candidate word strings. Next, the recognition feature amount extraction means 102 extracts, from the obtained speech recognition result, the recognition feature amounts used to determine an appropriate reproduction process (step A2). Usable recognition feature amounts include features of the utterance content (for example, the recognized words themselves, part-of-speech information such as whether a word is a proper noun, and language likelihood), features of how easy the speech is to hear (for example, reading information), and features of the quality of the vocalization (for example, acoustic likelihood and posterior probability). These can be combined with one another and with other features obtained from sources other than the recognition result, such as speech rate and S/N ratio.
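As a non-limiting illustration of step A2, the sketch below represents one recognized word with the kinds of information listed above and turns it into a dictionary of recognition feature amounts. The class name, the field names, and the part-of-speech label "proper_noun" are hypothetical and do not correspond to any particular speech recognizer.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class RecognizedWord:
    """One word of a recognition result (hypothetical layout)."""
    surface: str              # recognized word as written
    reading: str              # kana reading
    part_of_speech: str       # e.g. "proper_noun", "verb"
    start: float              # start time in seconds
    end: float                # end time in seconds
    acoustic_likelihood: float
    language_likelihood: float
    posterior: float          # posterior probability (reliability), 0.0 to 1.0


def extract_features(word: RecognizedWord) -> Dict[str, Any]:
    """Collect content-related and vocalization-related feature amounts."""
    return {
        "surface": word.surface,
        "reading": word.reading,
        "is_proper_noun": word.part_of_speech == "proper_noun",
        "language_likelihood": word.language_likelihood,
        "acoustic_likelihood": word.acoustic_likelihood,
        "posterior": word.posterior,
        "duration": word.end - word.start,
    }
```

Keeping the time span with each word lets a later stage apply the selected reproduction process to exactly the section of speech the features describe.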

Subsequently, the voice reproduction processing estimation means 103 determines the voice reproduction process corresponding to the recognition feature amounts (step A3). Here, the appropriate voice reproduction process to apply is determined from the obtained recognition feature amounts according to predetermined rules. Possible rules include, for example, slowing down the speech speed for parts with a low posterior probability, or slowing down the speech speed for parts with a specific part of speech, such as proper nouns. Finally, the voice reproduction processing means 104 applies the determined voice reproduction process to the input speech (step A4).
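As a non-limiting illustration of step A3, the two example rules above can be sketched as a function that maps the feature amounts of one word to a reproduction process. The dictionary keys, the action labels "slow_down" and "none", and the threshold value of 0.5 are assumptions made for this sketch.

```python
from typing import Any, Dict, List

# Parts of speech that are reproduced slowly (assumed label set).
SLOW_PARTS_OF_SPEECH = {"proper_noun"}


def estimate_processing(features: Dict[str, Any],
                        posterior_threshold: float = 0.5) -> str:
    """Select a reproduction process for one word by predetermined rules."""
    # Rule 1: a low posterior probability suggests unclear vocalization.
    if features.get("posterior", 1.0) < posterior_threshold:
        return "slow_down"
    # Rule 2: specific parts of speech, such as proper nouns.
    if features.get("part_of_speech") in SLOW_PARTS_OF_SPEECH:
        return "slow_down"
    return "none"


def plan_reproduction(words: List[Dict[str, Any]]) -> List[str]:
    """Apply the rules to every word of a recognition result, in order."""
    return [estimate_processing(w) for w in words]
```

The resulting list has one action per word, so step A4 can apply each action to the time span of the corresponding word.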

Next, the effect of this embodiment is described.
This embodiment uses feature amounts, obtained as a result of speech recognition, that reflect the content and nature of the speech to be reproduced, together with a voice reproduction processing estimation model that describes the processing to be performed on the speech for those feature amounts. This makes it possible to perform appropriate voice reproduction processing based on the content and characteristics of the utterance.

Next, another embodiment of the present invention is described in detail with reference to the drawings.
As shown in FIG. 3, this embodiment adds to the above configuration recognition feature amount history storage means 206, user operation acquisition means 207, user operation history storage means 208, voice reproduction processing estimation model learning means 209, and voice reproduction processing estimation model storage means 210.

The recognition feature amount history storage means 206 accumulates, in time series and in association with the input speech, the recognition feature amounts obtained when the speech recognition means 101 and the recognition feature amount extraction means 102 process the input speech. The user operation acquisition means 207 acquires which voice reproduction process the user applied to which part of the input speech (for example, whether the user reproduced it slowly or quickly, raised or lowered the volume, or reproduced it repeatedly). The user operation history storage means 208 accumulates the user operations acquired by the user operation acquisition means 207 in time series, in association with the input speech.

The voice reproduction processing estimation model learning means 209 uses both the recognition feature amounts for the input speech accumulated in the recognition feature amount history storage means 206 and the voice reproduction processes that the user performed on the input speech, accumulated in the user operation history storage means 208, to generate a voice reproduction processing estimation model that associates recognition feature amounts with the voice reproduction processes the user performed.

Specifically, as shown for example in FIG. 4, the learning means counts how often each combination of a voice reproduction operation and the recognition feature amounts present when the operation was performed occurs, and associates the operation with the feature amounts for every combination whose frequency is at or above a fixed value. A voice reproduction processing estimation model can be generated in this way. Machine learning methods using, for example, a discriminative model or a generative model can also be applied in this process. When performing a voice reproduction operation, a user may vary the start and end of the target speech portion slightly from one operation to the next. To handle this, when an operation is associated with the speech portion on which it was performed, deviations in the start and end of the speech portion can be tolerated so that the portions are treated as the same speech portion.
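As a non-limiting illustration of the frequency-based association described above, the sketch below counts (feature, operation) pairs from a history and keeps those that occur at least a minimum number of times. It also shows a tolerance test for treating two slightly different time spans as the same speech portion. The names, the feature and operation labels, the threshold of 3, and the tolerance of 0.2 seconds are all assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Section = Tuple[float, float]  # (start_sec, end_sec)


def same_section(a: Section, b: Section, tolerance: float = 0.2) -> bool:
    """Treat two sections as identical if start and end differ only slightly."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def learn_model(history: Iterable[Tuple[str, str]],
                min_count: int = 3) -> Dict[str, List[str]]:
    """Map each feature label to the operations seen with it >= min_count times.

    `history` holds (feature_label, operation) pairs, one per user operation.
    """
    counts = Counter(history)
    model: Dict[str, List[str]] = {}
    for (feature, operation), n in sorted(counts.items()):
        if n >= min_count:
            model.setdefault(feature, []).append(operation)
    return model
```

With a history in which "slow_down" was applied to proper nouns three times and "repeat" once, only the pair (proper noun, slow_down) would enter the model at the default threshold.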

The voice reproduction processing estimation model generated here can be built per user, or the models of several users can be merged and used as a shared model. This can be realized, for example, by taking the associations of FIG. 4 from a specific user or by aggregating them over multiple users. The voice reproduction processing estimation means 103 selects the voice reproduction process by referring to the voice reproduction processing estimation model stored in the voice reproduction processing estimation model storage means 210, either instead of or together with the predetermined rules.
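As a non-limiting illustration, the sketch below aggregates per-user counts of (feature, operation) pairs into shared counts, and selects a reproduction process from a learned model with a fallback to a predetermined rule. Using the model first and the rule second is only one assumed way of using the two together; the names and labels are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


def merge_user_counts(per_user: Dict[str, Counter]) -> Counter:
    """Sum the (feature, operation) counts of several users into one Counter."""
    total: Counter = Counter()
    for counts in per_user.values():
        total.update(counts)
    return total


def select_processing(feature: str, model: Dict[str, str],
                      rule: Callable[[str], str]) -> str:
    """Prefer the learned model; fall back to the predetermined rule."""
    if feature in model:
        return model[feature]
    return rule(feature)
```

A per-user model is obtained by counting from one user's history only, and a shared model by counting from the merged result.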

According to this embodiment, the user operation history is associated with the recognition feature amounts at the time of each operation, so a model reflecting the operator's preferences and habits can be created and applied, and appropriate voice reproduction processing that reflects those preferences and habits can be performed.

The present invention has been described above by way of preferred embodiments. Although specific examples have been shown, it is clear that various modifications and changes can be made to these examples without departing from the broad spirit and scope of the invention as defined in the claims.

Brief description of the drawings

FIG. 1 is a block diagram showing a configuration example of the best mode for carrying out the first invention.
FIG. 2 is a flowchart showing an operation example of the best mode for carrying out the first invention.
FIG. 3 is a block diagram showing the configuration of the best mode for carrying out the second invention.
FIG. 4 is a diagram showing an example of the association between user operations and recognition feature amounts.

Explanation of symbols

101 Speech recognition means
102 Recognition feature amount extraction means
103 Voice reproduction processing estimation means
104 Voice reproduction processing means
206 Recognition feature amount history storage means
207 User operation acquisition means
208 User operation history storage means
209 Voice reproduction processing estimation model learning means
210 Voice reproduction processing estimation model storage means

Claims

1. A voice reproduction system comprising:
speech recognition means for recognizing input speech;
recognition feature amount extraction means for extracting a recognition feature amount from the speech recognized by the speech recognition means;
voice reproduction processing estimation means for selecting a predetermined voice reproduction process according to the recognition feature amount extracted by the recognition feature amount extraction means; and
voice reproduction processing means for performing the voice reproduction process selected by the voice reproduction processing estimation means.

2. The voice reproduction system according to claim 1, wherein the recognition feature amount extraction means extracts linguistic information about the utterance content as a feature amount.

3. The voice reproduction system according to claim 1, wherein the recognition feature amount extraction means extracts acoustic information about pronunciation ambiguity as a feature amount.

4. The voice reproduction system according to any one of claims 1 to 3, wherein the recognition feature amount extraction means extracts a posterior probability as a feature amount.

5. The voice reproduction system according to any one of claims 1 to 4, wherein the recognition feature amount estimation means estimates a voice reproduction process by combining a plurality of feature amounts.

Furthermore, user operation acquisition means for acquiring a user operation;
User operation history storage means for storing a history of the user operations acquired by the user operation acquisition means;
Recognition feature amount history storage means for storing a history of the recognition feature amounts extracted by the recognition feature amount extraction means;
Voice reproduction processing estimation model learning means for generating a voice reproduction processing estimation model from the relationship between the user operation history stored in the user operation history storage means and the recognition feature amounts stored in the recognition feature amount history storage means;
Voice reproduction processing estimation model storage means for storing the voice reproduction processing estimation model generated by the voice reproduction processing estimation model learning means;
6. The voice reproduction processing estimation means estimates the voice reproduction process with reference to the voice reproduction processing estimation model stored in the voice reproduction processing estimation model storage means. The voice reproduction system according to claim 1.

A speech recognition step for recognizing input speech;
A recognition feature amount extraction step for extracting a recognition feature amount from the speech recognized in the speech recognition step;
A voice reproduction process estimation step for selecting a predetermined voice reproduction process according to the recognition feature amount extracted in the recognition feature amount extraction step;
A voice reproduction method comprising: a voice reproduction processing step for performing the voice reproduction process selected in the voice reproduction process estimation step.
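
The steps of this method can be pictured with a minimal sketch. It assumes the speech recognition step has already run and produced per-word results; the `RecognizedWord` structure, the thresholds, the process names (`normal`, `slow`, `repeat`), and the playback settings are all hypothetical illustrations and are not taken from the claims. Actual audio output is reduced to building a playback plan.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One unit of recognition output (hypothetical structure)."""
    text: str          # recognized character string
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # stand-in for a posterior probability in [0, 1]


def extract_features(words):
    """Recognition feature amount extraction: one feature dict per word."""
    return [{"length": len(w.text), "confidence": w.confidence} for w in words]


def estimate_process(feature, low_confidence=0.5, long_word=10):
    """Estimation step: select a predetermined process from the features.

    The thresholds are illustrative assumptions, not values from the source.
    """
    if feature["confidence"] < low_confidence:
        return "repeat"   # unclear pronunciation: replay the section
    if feature["length"] >= long_word:
        return "slow"     # long, possibly difficult word: lower the speed
    return "normal"


# Assumed playback settings for each predetermined process.
SETTINGS = {
    "normal": {"rate": 1.0, "repeats": 1},
    "slow": {"rate": 0.7, "repeats": 1},
    "repeat": {"rate": 1.0, "repeats": 2},
}


def plan_reproduction(words):
    """Processing step, reduced to building a playback plan per section."""
    plan = []
    for word, feature in zip(words, extract_features(words)):
        process = estimate_process(feature)
        plan.append({"start": word.start, "end": word.end,
                     "process": process, **SETTINGS[process]})
    return plan
```

In this sketch the estimation is a fixed rule; a learned estimation model could replace `estimate_process` without changing the other steps.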

8. The voice reproduction method according to claim 7, wherein the recognition feature amount extraction step extracts linguistic information about the utterance content as a feature amount.

9. The voice reproduction method according to claim 7, wherein the recognition feature amount extraction step extracts acoustic information relating to pronunciation ambiguity as a feature amount.

10. The voice reproduction method according to claim 7, wherein the recognition feature amount extraction step extracts a posterior probability as a feature amount.
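
The claims do not state how the posterior probability is obtained. One common approach, shown here only as an assumption, is to normalize the log scores of competing recognition hypotheses (for example an N-best list) so that they sum to one. The function name `nbest_posteriors` and the input format are hypothetical.

```python
import math


def nbest_posteriors(log_scores):
    """Turn hypothesis log scores into posteriors that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow
    (the usual log-sum-exp trick).
    """
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [weight / total for weight in weights]
```

A posterior near 1 for the best hypothesis suggests clear pronunciation, while a value spread across several hypotheses suggests an ambiguous section.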

The voice reproduction method according to any one of claims 7 to 10, wherein the voice reproduction process estimation step estimates the voice reproduction process by combining a plurality of feature amounts.

Furthermore, a user operation acquisition step for acquiring a user operation;
A user operation history storage step for storing a history of the user operations acquired in the user operation acquisition step;
A recognition feature amount history storage step for storing a history of the recognition feature amounts extracted in the recognition feature amount extraction step;
A voice reproduction process estimation model learning step for generating a voice reproduction process estimation model from the relationship between the user operation history stored in the user operation history storage step and the recognition feature amounts stored in the recognition feature amount history storage step;
A voice reproduction process estimation model storing step for storing the voice reproduction process estimation model generated in the voice reproduction process estimation model learning step;
12. The voice reproduction process estimation step estimates the voice reproduction process with reference to the voice reproduction process estimation model stored in the voice reproduction process estimation model storing step. The voice reproduction method according to item.
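
The learning described here can be sketched as follows. This is a minimal illustration, not the model used in the source: it assumes each history entry pairs one confidence value with the operation the user performed on that section, buckets the confidence, and keeps the most frequent operation per bucket. The names `bucket`, `learn_model`, and `estimate`, and the operation labels, are hypothetical.

```python
from collections import Counter, defaultdict


def bucket(confidence, n_bins=5):
    """Map a confidence in [0, 1] to a bin index in 0..n_bins-1."""
    return min(int(confidence * n_bins), n_bins - 1)


def learn_model(feature_history, operation_history, n_bins=5):
    """Learn, per confidence bin, the operation users chose most often."""
    counts = defaultdict(Counter)
    for confidence, operation in zip(feature_history, operation_history):
        counts[bucket(confidence, n_bins)][operation] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}


def estimate(model, confidence, default="normal", n_bins=5):
    """Look up the learned operation; fall back to a default for unseen bins."""
    return model.get(bucket(confidence, n_bins), default)
```

For example, if users tended to replay low-confidence sections, `estimate` would return the replay operation for new sections with similarly low confidence.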

A process of recognizing input speech;
A process of extracting a recognition feature amount from the speech recognized by the process of recognizing the input speech;
A process of selecting a predetermined voice reproduction process according to the recognition feature amount extracted by the process of extracting the recognition feature amount;
A program that causes a computer to execute a process of performing the voice reproduction process selected by the process of selecting the voice reproduction process.

The program according to claim 13, wherein the process of extracting the recognition feature amount causes the computer to execute a process of extracting linguistic information about the utterance content as a feature amount.

The program according to claim 13 or 14, wherein the process of extracting the recognition feature amount causes the computer to execute a process of extracting acoustic information about pronunciation ambiguity as a feature amount.

The program according to any one of claims 13 to 15, wherein the process of extracting the recognition feature amount causes the computer to execute a process of extracting a posterior probability as a feature amount.

The program according to any one of claims 13 to 16, wherein the process of selecting the voice reproduction process causes the computer to execute a process of estimating the voice reproduction process by combining a plurality of feature amounts.

Furthermore, a process of acquiring a user operation;
A process of accumulating a history of the user operations acquired by the process of acquiring the user operation;
A process of accumulating a history of the recognition feature amounts extracted by the process of extracting the recognition feature amount;
A process of generating a voice reproduction process estimation model from the relationship between the user operation history accumulated by the process of accumulating the history of the user operations and the recognition feature amounts accumulated by the process of accumulating the history of the recognition feature amounts;
A process of storing the voice reproduction process estimation model generated by the process of generating the voice reproduction process estimation model;
The process of selecting the voice reproduction process causes the computer to execute a process of estimating the voice reproduction process with reference to the voice reproduction process estimation model stored by the process of storing the voice reproduction process estimation model. 18. The program according to any one of items 17.