JP2016001235A

JP2016001235A - Information processor, terminal device and program

Info

Publication number: JP2016001235A
Application number: JP2014120628A
Authority: JP
Inventors: 悠哉藤田; Yuya Fujita
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-06-11
Filing date: 2014-06-11
Publication date: 2016-01-07

Abstract

PROBLEM TO BE SOLVED: To provide an information processor capable of reducing the load on a terminal device of the processing that separates a background sound and a speech sound. SOLUTION: An information processor includes an audio signal acquisition part for acquiring an audio signal in which a speech sound uttered by a person and a background sound are mixed, a phoneme component information calculation part for calculating phoneme component information showing the phoneme components included in the audio signal with reference to phoneme information showing the phonemes constituting the speech sound, and a notification part for notifying a device that adjusts the ratio between the speech sound and the background sound in the audio signal of the phoneme component information.

Description

The present invention relates to an information processing device, a terminal device, and a program.

Conventionally, a signal processing technique called sound source separation has been used to emphasize or suppress the individual sounds in an audio signal in which background sound is mixed with sound uttered by a person (hereinafter, speech sound). This technique separates a signal in which audio signals from a plurality of sound sources are mixed into signals derived from the individual sources. Depending on the mixing process, however, separation can be difficult. For example, when music is mixed with speech sound in television or radio, the number of instruments varies with the type of music, so the number of sound sources is unknown. Moreover, in stereo broadcasting the sources must be separated from a mixed signal of only two channels, left and right.

In many cases, such a sound source separation problem is an ill-posed problem in which the number of sound sources exceeds the number of mixed signals. An exact solution therefore cannot be obtained, and some methods obtain an approximate solution by using prior knowledge about the sound sources. For example, in stereo broadcasting, speech sound is often mixed into the left and right channels at the same level so that it is localized at the center. One method exploits this property to separate the background sound component from the speech sound component and then readjusts the mixing ratio (see Patent Document 1). Another method separates the background sound and the speech sound using a technique called non-negative matrix factorization and then readjusts the mixing ratio (see Non-Patent Document 1).

Patent Document 1: JP 2009-025500 A

Non-Patent Document 1: Makoto Hirohata and two others, "Non-delayed sound source separation technology for various video contents", Acoustical Society of Japan, Proceedings of the 2013 Spring Meeting of the Acoustical Society of Japan, Presentation No. 3-10-18, pp. 799-802, March 2013

However, when the conventional methods are used in a terminal device that receives broadcasts or the like, the processing load of separating the background sound and the speech sound can be large.

The present invention has been made in view of these circumstances, and provides an information processing device, a terminal device, and a program that can reduce the load on a terminal device of the processing that separates background sound and speech sound.

The present invention has been made to solve the problem described above. One aspect of the present invention is an information processing device comprising: an audio signal acquisition unit that acquires an audio signal in which a speech sound uttered by a person and a background sound are mixed; a phoneme component information calculation unit that refers to phoneme information representing the phonemes constituting speech sound and calculates phoneme component information representing the components of those phonemes contained in the audio signal; and a notification unit that notifies a device that adjusts the ratio between the speech sound and the background sound in the audio signal of the phoneme component information.

Another aspect of the present invention is the information processing device described above, wherein the phoneme component information calculation unit refers to the phoneme information and decomposes a first non-negative matrix, which is the short-time Fourier transformed audio signal, into a second non-negative matrix, which is the phoneme information, a third non-negative matrix, which is the phoneme component information, and a fourth non-negative matrix and a fifth non-negative matrix, which relate to the components other than the phonemes in the sound represented by the audio signal.

Another aspect of the present invention is a terminal device comprising: a phoneme information storage unit that stores phoneme information representing the phonemes constituting speech sound uttered by a person; an audio signal acquisition unit that acquires an audio signal in which a speech sound and a background sound are mixed; a phoneme component information acquisition unit that acquires phoneme component information representing the components of the phonemes contained in the audio signal; a separation unit that refers to the phoneme information and the phoneme component information and separates the speech sound and the background sound contained in the audio signal; and a mixing unit that adjusts the ratio between the speech sound and the background sound separated by the separation unit and mixes them.

Another aspect of the present invention is the terminal device described above, wherein the separation unit comprises: a speech sound generation unit that refers to the phoneme information and the phoneme component information and generates information representing the speech sound contained in the audio signal; and a background sound separation unit that generates information representing the background sound by subtracting the sound represented by the information generated by the speech sound generation unit from the audio signal.

Another aspect of the present invention is a program for causing a computer to function as: a phoneme information storage unit that stores phoneme information representing the phonemes constituting speech sound uttered by a person; an audio signal acquisition unit that acquires an audio signal in which a speech sound and a background sound are mixed; a phoneme component information acquisition unit that acquires phoneme component information representing the components of the phonemes contained in the audio signal; a separation unit that refers to the phoneme information and the phoneme component information and separates the speech sound and the background sound contained in the audio signal; and a mixing unit that adjusts the ratio between the speech sound and the background sound separated by the separation unit and mixes them.

According to the present invention, the load on the terminal device of the processing that separates the background sound and the speech sound can be reduced.

FIG. 1 is a schematic block diagram showing the configuration of an audio distribution system according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram showing the configuration of the phoneme information distribution device 11 according to the embodiment.
FIG. 3 is a schematic block diagram showing the configuration of the phoneme component information distribution device 12 according to the embodiment.
FIG. 4 is a schematic block diagram showing the configuration of the terminal device 31 according to the embodiment.

Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a schematic block diagram showing the configuration of an audio distribution system according to an embodiment of the present invention. The audio distribution system includes a phoneme information distribution device 11, a phoneme component information distribution device 12, an audio signal distribution device 13, a network 21, and a plurality of terminal devices 31. The phoneme information distribution device 11 extracts phoneme information Us, which represents each of the phonemes that make up speech sound, from a large amount of audio signal S that contains only speech sound and no background sound. Here, the phoneme information Us is the spectral distribution of each phoneme. The phoneme information distribution device 11 distributes the extracted phoneme information Us to each terminal device 31 via the network 21. It also notifies the phoneme component information distribution device 12 of the extracted phoneme information Us. The extraction and distribution of the phoneme information Us by the phoneme information distribution device 11 are performed before audio is distributed to the terminal devices 31.

The phoneme component information distribution device 12 (information processing device) refers to the phoneme information Us notified by the phoneme information distribution device 11 and calculates phoneme component information Hp, which indicates the component of each phoneme represented by the phoneme information Us in the audio signal P to be distributed. The phoneme component information distribution device 12 distributes the calculated phoneme component information Hp to each terminal device 31 via the network 21. The calculation and distribution of the phoneme component information Hp are performed in parallel with the distribution of the audio signal P.

The audio signal distribution device 13 distributes the audio signal P, which consists of speech sound and background sound, to each terminal device 31 via the network 21. The audio signal distribution device 13 may distribute the audio signal P as it is, or may compress and encode it before distribution.

The network 21 may be, for example, a packet-switched network such as the Internet, or a network using broadcast waves. The phoneme information distribution device 11, the phoneme component information distribution device 12, and the audio signal distribution device 13 may each use a different network 21 for distribution to the terminal devices 31. For example, the phoneme information distribution device 11 and the phoneme component information distribution device 12 may distribute to the terminal devices 31 over the Internet, while the audio signal distribution device 13 distributes to the terminal devices 31 by broadcast waves.

The terminal device 31 receives and stores the phoneme information Us distributed in advance via the network 21. When the terminal device 31 receives the audio signal P and the phoneme component information Hp, it refers to the stored phoneme information Us and the received phoneme component information Hp, adjusts the ratio between the background sound and the speech sound mixed in the audio signal P, and outputs the audio with the adjusted ratio.

FIG. 2 is a schematic block diagram showing the configuration of the phoneme information distribution device 11. The phoneme information distribution device 11 includes an audio signal acquisition unit 111, a short-time Fourier transform unit 112, and a phoneme information generation unit 113. The audio signal acquisition unit 111 acquires an audio signal S consisting only of speech sound. The short-time Fourier transform unit 112 applies a short-time Fourier transform to the audio signal S.

The short-time Fourier transform is the following transform. 1) Sections of a fixed time width are cut out of the time waveform of the audio, starting from the beginning and shifting by a fixed time interval. 2) Each extracted section is subjected to a discrete Fourier transform. 3) The square of the absolute value of the discrete Fourier transform result is taken. This yields a spectrogram that represents, for each section, the magnitude of each frequency component.
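The three steps above can be sketched in Python with NumPy. This is a minimal illustration, not the patent's implementation: the function name `power_spectrogram`, its parameters, and the absence of a window function are all assumptions made for the example.

```python
import numpy as np

def power_spectrogram(g, width, shift):
    """Return an F x N matrix of squared DFT magnitudes.

    g: 1-D time waveform; width: samples per section (T);
    shift: samples between section starts (the shift interval).
    No window function is applied, which is an assumption of this sketch.
    """
    g = np.asarray(g, dtype=float)
    if len(g) < width:
        raise ValueError("signal is shorter than one section")
    n_frames = 1 + (len(g) - width) // shift
    # Step 1: cut out fixed-width sections, shifting from the start of the signal.
    frames = np.stack([g[n * shift : n * shift + width] for n in range(n_frames)])
    # Step 2: discrete Fourier transform of each section (real input, so rfft).
    spectra = np.fft.rfft(frames, axis=1)
    # Step 3: squared absolute value; transpose so that rows are frequency bins.
    return (np.abs(spectra) ** 2).T

# Usage: a 1 kHz tone sampled at 48 kHz, 10 ms sections with a 5 ms shift.
t = np.arange(4800) / 48000.0
X = power_spectrogram(np.sin(2 * np.pi * 1000.0 * t), width=480, shift=240)
```

Each column of `X` is one section and each row is one frequency bin, matching the row and column roles of the index pair (f, n) used in the text.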

The short-time Fourier transform of a signal g(t) is expressed by Equation (2), using G(f, n) given by Equation (1). In Equation (1), Δt is the shift interval; n = 1, 2, ..., N is the shift index; T is the time width of each extracted section; DFT() is the discrete Fourier transform; and f = 1, 2, ..., F is the index of each frequency bin obtained by the discrete Fourier transform.

Hereinafter, the spectrogram matrix X of Equation (2) obtained when g(t) in Equation (1) is the audio signal S, that is, the result of transforming the audio signal S with the short-time Fourier transform unit 112, is referred to as the spectrogram matrix Xs.

The phoneme information generation unit 113 applies non-negative matrix factorization to the result of the short-time Fourier transform unit 112, that is, to the spectrogram matrix Xs, and thereby generates the phoneme information Us. Non-negative matrix factorization decomposes a matrix whose elements are all non-negative (a non-negative matrix), such as observation data, into the product of two non-negative matrices. Several computation methods for non-negative matrix factorization are known and any of them may be used; the method based on the KL divergence (Kullback-Leibler divergence) is described here.

The KL divergence between the product of the matrices Us and Hs and the spectrogram matrix Xs is expressed by Equation (3). It is known that the matrices Us and Hs that minimize the KL divergence of Equation (3) can be obtained by repeating the operations of Equations (4) and (5). Here, x_Sij, u_Sij, and h_Sij are the elements in row i and column j of the matrices Xs, Us, and Hs, respectively.
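Equations (3) to (5) are not reproduced in this text. For orientation, the widely used generalized KL divergence for non-negative matrix factorization and its multiplicative update rules are given below in the element notation of the paragraph above. This is the standard textbook form and is an assumption here; the patent's own equations may differ in notation or detail.

```latex
% Generalized KL divergence between Xs and the product Us Hs (assumed form of Eq. (3))
D(X_S \,\|\, U_S H_S) = \sum_{i}\sum_{j}
  \left( x_{Sij} \log \frac{x_{Sij}}{(U_S H_S)_{ij}} - x_{Sij} + (U_S H_S)_{ij} \right)

% Multiplicative updates (assumed forms of Eqs. (4) and (5))
u_{Sij} \leftarrow u_{Sij}\,
  \frac{\sum_{k} h_{Sjk}\, x_{Sik} / (U_S H_S)_{ik}}{\sum_{k} h_{Sjk}}
\qquad
h_{Sij} \leftarrow h_{Sij}\,
  \frac{\sum_{k} u_{Ski}\, x_{Skj} / (U_S H_S)_{kj}}{\sum_{k} u_{Ski}}
```

Because each update multiplies an element by a non-negative factor, matrices initialized with non-negative values stay non-negative throughout the iteration.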

The phoneme information generation unit 113 therefore alternates between computing the KL divergence by Equation (3) and performing the operations of Equations (4) and (5). When the decrease in the KL divergence of Equation (3) across one pass of Equations (4) and (5) falls to or below a fixed value, the phoneme information generation unit 113 ends the iteration and takes the matrix Us at that point as the phoneme information Us. It is known that factorizing the spectrogram matrix Xs of an audio signal S consisting only of speech sound in this way causes the matrix Us to learn the spectral patterns that characteristically appear in speech sound (see, for example, Non-Patent Document 1). The phoneme information generation unit 113 distributes the phoneme information Us to the terminal devices 31 via the network 21. It also notifies the phoneme component information distribution device 12 of the phoneme information Us.
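The alternating procedure and its stopping rule can be sketched as follows. This is a minimal illustration using the standard multiplicative updates for KL-divergence NMF, which are assumed here to correspond to Equations (4) and (5). The names `nmf_kl`, `kl_divergence`, `tol`, `max_iter`, `seed`, and the small constant `EPS` are hypothetical choices for the example.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0); an assumption of this sketch

def kl_divergence(X, U, H):
    """Generalized KL divergence between X and the product U @ H."""
    V = U @ H + EPS
    return float(np.sum(X * np.log((X + EPS) / V) - X + V))

def nmf_kl(X, n_bases, tol=1e-4, max_iter=500, seed=0):
    """Factorize non-negative X (F x N) into U (F x K) and H (K x N)."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    U = rng.random((F, n_bases)) + EPS
    H = rng.random((n_bases, N)) + EPS
    prev = kl_divergence(X, U, H)
    for _ in range(max_iter):
        # Update the basis matrix U, then the activation matrix H.
        V = U @ H + EPS
        U *= ((X / V) @ H.T) / (H.sum(axis=1) + EPS)
        V = U @ H + EPS
        H *= (U.T @ (X / V)) / (U.sum(axis=0)[:, None] + EPS)
        # Stop once the decrease in divergence falls to or below the threshold.
        cur = kl_divergence(X, U, H)
        if prev - cur <= tol:
            break
        prev = cur
    return U, H

# Usage with a synthetic non-negative "spectrogram" built from 3 components.
rng = np.random.default_rng(1)
Xs = rng.random((16, 3)) @ rng.random((3, 40))
Us, Hs = nmf_kl(Xs, n_bases=3)
```

In the setting of the text, `Xs` would be the spectrogram of speech-only audio and the columns of `Us` would play the role of the phoneme spectral patterns.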

FIG. 3 is a schematic block diagram showing the configuration of the phoneme component information distribution device 12. The phoneme component information distribution device 12 includes a phoneme information storage unit 121, an audio signal acquisition unit 122, a short-time Fourier transform unit 123, a phoneme component information calculation unit 124, and a phoneme component information distribution unit 125. The phoneme information storage unit 121 receives and stores the phoneme information Us notified by the phoneme information distribution device 11. The audio signal acquisition unit 122 acquires the audio signal P to be distributed to the terminal devices 31. The short-time Fourier transform unit 123 applies a short-time Fourier transform to the audio signal P acquired by the audio signal acquisition unit 122. To do so, it divides the audio signal P into sections of time width L and performs the same computation as the short-time Fourier transform unit 112 of FIG. 2 on each section. Hereinafter, the spectrogram matrix obtained by applying the short-time Fourier transform to the section from time t to time t + L is referred to as the spectrogram matrix Xp^(t).

The phoneme component information calculation unit 124 refers to the phoneme information Us (second non-negative matrix) stored in the phoneme information storage unit 121 and calculates a matrix Hp^(t) (third non-negative matrix), which represents the proportion of speech sound contained at each time in the audio represented by the spectrogram matrix Xp^(t) (first non-negative matrix), and a matrix Hn^(t) (fourth non-negative matrix), which represents the proportion of sound other than speech sound, that is, background sound. Calculating the matrices Hp^(t) and Hn^(t) corresponds to the problem of finding the Hp^(t), Un^(t), and Hn^(t) that minimize the KL divergence expressed by Equation (6). It is known that this problem can be solved by repeatedly computing Equations (7), (8), and (9).

The phoneme component information calculation unit 124 therefore alternates between computing the KL divergence by Equation (6) and performing the operations of Equations (7), (8), and (9). When the decrease in the KL divergence of Equation (6) across one pass of Equations (7), (8), and (9) falls to or below a fixed value, the phoneme component information calculation unit 124 ends the iteration and includes the matrix Hp^(t) at that point in the phoneme component information Hp. In the same way, the phoneme component information calculation unit 124 calculates the matrices Hp^(t+L), Hp^(t+2×L), ... and includes them in the phoneme component information Hp.
The phoneme component information distribution unit 125 (notification unit) distributes the phoneme component information Hp calculated by the phoneme component information calculation unit 124 to the terminal devices 31 via the network 21.
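The calculation of Hp^(t) with the phoneme matrix held fixed can be sketched as follows. The sketch assumes the model Xp ≈ Us·Hp + Un·Hn and uses the standard multiplicative KL updates for the three free matrices, assumed here to correspond to Equations (7) to (9). The function name `phoneme_activations` and the parameters `n_background`, `tol`, `max_iter`, and `seed` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # numerical guard; an assumption of this sketch

def phoneme_activations(Xp, Us, n_background, tol=1e-4, max_iter=500, seed=0):
    """Estimate Hp, Un, Hn such that Xp ~= Us @ Hp + Un @ Hn, with Us fixed."""
    rng = np.random.default_rng(seed)
    F, N = Xp.shape
    Hp = rng.random((Us.shape[1], N)) + EPS
    Un = rng.random((F, n_background)) + EPS
    Hn = rng.random((n_background, N)) + EPS

    def divergence():
        V = Us @ Hp + Un @ Hn + EPS
        return float(np.sum(Xp * np.log((Xp + EPS) / V) - Xp + V))

    prev = divergence()
    for _ in range(max_iter):
        # Update speech activations; the phoneme bases Us are never modified.
        R = Xp / (Us @ Hp + Un @ Hn + EPS)
        Hp *= (Us.T @ R) / (Us.sum(axis=0)[:, None] + EPS)
        # Update the background bases.
        R = Xp / (Us @ Hp + Un @ Hn + EPS)
        Un *= (R @ Hn.T) / (Hn.sum(axis=1) + EPS)
        # Update the background activations.
        R = Xp / (Us @ Hp + Un @ Hn + EPS)
        Hn *= (Un.T @ R) / (Un.sum(axis=0)[:, None] + EPS)
        cur = divergence()
        if prev - cur <= tol:
            break
        prev = cur
    return Hp, Un, Hn

# Usage with synthetic data: 4 phoneme bases plus 2 background components.
rng = np.random.default_rng(2)
Us = rng.random((16, 4))
Xp = Us @ rng.random((4, 30)) + rng.random((16, 2)) @ rng.random((2, 30))
Hp, Un, Hn = phoneme_activations(Xp, Us, n_background=2)
```

Only `Hp` would need to be sent to the terminal devices in the scheme described above; `Un` and `Hn` are by-products of the factorization.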

FIG. 4 is a schematic block diagram showing the configuration of the terminal device 31. The terminal device 31 includes a phoneme information acquisition unit 301, a phoneme information storage unit 302, a phoneme component information acquisition unit 303, a speech sound generation unit 304, an audio signal acquisition unit 305, a short-time Fourier transform unit 306, a background sound separation unit 307, a speech sound / background sound mixing unit 308, an inverse short-time Fourier transform unit 309, and an audio output unit 310. The speech sound generation unit 304, the short-time Fourier transform unit 306, and the background sound separation unit 307 together function as a separation unit 320.

The phoneme information acquisition unit 301 receives the phoneme information Us distributed by the phoneme information distribution device 11 via the network 21. The phoneme information storage unit 302 stores the phoneme information Us received by the phoneme information acquisition unit 301. The phoneme component information acquisition unit 303 receives the phoneme component information Hp distributed by the phoneme component information distribution device 12 via the network 21. The speech sound generation unit 304 multiplies the matrix Hp^(t) contained in the phoneme component information Hp received by the phoneme component information acquisition unit 303 by the phoneme information Us (matrix Us). This yields the spectrogram matrix of the speech sound.

The audio signal acquisition unit 305 receives the audio signal P distributed by the audio signal distribution device 13 via the network 21. The short-time Fourier transform unit 306 applies a short-time Fourier transform to the audio signal P in the same manner as the short-time Fourier transform unit 123 of FIG. 3. This yields the spectrogram matrix Xp^(t). The background sound separation unit 307 subtracts the spectrogram matrix calculated by the speech sound generation unit 304 from the spectrogram matrix Xp^(t) to obtain the spectrogram matrix of the background sound. For this subtraction, the spectrogram matrix calculated by the speech sound generation unit 304 and the spectrogram matrix Xp^(t) of the audio signal P must be synchronized in time. For example, the phoneme component information distribution device 12 attaches time stamps to the phoneme component information Hp before distribution, the audio signal distribution device 13 attaches time stamps to the audio signal P before distribution, and the background sound separation unit 307 uses these time stamps for synchronization.

The speech sound / background sound mixing unit 308 multiplies the spectrogram matrix of the speech sound calculated by the speech sound generation unit 304 by a preset coefficient α. Similarly, it multiplies the spectrogram matrix of the background sound calculated by the background sound separation unit 307 by a preset coefficient β. It then takes the sum of the speech sound spectrogram matrix multiplied by α and the background sound spectrogram matrix multiplied by β. This yields a spectrogram matrix Y^(t) in which the ratio between the speech sound and the background sound has been adjusted. The coefficients α and β may be set, for example, by the user of the terminal device 31. The processing by the speech sound generation unit 304, the background sound separation unit 307, and the speech sound / background sound mixing unit 308 is expressed by Equation (10).
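Read literally, the three steps in this paragraph amount to Y = α·(Us·Hp) + β·(Xp − Us·Hp). A minimal sketch of that reading follows; the function name `remix` is hypothetical, and the formula is inferred from the prose rather than copied from Equation (10).

```python
import numpy as np

def remix(Xp, Us, Hp, alpha, beta):
    """Re-balance speech and background in one spectrogram block."""
    speech = Us @ Hp          # speech sound generation unit 304
    background = Xp - speech  # background sound separation unit 307
    # Speech sound / background sound mixing unit 308: weighted sum.
    return alpha * speech + beta * background

# Usage: halve the background while keeping the speech level.
rng = np.random.default_rng(3)
Us_demo = rng.random((8, 3))
Hp_demo = rng.random((3, 5))
Xp_demo = Us_demo @ Hp_demo + rng.random((8, 5))
Y = remix(Xp_demo, Us_demo, Hp_demo, alpha=1.0, beta=0.5)
```

With α = β = 1 the function returns the original spectrogram, and with β = 0 it returns the speech estimate alone, which gives two quick sanity checks on the weights.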

The inverse short-time Fourier transform unit 309 applies an inverse short-time Fourier transform to the spectrogram matrix Y^(t) calculated by the speech sound / background sound mixing unit 308 and generates an audio signal y in which the ratio between the speech sound and the background sound has been adjusted. In this inverse transform, as shown in Equations (11) and (12), the inverse short-time Fourier transform unit 309 takes the square root of each element y_ij in row i and column j of the spectrogram matrix Y^(t), assigns to it the phase θ(i, j) of the element in row i and column j of the spectrogram matrix Xp^(t) to form a matrix Y', and applies an inverse discrete Fourier transform to each column of Y' to generate the audio signal y.
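The square root, phase reuse, and per-column inverse DFT can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: negative values left by the subtraction are clipped to zero before the square root, and sections are assumed not to overlap, so the inverse-transformed columns are simply concatenated. The function name `inverse_stft` is hypothetical.

```python
import numpy as np

def inverse_stft(Y, phase, width):
    """Rebuild a waveform from a power spectrogram Y (F x N) and phases (F x N)."""
    # Square root of each element; clipping negatives is an assumption.
    magnitude = np.sqrt(np.maximum(Y, 0.0))
    # Attach the phase of the mixture spectrogram to form the complex matrix Y'.
    Y_complex = magnitude * np.exp(1j * phase)
    # Inverse DFT of each column gives one time-domain section per column.
    frames = np.fft.irfft(Y_complex, n=width, axis=0)
    # Concatenate sections in order (non-overlapping sections assumed).
    return frames.T.reshape(-1)

# Usage: round trip of a random signal split into four 64-sample sections.
rng = np.random.default_rng(4)
g = rng.standard_normal(256)
spectra = np.fft.rfft(g.reshape(4, 64), axis=1).T
y = inverse_stft(np.abs(spectra) ** 2, np.angle(spectra), width=64)
```

With overlapping sections, an overlap-add step would be needed in place of the final concatenation.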

In Equation (12), IDFT() is the inverse discrete Fourier transform.
The audio output unit 310 is a speaker that outputs sound in accordance with the audio signal y.
Although the audio distribution system of this embodiment distributes the audio signal P to the terminal devices 31, it may distribute a video signal in addition to the audio signal P.
The audio distribution system may also omit the phoneme information distribution device 11, in which case the phoneme component information distribution device 12 and the terminal devices 31 store the same phoneme information Us in advance.

The audio signal distribution device 13 may also encode the audio signal P with a lossy compression scheme such as AAC (Advanced Audio Coding) or with a lossless compression scheme before distribution. The phoneme component information distribution device 12 may likewise encode the phoneme component information Hp with lossy compression before distribution.

As described above, the phoneme component information distribution device 12 includes the phoneme component information calculation unit 124, which refers to the phoneme information Us representing the phonemes constituting speech sound and calculates the phoneme component information Hp representing the phoneme components contained in the audio signal P, and the phoneme component information distribution unit 125, which notifies the terminal devices 31 of the phoneme component information Hp.

This reduces the load on the terminal device 31 of the processing that separates the background sound and the speech sound.
It also reduces the amount of information to be distributed compared with distributing the speech sound and the background sound separately. Consider, as an example, distributing a monaural audio signal with a sampling frequency of 48 kHz. When the time shift of the short-time Fourier transform is 5120 samples (about 100 ms) and the matrix of the phoneme information Us has 100 columns, a matrix Hp^(t) with 100 elements is transmitted every 100 ms as the phoneme component information Hp. If each element is a single-precision floating-point number (4 bytes), the bit rate is 100 × 4 bytes × 8 bits / 0.1 s = 32 kbps. Lossless compression can reduce this to roughly half, about 16 kbps.

On the other hand, in digital terrestrial broadcasting, which uses AAC as its encoding method, the audio bit rate is 72 kbps. Distributing the speech sound and the background sound separately therefore requires 72 kbps × 2 = 144 kbps, whereas distributing a single audio signal together with the phoneme component information requires 72 kbps + 16 kbps = 88 kbps, so the total bit rate is lower than with separate distribution.
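The bit-rate comparison above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the constant and function names are illustrative and do not appear in the specification, and the halving by lossless compression is the text's own estimate rather than a measured value.

```python
# Figures taken from the example in the text; names are illustrative only.
SAMPLE_RATE_HZ = 48_000      # sampling frequency of the monaural signal
HOP_SAMPLES = 5_120          # STFT time shift, described as "about 100 ms"
HOP_SECONDS = 0.1            # nominal interval between Hp(t) transmissions
NUM_PHONEME_BASES = 100      # columns of Us, i.e. elements of each Hp(t)
BYTES_PER_ELEMENT = 4        # single-precision floating-point number
AAC_KBPS = 72                # audio bit rate quoted for digital terrestrial broadcasting


def hp_bitrate_kbps(elements: int, bytes_per_element: int, interval_s: float) -> float:
    """Raw bit rate in kbps of sending one Hp(t) every interval_s seconds."""
    return elements * bytes_per_element * 8 / interval_s / 1000


raw_kbps = hp_bitrate_kbps(NUM_PHONEME_BASES, BYTES_PER_ELEMENT, HOP_SECONDS)
compressed_kbps = raw_kbps / 2             # the text's estimate: about half
separate_kbps = AAC_KBPS * 2               # speech and background as two streams
combined_kbps = AAC_KBPS + compressed_kbps # one stream plus Hp

# The exact duration of 5120 samples at 48 kHz, which the text rounds to 100 ms.
exact_hop_ms = HOP_SAMPLES / SAMPLE_RATE_HZ * 1000

print(f"Hp raw: {raw_kbps:.0f} kbps, compressed: {compressed_kbps:.0f} kbps")
print(f"separate: {separate_kbps} kbps, combined: {combined_kbps:.0f} kbps")
print(f"exact hop: {exact_hop_ms:.1f} ms")
```

Note that 5120 samples at 48 kHz is slightly longer than 100 ms, so the 32 kbps figure, which uses the nominal 0.1 s interval, is a small overestimate.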

Furthermore, the phoneme component information calculation unit 124 refers to the phoneme information Us and decomposes the first non-negative matrix Xp, which is the short-time Fourier-transformed audio signal, into the second non-negative matrix Us, which is the phoneme information, the third non-negative matrix Hp, which is the phoneme component information, and the fourth non-negative matrix Hn and the fifth non-negative matrix Un, which relate to the components other than phonemes in the sound represented by the audio signal P.
As a result, the phoneme component information distribution device 12 can calculate the phoneme component information Hp used to separate the speech sound and the background sound contained in the audio signal P.
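One way to picture this decomposition is as a non-negative matrix factorization in which Us is held fixed and the other three matrices are estimated. The sketch below assumes the model Xp ≈ Us·Hp + Un·Hn and uses the standard multiplicative updates for squared error; both the model form and the update rule are assumptions made for illustration, not the procedure confirmed by this section. All function names are hypothetical, and the matrices are tiny random toys rather than spectrograms.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale_update(m, numer, denom, eps=1e-9):
    """Multiplicative update m * numer / denom, element by element."""
    return [[v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numer, denom)]


def squared_error(x, v):
    return sum((a - b) ** 2 for rx, rv in zip(x, v) for a, b in zip(rx, rv))


def random_matrix(rows, cols, rng):
    """Strictly positive random matrix (keeps the updates well defined)."""
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]


def decompose(xp, us, num_noise_bases, iterations=50, seed=0):
    """Estimate Hp, Un, Hn with Us fixed; returns (hp, un, hn, errors)."""
    rng = random.Random(seed)
    frames = len(xp[0])
    hp = random_matrix(len(us[0]), frames, rng)
    un = random_matrix(len(xp), num_noise_bases, rng)
    hn = random_matrix(num_noise_bases, frames, rng)
    errors = []
    for _ in range(iterations):
        v = add(matmul(us, hp), matmul(un, hn))
        errors.append(squared_error(xp, v))
        # Update both activation matrices against the same model V.
        hp = scale_update(hp, matmul(transpose(us), xp), matmul(transpose(us), v))
        hn = scale_update(hn, matmul(transpose(un), xp), matmul(transpose(un), v))
        # Refresh the model, then update only the non-phoneme basis Un.
        v = add(matmul(us, hp), matmul(un, hn))
        un = scale_update(un, matmul(xp, transpose(hn)), matmul(v, transpose(hn)))
    errors.append(squared_error(xp, add(matmul(us, hp), matmul(un, hn))))
    return hp, un, hn, errors


# Toy data: 6 frequency bins, 8 frames, 3 phoneme bases, 2 non-phoneme bases.
toy_rng = random.Random(1)
us = random_matrix(6, 3, toy_rng)
xp = add(matmul(us, random_matrix(3, 8, toy_rng)), random_matrix(6, 8, toy_rng))
hp, un, hn, errors = decompose(xp, us, num_noise_bases=2)
print(f"squared error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

In this arrangement only Hp would need to be sent to the terminal, since Us is already shared and Un and Hn describe the non-phoneme remainder.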

In this way, the terminal device 31 includes a phoneme information storage unit 302 that stores the phoneme information Us representing the phonemes constituting the speech sound uttered by a person, a phoneme component information acquisition unit 303 that acquires the phoneme component information Hp representing the phoneme components contained in the audio signal P, a separation unit 320 that refers to the phoneme information Us and the phoneme component information Hp to separate the speech sound and the background sound contained in the audio signal P, and a speech sound / background sound mixing unit 308 that adjusts the ratio between the speech sound and the background sound separated by the separation unit 320 and mixes them.
This makes it possible to suppress the load that separating the background sound from the speech sound places on the terminal device 31.
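The terminal-side processing can be sketched on magnitude spectrograms as follows. This is an illustration under stated assumptions: the speech estimate is taken to be the product Us·Hp, the background is obtained by subtracting that estimate from Xp, the result is clipped at zero (an assumption added here to keep magnitudes non-negative), and the two gains stand in for the user-adjustable ratio. The function names and gain parameters are hypothetical, and phase handling and the inverse short-time Fourier transform are omitted.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def separate_and_remix(xp, us, hp, speech_gain, background_gain):
    """Return (speech, background, mixed) magnitude spectrograms."""
    # Speech estimate from the stored Us and the received Hp.
    speech = matmul(us, hp)
    # Background = mixture minus speech estimate.
    # Assumption: clip at zero so the magnitudes stay non-negative.
    background = [[max(x - s, 0.0) for x, s in zip(rx, rs)]
                  for rx, rs in zip(xp, speech)]
    # Re-mix with the requested ratio between speech and background.
    mixed = [[speech_gain * s + background_gain * b for s, b in zip(rs, rb)]
             for rs, rb in zip(speech, background)]
    return speech, background, mixed


# Toy example: 2 frequency bins, 2 frames, 2 phoneme bases.
us = [[1.0, 0.0], [0.0, 2.0]]
hp = [[0.5, 1.0], [1.0, 0.0]]
xp = [[1.5, 1.0], [2.5, 1.0]]

# Keep the speech at full level and halve the background.
speech, background, mixed = separate_and_remix(xp, us, hp, 1.0, 0.5)
print(speech)      # [[0.5, 1.0], [2.0, 0.0]]
print(background)  # [[1.0, 0.0], [0.5, 1.0]]
print(mixed)       # [[1.0, 1.0], [2.25, 0.5]]
```

The terminal performs only one matrix product and element-wise operations here, which is consistent with the stated aim of keeping the separation load on the terminal low.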

Further, a program for realizing the functions of the phoneme information distribution device 11, the phoneme component information distribution device 12, the audio signal distribution device 13, and the terminal device 31 in FIG. 1 may be recorded on a computer-readable recording medium, and each device may be realized by causing a computer system to read and execute the program recorded on the recording medium. Here, the "computer system" includes an OS and hardware such as peripheral devices.

Further, the "computer system" also includes a homepage providing environment (or display environment) if a WWW system is used.
The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes media that hold a program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and media that hold the program for a certain period of time, such as the volatile memory inside a computer system serving as the server or client in that case. The program may realize only a part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes design changes and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
11 ... Phoneme information distribution device
12 ... Phoneme component information distribution device
13 ... Audio signal distribution device
21 ... Network
31 ... Terminal device
111 ... Audio signal acquisition unit
112 ... Short-time Fourier transform
113 ... Phoneme information generation unit
121 ... Phoneme information storage unit
122 ... Phoneme component information acquisition unit
123 ... Short-time Fourier transform unit
124 ... Phoneme component information calculation unit
125 ... Phoneme component information distribution unit
301 ... Phoneme information acquisition unit
302 ... Phoneme information storage unit
303 ... Phoneme component information acquisition unit
304 ... Speech sound generation unit
305 ... Audio signal acquisition unit
306 ... Short-time Fourier transform unit
307 ... Background sound separation unit
308 ... Speech sound / background sound mixing unit
309 ... Inverse short-time Fourier transform unit
310 ... Audio output unit

Claims

An information processing apparatus comprising:
an audio signal acquisition unit that acquires an audio signal in which a speech sound uttered by a person and a background sound are mixed;
a phoneme component information calculation unit that calculates phoneme component information representing the phoneme components included in the audio signal with reference to phoneme information representing the phonemes constituting the speech sound; and
a notification unit that notifies a device that adjusts the ratio between the speech sound and the background sound in the audio signal of the phoneme component information.

The information processing apparatus according to claim 1, wherein the phoneme component information calculation unit refers to the phoneme information and decomposes a first non-negative matrix, which is the short-time Fourier-transformed audio signal, into a second non-negative matrix that is the phoneme information, a third non-negative matrix that is the phoneme component information, and a fourth non-negative matrix and a fifth non-negative matrix that relate to components other than the phonemes in the sound represented by the audio signal.

A terminal device comprising:
a phoneme information storage unit that stores phoneme information representing the phonemes constituting a speech sound uttered by a person;
an audio signal acquisition unit that acquires an audio signal in which a speech sound and a background sound are mixed;
a phoneme component information acquisition unit that acquires phoneme component information representing the phoneme components included in the audio signal;
a separation unit that separates the speech sound and the background sound included in the audio signal with reference to the phoneme information and the phoneme component information; and
a mixing unit that adjusts the ratio between the speech sound and the background sound separated by the separation unit and mixes them.

The terminal device according to claim 3, wherein the separation unit comprises:
a speech sound generation unit that generates information representing the speech sound included in the audio signal with reference to the phoneme information and the phoneme component information; and
a background sound separation unit that subtracts the sound represented by the information generated by the speech sound generation unit from the audio signal to generate information representing the background sound.

A program for causing a computer to function as:
a phoneme information storage unit that stores phoneme information representing the phonemes constituting a speech sound uttered by a person;
an audio signal acquisition unit that acquires an audio signal in which a speech sound and a background sound are mixed;
a phoneme component information acquisition unit that acquires phoneme component information representing the phoneme components included in the audio signal;
a separation unit that separates the speech sound and the background sound included in the audio signal with reference to the phoneme information and the phoneme component information; and
a mixing unit that adjusts the ratio between the speech sound and the background sound separated by the separation unit and mixes them.