WO2010131318A1 - Video-sound output device and method for localizing sound

Info

Publication number: WO2010131318A1
Application number: PCT/JP2009/058744
Authority: WO
Inventors: 洋人河内; 和実菅谷; 禎司鈴木
Original assignee: パイオニア株式会社
Priority date: 2009-05-11
Filing date: 2009-05-11
Publication date: 2010-11-18

Abstract

A video-sound output device (1) is provided with a video analysis unit (11) that analyzes a video image to specify the position of a speaker, a sound separation unit (12) that separates a mixed sound, in which a speaker sound and a background sound are mixed, into the speaker sound and the background sound, and a localization unit (13) that localizes the speaker sound separated by the sound separation unit (12) to the speaker position specified by the video analysis unit (11).

Description

Video / audio output device and audio localization method

The present invention relates to an audio localization technology for a video / audio output device that outputs content data including video and audio, and more particularly, to an audio localization technology for performing audio localization according to a speaker position.

When program content such as a TV broadcast is received, video is displayed on a display, and sound is output from loudspeakers, monaural sound makes a human voice heard from the position of the loudspeaker. With stereo or surround sound, a human voice is in many cases localized at the center of the screen so that it is heard from the center of the screen.

However, it is generally known that the sense of presence increases when a person's voice is localized at the speaker position on the display, and audio localization techniques for localizing audio in this way have been disclosed.

For example, in Patent Document 1, the position of a speaker is detected, and the volume of sound output from a plurality of loudspeakers is controlled according to the detected position. In Patent Document 2, the position of a speaker is specified, effects and volume adjustment are applied according to the specified position, and audio data is output from the optimal loudspeaker.

Patent Document 1: JP-A-11-313272
Patent Document 2: JP 2007-110582 A

However, in Patent Document 1 described above, the sound is localized at the speaker position without considering the contents of the scene, so depending on the scene the viewer may feel stress instead of an enhanced sense of reality. For example, in a scene where background sounds such as sound effects and BGM are playing, those background sounds are also output from the speaker position, which stresses the viewer. In Patent Document 2, the speaker voice can be localized at the speaker position when the speaker voice and the background sound are recorded in different audio channels. When the speaker voice and the background sound are recorded in the same audio channel, however, the speaker voice alone cannot be localized at the speaker position, which causes a sense of incongruity.

The present invention has been made in view of the above circumstances, and an example of the problem to be solved is to provide a sound localization technique that does not cause a sense of incongruity even when the speaker voice and the background sound are recorded in the same audio channel.

In order to achieve the above object, a video / audio output device according to an aspect of the present invention includes speaker position specifying means for analyzing video and specifying a speaker's position, voice separation means for separating a mixed voice, in which the speaker's voice and a background sound are mixed, into the speaker's voice and the background sound, and voice localization means for localizing the speaker's voice separated by the voice separation means at the speaker's position specified by the speaker position specifying means.

In addition, a sound localization method according to one aspect of the present invention includes a speaker position specifying step of analyzing video to specify a speaker's position, a voice separation step of separating a mixed voice, in which the speaker's voice and a background sound are mixed, into the speaker's voice and the background sound, and a sound localization step of localizing the speaker's voice separated in the voice separation step at the speaker's position specified in the speaker position specifying step.

FIG. 1 is a schematic configuration diagram of a video / audio output device according to a first embodiment of the present invention.
FIG. 2 is an example of an image displayed by the video / audio output device according to the first embodiment of the present invention.
FIG. 3 is an example of the frequency parameter of the video / audio output device according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the video / audio output process of the video / audio output device according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing in detail the flow of the voice separation process of step S4 of FIG. 4.
FIG. 6 is a flowchart showing in detail the flow of the voice localization process of step S6 of FIG. 4.
FIG. 7 is a flowchart showing in detail the flow of the audio output process of step S10 of FIG. 4.
FIG. 8 is a schematic configuration diagram of a modification of the video / audio output device according to the first embodiment of the present invention.
FIG. 9 is a flowchart showing the flow of the audio localization process of the modification of the video / audio output device according to the first embodiment of the present invention.
FIG. 10 is a schematic configuration diagram of a video / audio output device according to a second embodiment of the present invention.
FIG. 11 is a flowchart showing the flow of the voice separation process of the video / audio output device according to the second embodiment of the present invention.
FIG. 12 is a schematic configuration diagram of a video / audio output device according to a third embodiment of the present invention.
FIG. 13 is a diagram showing the data structure of the feature data of the video / audio output device according to the third embodiment of the present invention.
FIG. 14 is a flowchart showing the flow of the voice separation process of the video / audio output device according to the third embodiment of the present invention.
FIG. 15 is a diagram showing the data structure of the feature data of the video / audio output device according to a fourth embodiment of the present invention.
FIG. 16 is a flowchart showing the flow of the voice separation process of the video / audio output device according to the fourth embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

<First Embodiment>
FIG. 1 is a schematic configuration diagram of a video / audio output device 1 according to the first embodiment of the present invention. The video / audio output device 1 is a device that outputs audio with the audio localized in accordance with the speaker position. In this embodiment, the speaker voice and the background sound are recorded on the same channel, so the device separates them and changes the localization (panning) of only the separated speaker voice to match the speaker position.

In the following, “speaker” refers to a person speaking in the video data (on the screen), and “speaker voice” refers to the voice of that person. “Background sound” means any sound other than the speaker voice, specifically BGM, environmental sound, noise, voices of people other than the speaker, and the like. “Speaker position” refers to the position of the speaker on the screen, or more precisely, a position near the speaker's face (especially the mouth). “Outputting the voice with the voice localized in accordance with the speaker position” means outputting the voice so that it is heard from the position of the speaker. For example, as shown in FIG. 2, when the speaker A is present on the left side of the screen d10, the volume of the speaker voice output from the loudspeaker SP1 provided on the left side of the screen d10 is increased and the volume of the speaker voice output from the loudspeaker SP2 provided on the right side of the screen d10 is decreased, so that the voice is heard from the position of the speaker on the left side of the screen.

Here, the video / audio output device 1 may be any device as long as it has a function of reproducing content data including video and audio input from the outside and outputting the content data to the outside. A television (TV), a DVD player and recorder, a BD player and recorder, a personal computer (PC), and the like are assumed.

Specifically, the video / audio output device 1 includes a video analysis unit 11, an audio separation unit 12, a localization processing unit 13, a video display unit 14, and an audio output unit 15.

The video analysis unit 11 outputs the input video data to the video display unit 14 (delaying the video data as necessary to synchronize it with the audio data) and specifies the speaker position from the input video data. The speaker position is specified using a known technique. For example, a speaker may be specified by detecting the region of a human face in the video data and detecting mouth movement within that face. To detect mouth movement, the video data of several preceding and following frames is used to calculate a difference, such as the difference in brightness of the mouth region, as a feature amount. If the person whose mouth region has the largest feature amount is determined to be the speaker, the speaker can be specified even when a plurality of faces are detected.

Further, the video analysis unit 11 outputs the specified speaker position to the localization processing unit 13.
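The mouth-movement criterion described above can be illustrated with a minimal Python sketch. It assumes that face detection has already produced, for each face, the mean brightness of the mouth region over several consecutive frames. The function names, the sample values, and the `min_feature` threshold (used here to report that nobody is speaking) are hypothetical and do not come from the embodiment.

```python
from typing import Dict, List, Optional


def mouth_motion_feature(brightness: List[float]) -> float:
    """Sum of absolute frame-to-frame brightness differences of one mouth region."""
    return sum(abs(b - a) for a, b in zip(brightness, brightness[1:]))


def pick_speaker(mouth_brightness: Dict[str, List[float]],
                 min_feature: float = 1.0) -> Optional[str]:
    """Return the face ID with the largest feature amount.

    min_feature is an assumed threshold: if no mouth region changes more than
    this, None is returned to indicate that no speaker was found.
    """
    best_id, best_value = None, min_feature
    for face_id, series in mouth_brightness.items():
        value = mouth_motion_feature(series)
        if value > best_value:
            best_id, best_value = face_id, value
    return best_id


# Hypothetical per-frame mouth brightness for two detected faces.
faces = {
    "A": [100.0, 112.0, 95.0, 110.0, 98.0],   # mouth is moving
    "B": [101.0, 100.0, 101.0, 100.5, 101.0],  # mouth is nearly still
}
speaker = pick_speaker(faces)
```

With these sample values, face "A" has the larger feature amount and is selected as the speaker.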

The voice separation unit 12 separates the input voice data (voice data in which the speaker voice and the background sound are mixed) into the speaker voice and the background sound based on the frequency parameter P1. The frequency parameter P1 is a parameter indicating a general frequency band of a human voice. In the present embodiment, as shown in FIG. 3, the lower limit frequency f1 is set to 80 Hz, and the upper limit frequency f2 is set to 3000 Hz. Specifically, the speaker voice can be obtained by passing the input voice data through a band-pass filter in which the frequency parameter P1 is set. That is, in the present embodiment, the voice in the frequency band between 80 Hz and 3000 Hz is separated as the speaker voice. The background sound can be obtained by passing the input audio data through a band rejection filter in which the frequency parameter P1 is set. That is, in the present embodiment, the voice in the frequency band of less than 80 Hz or more than 3000 Hz is separated as the background sound. Note that the frequency parameter P1 shown in FIG. 3 shows a preferable example value, and the frequency parameter P1 is not necessarily limited to the value shown in FIG.

The voice separation unit 12 outputs the separated speaker voice to the localization processing unit 13 and outputs the separated background sound to the voice output unit 15.
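The band split by the frequency parameter P1 can be sketched as follows. This is not the filter design of the embodiment: as a stand-in for the band-pass and band rejection filters, the sketch zeroes DFT bins (an ideal brick-wall mask), and it uses a naive DFT so that it depends only on the Python standard library. The function names and the test signal are hypothetical.

```python
import cmath
import math
from typing import List, Tuple


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2), adequate for short blocks)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, keeping the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_by_band(samples: List[float], fs: float,
                  f1: float = 80.0, f2: float = 3000.0
                  ) -> Tuple[List[float], List[float]]:
    """Split samples into (speaker voice, background sound) using P1 = (f1, f2)."""
    n = len(samples)
    voice_bins, background_bins = [], []
    for k, value in enumerate(dft(samples)):
        freq = min(k, n - k) * fs / n      # absolute frequency of bin k
        in_band = f1 <= freq <= f2         # pass band of the band-pass filter
        voice_bins.append(value if in_band else 0)
        background_bins.append(0 if in_band else value)
    return idft(voice_bins), idft(background_bins)


# Hypothetical input: a 500 Hz tone (inside the band) plus a 5000 Hz tone (outside).
fs, n = 16000, 160
low = [math.sin(2 * math.pi * 500 * t / fs) for t in range(n)]
high = [math.sin(2 * math.pi * 5000 * t / fs) for t in range(n)]
voice, background = split_by_band([a + b for a, b in zip(low, high)], fs)
```

Because the two masks are complementary, the separated speaker voice and background sound add back up to the input signal.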

The localization processing unit 13 performs localization change processing on the speaker voice output from the voice separation unit 12, based on the speaker position output from the video analysis unit 11. That is, the volume is adjusted so that the speaker voice is localized at the speaker position on the screen. For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the volume of the speaker voice output from the loudspeaker SP1 provided on the left side of the screen is increased, and the volume of the speaker voice output from the loudspeaker SP2 provided on the right side of the screen is decreased. Specifically, when the speaker A is at a position on the screen where the horizontal ratio is C: D, the volume ratio of the speaker voice between the loudspeakers SP1 and SP2 may be set to D: C.

Note that the playback environment may be taken into account when changing the localization of the speaker voice. The playback environment is, for example, the size of the display screen or the positions of the loudspeakers. For example, when loudspeakers are provided above, below, to the left of, and to the right of the screen, not only the volume of the left and right loudspeakers as shown in FIG. 2 but also the volume of the upper and lower loudspeakers may be adjusted.
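The D: C volume rule can be written as a short Python sketch. It assumes that C is the distance of the speaker from the left edge of the screen and D the distance from the right edge, and it normalizes the two gains so that they sum to 1. Both the normalization and the function name are assumptions made for illustration.

```python
from typing import Tuple


def voice_gains(c: float, d: float) -> Tuple[float, float]:
    """Return (gain for SP1 on the left, gain for SP2 on the right).

    The speaker stands at horizontal ratio c:d on the screen, so the
    volume ratio SP1:SP2 is set to d:c.
    """
    if c < 0 or d < 0 or c + d == 0:
        raise ValueError("c and d must be non-negative and not both zero")
    total = c + d
    return d / total, c / total


# Speaker A a quarter of the way in from the left edge (C:D = 1:3).
gain_sp1, gain_sp2 = voice_gains(1, 3)
```

For C: D = 1: 3 the sketch yields gains of 0.75 for SP1 and 0.25 for SP2, so the voice is pulled toward the left loudspeaker.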

Also, the localization processing unit 13 outputs the speaker voice subjected to the localization change process to the voice output unit 15.

The video display unit 14 displays the video data output from the video analysis unit 11 on a display or the like.

The voice output unit 15 mixes the speaker voice subjected to the localization change process and the background sound output from the voice separation unit 12 and outputs the mixed sound to the speaker.

Next, the video / audio output processing of the video / audio output device 1 of the present embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of the video / audio output process of the video / audio output device 1.

First, the video analysis unit 11 of the video / audio output device 1 analyzes the input video data and performs video analysis processing for specifying the speaker position on the screen (step S2).

Next, the audio separation unit 12 of the video / audio output device 1 performs an audio separation process for separating the input audio data into the speaker sound and the background sound (step S4).

Here, the voice separation process will be described in detail with reference to FIG. 5. FIG. 5 is a flowchart showing in detail the voice separation process in step S4 of FIG. 4.

The voice separation unit 12 sets the lower limit frequency f1 and the upper limit frequency f2 of the frequency parameter P1 (step S12). Specifically, f1 = 80 Hz and f2 = 3000 Hz. The value of the frequency parameter P1 may be a fixed value held in advance by the video / audio output device 1, or a variable value that the user can instruct the video / audio output device 1 to set.

Next, the voice separation unit 12 passes the input voice data through a band-pass filter in which the frequency parameter P1 is set, and separates the speaker voice (step S14).

Next, the voice separation unit 12 passes the input voice data through a band rejection filter in which the frequency parameter P1 is set, and separates the background sound (step S16).

Referring back to FIG. 4, next, the localization processing unit 13 of the video / audio output device 1 performs audio localization processing for changing the localization of the separated speaker voice to the specified speaker position (step S6).

Here, the audio localization process will be described in detail with reference to FIG. 6. FIG. 6 is a flowchart showing in detail the audio localization process in step S6 of FIG. 4.

The localization processing unit 13 determines whether or not there is a speaker on the screen (step S22). Whether or not there is a speaker in the screen is determined based on the presence or absence of the speaker position analyzed by the video analysis unit 11. When the speaker position exists, it is determined that there is a speaker in the screen.

If there is a speaker on the screen (step S22: YES), the output value of the speaker voice from the loudspeaker near the speaker position is raised (step S24), and the output value of the speaker voice from the loudspeaker far from the speaker position is lowered (step S26). For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the volume of the loudspeaker SP1 is increased and the volume of the loudspeaker SP2 is decreased.

If there is no speaker on the screen (step S22: NO), the process of changing the sound localization is not performed.

Returning to FIG. 4, next, the video display unit 14 of the video / audio output device 1 performs video display processing for displaying the video data on a display or the like (step S8), and the audio output unit 15 performs audio output processing for outputting the audio data from the loudspeakers (step S10).

Here, the audio output processing will be described in detail with reference to FIG. 7. FIG. 7 is a flowchart showing in detail the audio output process in step S10 of FIG. 4.

The audio output unit 15 mixes the speaker voice subjected to the localization change process with the background sound (step S32), and outputs the mixed audio data from the loudspeakers (step S34). For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the speaker voice is output with the volume of the loudspeaker SP1 raised and the volume of the loudspeaker SP2 lowered, while the background sound is output from the loudspeakers SP1 and SP2 at the same volume. As a result, the speaker voice is heard from the position of the person, and the background sound keeps the localization of the input audio.

As described above, according to the video / audio output device 1 of the present embodiment, even if the video content has the speaker voice and the background sound recorded in the same audio channel, only the speaker voice can be re-localized to the speaker position, so the viewer does not feel a sense of incongruity and more natural and realistic viewing is possible.
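Steps S32 and S34, together with the branch of step S22, can be sketched in Python as below. The sketch assumes a two-loudspeaker setup, a horizontal speaker position normalized to the range 0.0 (left edge) to 1.0 (right edge), and a fixed gain of 0.5 for the background sound on both loudspeakers. These conventions and the function name are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def mix_output(voice: List[float], background: List[float],
               position: Optional[float]) -> Tuple[List[float], List[float]]:
    """Mix localized speaker voice with background sound for SP1 (left) and SP2 (right).

    position is the assumed normalized horizontal speaker position,
    or None when no speaker is on the screen.
    """
    if position is None:
        g1 = g2 = 0.5                       # step S22: NO, localization unchanged
    else:
        g1, g2 = 1.0 - position, position   # nearer loudspeaker gets the larger gain
    sp1 = [g1 * v + 0.5 * b for v, b in zip(voice, background)]
    sp2 = [g2 * v + 0.5 * b for v, b in zip(voice, background)]
    return sp1, sp2


# Speaker A on the left side of the screen (position 0.25).
sp1, sp2 = mix_output([1.0, 2.0], [4.0, 4.0], 0.25)
```

In this example the speaker voice is three times louder in SP1 than in SP2, while the background sound contributes equally to both outputs.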

Note that the voice separation unit 12 according to the present embodiment separates the speaker voice and the background sound using the frequency parameter P1, which indicates the general frequency band of the human voice. The separation method is not limited to this, and other methods may be used. For example, in the case of stereo audio, the left and right audio may be converted into the frequency domain, and the speaker voice may be separated by comparing the spectral power of the left and right audio in the frequency domain. With this method, because the speaker voice in stereo audio is localized at the center, the speaker voice can be separated by treating the frequency bands in which the difference in spectral power between left and right is small as the frequency band of the speaker voice.

In the present embodiment, when there is a background sound whose frequency band overlaps that of the human voice (background sound of 80 to 3000 Hz; hereinafter referred to as background sound 1), that background sound is mixed into the separated speaker voice, and its localization is changed to the position of the person together with the speaker voice. In such a case, the localization of the separated background sound (background sound of less than 80 Hz or more than 3000 Hz; hereinafter referred to as background sound 2) may be changed so as to cancel the localization of the background sound mixed into the speaker voice.

For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the speaker voice is output with the volume of the loudspeaker SP1 raised and the volume of the loudspeaker SP2 lowered. Since the speaker voice includes the background sound 1, the background sound 1 is also heard from the position of the person, that is, from the left side of the screen. Therefore, the background sound 2 is output with the volume of the loudspeaker SP1 lowered and the volume of the loudspeaker SP2 raised so that it is heard from the right side of the screen. As a result, the entire background sound, consisting of the background sound 1 and the background sound 2, keeps the localization of the input audio instead of being localized at the position of the person.
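The stereo alternative mentioned above can be sketched in Python as follows. The sketch keeps the DFT bins whose left and right spectral power differ by at most a relative tolerance and treats them as the speaker voice. The tolerance, the power floor that ignores near-silent bins, the naive DFT, and all names are assumptions chosen for illustration; the embodiment does not specify them.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2), adequate for short blocks)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, keeping the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def extract_center_voice(left: List[float], right: List[float],
                         tolerance: float = 0.1,
                         floor: float = 1e-9) -> List[float]:
    """Keep bins whose left/right spectral power is nearly equal (center-localized)."""
    voice_bins = []
    for lk, rk in zip(dft(left), dft(right)):
        pl, pr = abs(lk) ** 2, abs(rk) ** 2
        strongest = max(pl, pr)
        # floor skips bins that carry only numerical noise (assumed value)
        balanced = strongest > floor and abs(pl - pr) <= tolerance * strongest
        voice_bins.append((lk + rk) / 2 if balanced else 0)
    return idft(voice_bins)


# Hypothetical stereo input: a centered 500 Hz "voice",
# a 2000 Hz tone only on the left, and a 4000 Hz tone only on the right.
fs, n = 16000, 160
center = [math.sin(2 * math.pi * 500 * t / fs) for t in range(n)]
left = [c + math.sin(2 * math.pi * 2000 * t / fs) for t, c in enumerate(center)]
right = [c + math.sin(2 * math.pi * 4000 * t / fs) for t, c in enumerate(center)]
voice = extract_center_voice(left, right)
```

In this test signal only the 500 Hz component has equal power in both channels, so it is the only component that survives in the extracted voice.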

FIG. 8 is a schematic configuration diagram of the video / audio output device 1A in consideration of the background sound included in the speaker voice. The video / audio output device 1A has substantially the same configuration as the video / audio output device 1, but only the function of the localization processing unit 13A is different.

Based on the speaker position output from the video analysis unit 11, the localization processing unit 13A performs a localization change process on the separated speaker voice (including background sound 1) output from the voice separation unit 12. That is, the volume is adjusted so that the separated speaker voice (including background sound 1) is localized at the speaker position on the screen.

In addition, the localization processing unit 13A performs a localization changing process for the separated background sound (background sound 2) output from the voice separation unit 12. That is, the volume is adjusted so that the separated background sound (background sound 2) is localized in the direction opposite to the speaker position on the screen (with respect to the center on the screen).

As a result, the localization processing unit 13A outputs the speaker voice subjected to the localization change process and the background sound subjected to the localization change process to the voice output unit 15.

FIG. 9 is a flowchart showing in detail the flow of audio localization processing of the video / audio output device 1A.

The localization processing unit 13A determines whether or not there is a speaker on the screen (step S42).

When there is a speaker on the screen (step S42: YES), the localization processing unit 13A raises the output value of the speaker voice (including background sound 1) from the loudspeaker close to the speaker position (step S44), and lowers the output value of the speaker voice (including background sound 1) from the loudspeaker far from the speaker position (step S46). For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the volume of the loudspeaker SP1 for the speaker voice (including background sound 1) is increased and the volume of the loudspeaker SP2 is decreased.

Next, the localization processing unit 13A lowers the output value of the background sound 2 from the loudspeaker close to the speaker position (step S48), and raises the output value of the background sound 2 from the loudspeaker far from the speaker position (step S50). For example, as shown in FIG. 2, when the speaker A is on the left side of the screen, the volume of the loudspeaker SP1 for the background sound 2 is lowered and the volume of the loudspeaker SP2 is raised.

On the other hand, when there is no speaker on the screen (step S42: NO), the process of changing the sound localization is not performed.
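The flow of steps S42 to S50 can be summarized in a small Python sketch. It assumes two loudspeakers and a horizontal speaker position normalized to 0.0 (left edge) to 1.0 (right edge), and it mirrors the voice gains exactly to obtain the gains of background sound 2. The modification does not state how strongly background sound 2 should be shifted, so the exact mirroring and the names used here are assumptions.

```python
from typing import Dict, Optional, Tuple

Gains = Tuple[float, float]  # (gain for SP1 on the left, gain for SP2 on the right)


def localization_gains(position: Optional[float]) -> Optional[Dict[str, Gains]]:
    """Gains for the speaker voice (with background sound 1) and for background sound 2.

    Returns None when no speaker is on the screen (step S42: NO),
    meaning the localization is left unchanged.
    """
    if position is None:
        return None
    near_left = 1.0 - position
    return {
        "voice": (near_left, position),        # steps S44 and S46
        "background2": (position, near_left),  # steps S48 and S50, mirrored
    }


# Speaker A on the left side of the screen (position 0.25).
gains = localization_gains(0.25)
```

With the speaker on the left, the voice is weighted toward SP1 and background sound 2 toward SP2, so the background sound as a whole is not pulled to the speaker position.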

As described above, according to the video / audio output device 1A of the present modification, the localization of the speaker voice can be changed to the speaker position. Further, according to the video / audio output device 1A, even if the frequency bands of the speaker voice and the background sound overlap and background sound is mixed into the speaker voice, the background sound separated from the speaker voice is localized on the side opposite to the speaker position with respect to the center of the screen, so the entire background sound can keep the localization of the input audio. As a result, the viewer does not feel a sense of incongruity, and more natural and realistic viewing is possible.

<Second Embodiment>
FIG. 10 is a schematic configuration diagram of the video / audio output device 2 according to the second embodiment of the present invention. The video / audio output device 2 is a device that outputs audio with audio localization in accordance with the speaker position. In the following description, only configurations, functions, and processes different from those of the first embodiment will be described, and with regard to other configurations, functions, and processes, the same portions are denoted by the same reference numerals and description thereof is omitted.

The video analysis unit 21 outputs the input video data to the video display unit 14 (delaying the video data as necessary to synchronize it with the audio data) and specifies the speaker position from the input video data.

The video analysis unit 21 also extracts the speaker's facial features and analyzes the speaker's attributes from the extracted facial features. Facial features are extracted using a known technique. In this embodiment, a method that detects the feature amounts of the major parts of the face is adopted, and feature amounts indicating the positional relationship among the face contour, both eyebrows, both eyes, the nose, the mouth, and the like are extracted as the facial features. The attributes of the speaker are age and sex. In the present embodiment, it is determined whether the speaker is a man or a woman and whether or not the speaker is a child.

Also, the video analysis unit 21 outputs the identified speaker position to the localization processing unit 13 and outputs the analyzed attribute to the voice separation unit 22.

The attribute database (hereinafter referred to as attribute DB) 23 is a database that stores data relating to facial features for each attribute (hereinafter referred to as facial feature data). In this embodiment, it holds male and female facial feature data and child facial feature data. The video analysis unit 21 analyzes the speaker's attributes by comparing the extracted facial features of the speaker with the facial feature data stored in the attribute DB 23.

In the present embodiment, the video / audio output device 2 includes the attribute DB 23. However, the video / audio output device 2 need not include the attribute DB 23. Instead, the video / audio output device 2 may access the attribute DB 23 via a communication network and refer to the facial feature data stored in it.

The voice separation unit 22 separates input voice data (voice data in which speaker voice and background sound are mixed) into speaker voice and background sound according to a frequency parameter P2, which is the frequency parameter P1 corrected for the speaker's attributes. In the present embodiment, for example, in the case of a male, the frequency parameter P2 is obtained by lowering the upper limit frequency f2 by the correction value α, and in the case of a female, by raising the lower limit frequency f1 by the correction value β. Since men are generally considered to have lower voices than women, the difference between men and women is reflected in the frequency parameter P2. In the case of a child, in addition to the male/female correction, the lower limit frequency f1 is further raised by a correction value γ. Since children are generally considered to have higher voices than adults, the age difference is also reflected in the frequency parameter P2. This correction method is one suitable example, and the correction is not limited to it.

Also, the voice separation unit 22 outputs the separated speaker voice to the localization processing unit 13 and outputs the separated background sound to the voice output unit 15.

FIG. 11 is a flowchart showing in detail the flow of the audio separation process of the video / audio output apparatus 2 according to the present embodiment.

The voice separation unit 22 sets the lower limit frequency f1 and the upper limit frequency f2 of the frequency parameter P2 (step S52). Specifically, the value of the frequency parameter P1 is set as it is, and f1 = 80 Hz and f2 = 3000 Hz.

Next, the voice separation unit 22 determines whether or not the speaker is a male (step S54). If the speaker is a male (step S54: YES), the voice separation unit 22 corrects the upper limit frequency f2 of the frequency parameter P2 (step S56). If the speaker is a female (step S54: NO), the voice separation unit 22 corrects the lower limit frequency f1 of the frequency parameter P2 (step S58). Specifically, when the speaker is a male, the correction value α is subtracted from the upper limit frequency f2 of the frequency parameter P2, and when the speaker is a female, the correction value β is added to the lower limit frequency f1 of the frequency parameter P2.

Next, the voice separation unit 22 determines whether or not the speaker is a child (step S60). When the speaker is a child (step S60: YES), the voice separation unit 22 further corrects the lower limit frequency f1 of the frequency parameter P2 (step S62). Specifically, when the speaker is a child, the correction value γ is added to the lower limit frequency f1 of the frequency parameter P2.

Next, the voice separation unit 22 passes the input voice data through a band-pass filter in which the frequency parameter P2 is set, and separates the speaker voice (step S64).

Next, the voice separation unit 22 passes the inputted voice data through a band rejection filter in which the frequency parameter P2 is set, and separates the background sound (step S66).
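The parameter correction of steps S52 to S62 can be sketched in Python as follows. The embodiment gives no numerical values for the correction values α, β, and γ, so the defaults below are hypothetical placeholders chosen only to make the example concrete, and the function name is likewise an assumption.

```python
from typing import Tuple

# Hypothetical correction values in Hz; the embodiment does not specify them.
ALPHA_HZ = 500.0   # subtracted from f2 for a male speaker
BETA_HZ = 80.0     # added to f1 for a female speaker
GAMMA_HZ = 100.0   # additionally added to f1 for a child


def frequency_parameter_p2(is_male: bool, is_child: bool,
                           f1: float = 80.0, f2: float = 3000.0,
                           alpha: float = ALPHA_HZ,
                           beta: float = BETA_HZ,
                           gamma: float = GAMMA_HZ) -> Tuple[float, float]:
    """Derive P2 = (f1, f2) from P1 according to the speaker's attributes."""
    if is_male:
        f2 -= alpha   # step S56: lower the upper limit frequency
    else:
        f1 += beta    # step S58: raise the lower limit frequency
    if is_child:
        f1 += gamma   # step S62: raise the lower limit frequency further
    return f1, f2


adult_male = frequency_parameter_p2(is_male=True, is_child=False)
girl = frequency_parameter_p2(is_male=False, is_child=True)
```

The resulting pair (f1, f2) is then used in place of P1 for the band-pass and band rejection filters of steps S64 and S66.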

As described above, according to the video / audio output device 2 of the present embodiment, the speaker's attributes are analyzed and the voice is separated based on the frequency parameter P2, which reflects the voice differences due to those attributes. The accuracy of separating the speaker voice and the background sound can therefore be further increased.

As a result, viewers can enjoy a more natural and realistic viewing experience without feeling uncomfortable.

<Third Embodiment>
FIG. 12 is a schematic configuration diagram of a video / audio output device 3 according to the third embodiment of the present invention. The video / audio output device 3 is a device that outputs audio with audio localization in accordance with the speaker position.

The video analysis unit 31 outputs the input video data to the video display unit 14 (delaying the video data as necessary to synchronize it with the audio data) and specifies the speaker position from the input video data.

The video analysis unit 31 also extracts the speaker's facial features from the input video data. Facial features are extracted using a known technique. In the present embodiment, a method that detects the feature amounts of the major parts of the face is adopted, and the position coordinates of a plurality of facial feature points, such as the face outline, both eyebrows, both eyes, the nose, and the mouth, are extracted as the facial features.

Also, the video analysis unit 31 outputs the position of the identified speaker to the localization processing unit 13 and outputs the extracted facial features to the voice separation unit 32.

The feature DB 33 is a database that stores feature data in which face features and voice features are associated with each other. FIG. 13 shows the data structure of feature data. As shown in FIG. 13, the feature data includes coordinates of a plurality of facial features (specifically, the positions of eyes, nose, and mouth) and voice features (specifically, a lower limit frequency f1 and an upper limit frequency f2). Is shown). The feature data is composed of data such as actors appearing in moving image contents such as TV and movies. The voice separation unit 32 described later compares the speaker's facial features extracted by the video analysis unit 31 with the feature data stored in the feature DB 33, and if there is matching feature data, Voice features corresponding to the facial features are acquired.

In the present embodiment, the video / audio output device 3 includes the feature DB 33. However, the video / audio output device 3 may not include the feature DB 33, and the video / audio output device 3 may be connected via a communication network. The feature DB 33 may be accessed and the data stored in the feature DB 33 may be referred to. Further, the feature data stored in the feature DB 33 may be updated as needed according to the latest video content.

The voice separation unit 32 separates the input voice data (voice data in which the speaker voice and the background sound are mixed) into the speaker voice and the background sound according to the frequency parameter P1 or the frequency parameter P3. Specifically, when the voice separation unit 32 can acquire a voice feature from the feature DB 33 based on the facial features output from the video analysis unit 31, it sets the acquired voice feature as the frequency parameter P3 and separates the voice data according to the set frequency parameter P3. On the other hand, when no voice feature can be acquired from the feature DB 33 based on the facial features output by the video analysis unit 31, the voice data is separated according to the frequency parameter P1. For example, when the voice feature of speaker A can be acquired from the feature data shown in FIG. 13, the lower limit frequency f1 = 3000 Hz and the upper limit frequency f2 = 5000 Hz are set as the frequency parameter P3, and the speaker voice and the background sound are separated accordingly.

Also, the voice separation unit 32 outputs the separated speaker voice to the localization processing unit 13 and outputs the separated background sound to the voice output unit 15.

FIG. 14 is a flowchart showing in detail the flow of audio separation processing of the video / audio output device 3 according to the present embodiment.

The voice separation unit 32 sets the lower limit frequency f1 and the upper limit frequency f2 of the frequency parameter P1 (step S72). Specifically, f1 = 80 Hz and f2 = 3000 Hz.

Next, the voice separation unit 32 determines whether or not the feature DB 33 contains feature data with matching facial features (step S74). If there is matching feature data in the feature DB 33 (step S74: YES), the voice separation unit 32 acquires the voice feature of the matched feature data and sets the lower limit frequency f1 and the upper limit frequency f2 of the frequency parameter P3 (step S76).

Next, the voice separation unit 32 applies a band-pass filter to the input voice data to separate the speaker voice (step S78). The filter is based on the frequency parameter P3 when the frequency parameter P3 is set, and on the frequency parameter P1 when it is not.

Next, the voice separation unit 32 applies a band rejection filter to the input voice data to separate the background sound (step S80). The filter is based on the frequency parameter P3 when the frequency parameter P3 is set, and on the frequency parameter P1 when it is not.
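The flow of steps S72 to S80 can be sketched as follows. This is a minimal illustration rather than the filter design of the embodiment, which the source does not specify: it assumes an ideal (brick-wall) filter realized by zeroing DFT bins, and the names `dft`, `idft`, and `separate` are hypothetical. The naive DFT is only suitable for short test signals.

```python
import cmath
import math

DEFAULT_P1 = (80.0, 3000.0)  # f1, f2 of frequency parameter P1 (step S72)


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def separate(samples, sample_rate, p3=None):
    """Split samples into (speaker, background) using P3 if set, else P1."""
    f1, f2 = p3 if p3 is not None else DEFAULT_P1
    n = len(samples)
    speaker_spec, background_spec = [], []
    for k, value in enumerate(dft(samples)):
        # Frequency of bin k; bins above n/2 mirror the negative frequencies.
        freq = min(k, n - k) * sample_rate / n
        in_band = f1 <= freq <= f2
        speaker_spec.append(value if in_band else 0)      # band-pass (S78)
        background_spec.append(0 if in_band else value)   # band rejection (S80)
    return idft(speaker_spec), idft(background_spec)
```

Because the two assumed filters are exact complements, the two outputs of this sketch sum back to the input signal. A practical filter with a finite transition band would not have this property exactly.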

As described above, the video / audio output device 3 according to the present embodiment identifies an individual from the speaker's facial features and separates the voice based on the frequency parameter P3, which reflects the identified individual's voice features. Therefore, the accuracy of separating the speaker voice and the background sound can be further increased.

<Fourth embodiment>
FIG. 15 is a schematic configuration diagram of a video / audio output device 4 according to the fourth embodiment of the present invention. The video / audio output device 4 is a device that outputs audio with audio localization in accordance with the speaker position.

The video analysis unit 41 outputs the input video data to the video display unit 14 (delaying the video data as necessary so that it stays synchronized with the audio data) and specifies the speaker position from the input video data.

Further, the video analysis unit 41 has a function of analyzing the features of the scene, and acquires the scene features of the input video data. Specifically, the scene feature is acquired from the RGB histogram of the video data, the temporal change of the RGB histogram, the distribution of the motion vector in the unit block, and the like.
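Two of the cues named above, the RGB histogram and its temporal change, can be sketched as follows. The source does not state the bin count, the normalization, or the distance measure, so the 8 bins per channel, the L1 distance, and the names `rgb_histogram` and `histogram_change` are assumptions made for illustration. Motion vectors are omitted.

```python
def rgb_histogram(pixels, bins=8):
    """Normalized per-channel histogram of 8-bit (r, g, b) pixels.

    Returns 3 * bins values: the red bins, then green, then blue.
    """
    hist = [0.0] * (3 * bins)
    width = 256 // bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so that the top values land in the last bin.
            hist[channel * bins + min(value // width, bins - 1)] += 1.0
    total = len(pixels) or 1  # avoid division by zero for an empty frame
    return [count / total for count in hist]


def histogram_change(prev_hist, cur_hist):
    """L1 distance between the histograms of two consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev_hist, cur_hist))
```

A scene classifier (for example, one that recognizes a rainfall scene) would take such values as input; how the scene is actually classified is not described in the source and is not shown here.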

Also, the video analysis unit 41 outputs the specified speaker position to the localization processing unit 13 and outputs the acquired scene feature to the audio separation unit 42.

The voice separation unit 42 separates the input voice data (voice data in which the speaker voice and the background sound are mixed) into the speaker voice and the background sound according to the frequency parameter P1 or the frequency parameter P4. Specifically, when the background sound can be estimated from the scene feature acquired by the video analysis unit 41, the voice separation unit 42 acquires the background sound feature (frequency parameter P4) corresponding to that scene feature. For example, when the scene feature is "rainfall scene", the speaker voice can be obtained by removing the rain sound from the input voice data. Specifically, since the sound of rain has characteristics close to white noise, the speaker voice can be separated by subtracting a certain power from all frequency bands. On the other hand, when the background sound cannot be estimated from the scene feature, the voice data is separated according to the frequency parameter P1.
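The rainfall example, subtracting a certain power from all frequency bands, can be sketched as a simple spectral subtraction. This is an illustration under assumptions, not the method of the embodiment: `noise_power` (expressed as the squared DFT magnitude per bin) is a hypothetical parameter whose value the source does not give, and the phase of each bin is assumed to be kept unchanged. The naive DFT is only suitable for short test signals.

```python
import cmath
import math


def subtract_noise_floor(samples, noise_power):
    """Subtract a constant power from every DFT bin, keeping each bin's phase."""
    n = len(samples)
    spec = [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]
    cleaned = []
    for value in spec:
        # Remove the assumed white-noise power; clamp so power never goes negative.
        power = max(abs(value) ** 2 - noise_power, 0.0)
        cleaned.append(cmath.rect(math.sqrt(power), cmath.phase(value)))
    return [(sum(cleaned[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]
```

Bins whose power is below `noise_power` are removed entirely, while stronger bins are only attenuated, which is why this approach suits background sound that is spread evenly over the spectrum.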

FIG. 16 is a flowchart showing in detail the flow of the audio separation process of the video / audio output device 4 according to the present embodiment.

The voice separation unit 42 determines whether the background sound can be estimated from the acquired scene feature (step S82). When the background sound can be estimated (step S82: YES), the voice separation unit 42 sets the frequency parameter P4 suitable for the estimated background sound and acquires the speaker voice and the background sound (step S84). On the other hand, when the background sound cannot be estimated (step S82: NO), the voice separation unit 42 sets the lower limit frequency f1 and the upper limit frequency f2 of the frequency parameter P1 (step S86). Specifically, f1 = 80 Hz and f2 = 3000 Hz.

Next, the voice separation unit 42 applies a band-pass filter based on the frequency parameter P1 to the input voice data to separate the speaker voice (step S88), and applies a band rejection filter based on the frequency parameter P1 to the input voice data to separate the background sound (step S90).

As described above, the video / audio output device 4 according to the present embodiment analyzes the scene feature and estimates the background sound from the analyzed scene feature. Therefore, the accuracy of separating the speaker voice and the background sound can be further increased.

Although the embodiments of the present invention have been described above, the present invention is not limited to the above-described embodiments. Various modifications and changes can be made without departing from the gist of the present invention, and embodiments with such modifications and changes are also included in the technical scope of the present invention.

1, 1A, 2, 3, 4 Video / audio output device
11, 21, 31, 41 Video analysis unit
12, 22, 32, 42 Audio separation unit
13, 13A Localization processing unit
14 Video display unit
15 Audio output unit
23 Attribute DB
33 Feature DB
P1, P2, P3, P4 Frequency parameter
f1 Lower limit frequency
f2 Upper limit frequency

Claims

A video / audio output device comprising:
speaker position specifying means for analyzing a video and specifying a speaker position;
voice separation means for separating a mixed sound, in which a speaker's voice and a background sound are mixed, into the speaker's voice and the background sound; and
voice localization means for localizing the speaker's voice separated by the voice separation means at the speaker position specified by the speaker position specifying means.
The video / audio output device according to claim 1, wherein the voice separation means
has a first parameter relating to the upper limit frequency and the lower limit frequency of the speaker's voice, and
separates the mixed sound into a speaker's voice and a background sound based on the first parameter.
The video / audio output device according to claim 2, further comprising:
attribute detection means for analyzing the video and detecting a speaker attribute; and
voice parameter adjustment means for adjusting the value of the first parameter according to the speaker attribute detected by the attribute detection means,
wherein the voice separation means separates the mixed sound into a speaker's voice and a background sound based on the value of the first parameter adjusted by the voice parameter adjustment means.
The video / audio output device according to claim 2, further comprising:
facial feature detection means for analyzing the video and detecting the facial features of the speaker;
a database that stores the facial features of a person and a second parameter relating to the upper limit frequency and the lower limit frequency of the voice of that person in association with each other; and
parameter acquisition means for acquiring the second parameter associated with the facial features detected by the facial feature detection means when those facial features exist in the database,
wherein the voice separation means separates the mixed sound into a speaker's voice and a background sound based on the second parameter when the facial features detected by the facial feature detection means exist in the database, and based on the first parameter when they do not exist in the database.
The video / audio output device according to claim 2, further comprising:
scene feature detection means for analyzing the video and detecting a scene feature; and
background sound estimation means for estimating a background sound from the detected scene feature,
wherein the voice separation means separates the mixed sound into a speaker's voice and a background sound based on a parameter relating to a frequency suitable for the estimated background sound when the background sound estimation means can estimate the background sound, and based on the first parameter when the background sound estimation means cannot estimate the background sound.
The video / audio output device according to claim 1, wherein the voice localization means localizes the background sound separated by the voice separation means in a direction away from the center of the screen with respect to the speaker position specified by the speaker position specifying means.
A sound localization method comprising:
a speaker position specifying step of analyzing a video and specifying a speaker position;
a voice separation step of separating a mixed sound, in which a speaker's voice and a background sound are mixed, into the speaker's voice and the background sound; and
a voice localization step of localizing the speaker's voice separated in the voice separation step at the speaker position specified in the speaker position specifying step.