WO2010140254A1 - Image/audio output device and sound localization method - Google Patents

Image/audio output device and sound localization method

Info

Publication number
WO2010140254A1
WO2010140254A1 (PCT/JP2009/060362)
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
voice
face
feature
video
Prior art date
Application number
PCT/JP2009/060362
Other languages
English (en)
Japanese (ja)
Inventor
和実 菅谷
洋人 河内
禎司 鈴木
Original Assignee
パイオニア株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by パイオニア株式会社
Priority to PCT/JP2009/060362
Publication of WO2010140254A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card, using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/44 Receiver circuitry for the reception of television signals according to analogue transmission standards
    • H04N5/60 Receiver circuitry for the reception of television signals according to analogue transmission standards for the sound signals
    • H04N5/602 Receiver circuitry for the reception of television signals according to analogue transmission standards for digital sound signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field

Definitions

  • The present invention relates to an audio localization technology for a video/audio output device that outputs content data including video and audio, and more particularly to an audio localization technique that localizes the audio according to the speaker position.
  • When program content such as a TV broadcast is received, the video is displayed on a display, and the sound is output from loudspeakers, a human voice in monaural sound is heard from the position of the loudspeaker. In stereo or surround sound, the human voice is in many cases localized at the center of the screen so that it is heard from there.
  • In Patent Document 1, the position of a speaker is detected, and the volume of the sound output from a plurality of loudspeakers is controlled according to the detected position.
  • The present invention has been made in view of the above circumstances, and one example of the problem it addresses is to provide an audio localization technique that takes into account the case where the speaker's face is not on the screen.
  • A video/audio output apparatus according to the invention comprises: face feature extraction means for analyzing an input video, detecting one or more face positions, and extracting a face feature for each detected face position;
  • face attribute determination means for determining, as a first attribute, the attribute of each face feature extracted by the face feature extraction means;
  • voice feature extraction means for analyzing the input voice and extracting a voice feature;
  • voice attribute determination means for determining, as a first attribute, the attribute of the voice feature extracted by the voice feature extraction means;
  • fitness determination means for determining the degree of fitness between the first attribute of each face feature determined by the face attribute determination means and the first attribute of the voice feature determined by the voice attribute determination means; and
  • voice localization means for localizing the voice to the position of the speaker's face having the highest degree of fitness when it is determined, from the determination result of the fitness determination means, that the speaker's face is in the screen.
  • A sound localization method according to the invention includes: a face feature extraction step of analyzing an input video, detecting one or more face positions, and extracting a face feature for each detected face position;
  • a face attribute determination step of comparing each face feature extracted in the face feature extraction step with the face feature information stored in a face feature database that stores face feature information classified by attribute, and determining the attribute of each face feature;
  • a voice feature extraction step of analyzing the input voice and extracting a voice feature;
  • a voice attribute determination step of comparing the voice feature extracted in the voice feature extraction step with the voice feature information stored in a voice feature database that stores voice feature information classified by attribute, and determining the attribute of the extracted voice feature;
  • a fitness determination step of determining the degree of fitness between the attribute of each face feature determined in the face attribute determination step and the attribute of the voice feature determined in the voice attribute determination step; and
  • a voice localization step of localizing the voice to the position of the speaker's face having the highest degree of fitness when it is determined, from the result of the fitness determination step, that the speaker's face is displayed on the screen.
  • FIG. 1 is a schematic configuration diagram of a video/audio output device according to the first embodiment of the present invention. FIG. 2 is an example of an image displayed by the video/audio output device according to the first embodiment. FIGS. 3 and 4 are examples of face attribute determination results and voice attribute determination results produced by the video/audio output device according to an embodiment of the present invention. FIG. 5 is a flowchart showing the flow of the video/audio output process of the video/audio output device according to the first embodiment.
  • FIG. 1 is a schematic configuration diagram of a video / audio output device 1 according to an embodiment of the present invention.
  • The video/audio output device 1 is a device that outputs audio whose localization matches the speaker position. In this embodiment, it is determined whether or not the speaker's face is in the screen, and if it is, the panning of the voice is changed so that the voice matches the position of the speaker in the screen.
  • Here, “speaker” refers to the person speaking in the video data (on the screen), and “speaker position” refers to the position of the speaker on the screen, that is, a position near the speaker's face. “Outputting the voice with its localization matched to the speaker position” means outputting the voice so that it is heard from the position of the speaker.
  • For example, as shown in FIG. 2, when speaker A is on the left side of the screen, the volume of the speaker's voice output from loudspeaker SP1 on the left side of screen d10 is increased, and the volume output from loudspeaker SP2 on the right side of screen d10 is decreased, so that the voice is heard from the speaker's position on the left side of the screen.
  • In this case, the volume ratio of the speaker's voice from SP1 and SP2 may be set to D:C (C and D being the distances indicated in FIG. 2).
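A rough sketch of this distance-ratio panning, under stated assumptions: the screen spans from SP1 at x = 0 to SP2 at x = screen_width, C and D are the distances from the speaker position to SP1 and SP2, and the SP1:SP2 volume ratio is D:C as described above. The helper names pan_gains and localize are hypothetical, not taken from the patent.

```python
import numpy as np

def pan_gains(x_pos: float, screen_width: float) -> tuple[float, float]:
    """Left/right loudspeaker gains for a speaker at horizontal position
    x_pos: each loudspeaker's share is proportional to the distance to the
    opposite loudspeaker, so the closer loudspeaker plays louder (D:C rule)."""
    c = x_pos                   # distance to SP1 (left edge)
    d = screen_width - x_pos    # distance to SP2 (right edge)
    return d / (c + d), c / (c + d)   # SP1 gain : SP2 gain = D : C

def localize(voice: np.ndarray, x_pos: float, screen_width: float) -> np.ndarray:
    """Render a mono voice track as a stereo signal panned toward x_pos."""
    g_l, g_r = pan_gains(x_pos, screen_width)
    return np.stack([g_l * voice, g_r * voice], axis=1)
```

For a speaker one quarter of the way across the screen, pan_gains(0.25, 1.0) returns (0.75, 0.25), matching the D:C ratio.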
  • The video/audio output device 1 may be any device that has a function of reproducing content data including video and audio input from the outside and outputting it.
  • For example, a television (TV), a DVD player or recorder, a BD player or recorder, a personal computer (PC), and the like are assumed.
  • The video/audio output apparatus 1 includes a face area detection unit 101, a face feature detection unit 102, an attribute-specific face feature database (hereinafter, attribute-specific face feature DB) 103, a face attribute determination unit 104, a voice feature detection unit 105, an attribute-specific voice feature database (hereinafter, attribute-specific voice feature DB) 106, a voice attribute determination unit 107, a fitness determination unit 108, a sound localization processing unit 109, a video display unit 110, and an audio output unit 111.
  • The face area detection unit 101 detects human face areas from the input video data.
  • The face areas are detected using a known technique. For example, there is a technique that detects a face area by searching for skin-colored regions, and a technique that detects a face area by template matching, that is, by comparing templates (face pattern images) against the image.
  • The face area detection unit 101 outputs position information for each detected face area (face area position information) to the face feature detection unit 102; when no face area is detected, information to that effect is output to the face feature detection unit 102.
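A toy sketch of the skin-color approach mentioned above, assuming a YCbCr input image; the chrominance thresholds are illustrative values from common skin-tone heuristics, not values given in the patent.

```python
import numpy as np

def detect_face_area(ycbcr: np.ndarray) -> tuple[int, int, int, int] | None:
    """Mark pixels whose chrominance falls in a skin-tone range and return
    the bounding box (x0, y0, x1, y1) of the skin-colored region, or None
    when no face area is detected."""
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    mask = (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                       # no face area detected
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```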
  • The face feature detection unit 102 detects each person's facial features from the input video data based on the face area position information output from the face area detection unit 101. The facial features are detected using a known technique. In this embodiment, a method of detecting the feature quantities of the major parts of the face is adopted, and feature quantities indicating the positional relationships among the face contour, eyebrows, eyes, nose, mouth, and so on are detected as the facial feature. The face feature detection unit 102 then outputs the detected facial features to the face attribute determination unit 104.
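A minimal sketch of one way to encode such positional relationships, assuming an external landmark detector supplies (x, y) points for the face parts; the patent does not fix the exact feature computation, so this is illustrative only.

```python
import numpy as np
from itertools import combinations

def face_feature(landmarks: np.ndarray) -> np.ndarray:
    """Encode the positional relationships among major face parts (contour,
    eyebrows, eyes, nose, mouth) as a feature vector: all pairwise distances
    between landmarks, normalized by the face's bounding-box diagonal so the
    feature is scale-invariant. `landmarks` is an (N, 2) array from any
    landmark detector (an assumption; no detector is prescribed here)."""
    scale = np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))
    dists = [np.linalg.norm(landmarks[i] - landmarks[j])
             for i, j in combinations(range(len(landmarks)), 2)]
    return np.asarray(dists) / scale
```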
  • The attribute-specific face feature DB 103 is a database that stores data on facial features classified by attribute.
  • Here, "attribute" means, for example, sex and age. In this embodiment, the data is classified into six attributes, namely males under 20, males from 20 to 49, males 50 and over, females under 20, females from 20 to 49, and females 50 and over, and data on facial features is stored for each attribute.
  • The data is classified into these six attributes here, but the attribute classification is not limited to this and may be further subdivided.
  • In this embodiment, the video/audio output apparatus 1 includes the attribute-specific face feature DB 103, but it does not have to; the apparatus may instead access the attribute-specific face feature DB 103 via a communication network and refer to the data stored there.
  • The face attribute determination unit 104 compares each face feature output from the face feature detection unit 102 with the attribute-specific facial feature data stored in the attribute-specific face feature DB 103, and determines the attribute of each face.
  • The determined result (the face attribute determination result) is data giving the probability of belonging to each attribute; in the present embodiment it is, for example, data as shown in FIG. 3. The face attribute determination unit 104 then outputs the face attribute determination result to the fitness determination unit 108.
  • The method of determining attributes (age, sex) from facial features in the face feature detection unit 102, the attribute-specific face feature DB 103, and the face attribute determination unit 104 uses a known technique. For example, according to the method described in IEICE technical report PRMU 2001-138 (2001), "Proposal of Gender/Age Estimation Method Using Distance from Average Face," the matching attribute may be determined based on the feature-point distance between the detected face and the average face for each attribute.
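The cited approach measures distances to per-attribute average faces; the sketch below converts such distances into the per-attribute probability data shown in FIG. 3. The softmax weighting over negative distances is an illustrative choice, not something the patent or the cited paper prescribes.

```python
import numpy as np

ATTRIBUTES = ["M<20", "M20-49", "M50+", "F<20", "F20-49", "F50+"]

def face_attribute_probs(feature: np.ndarray,
                         average_faces: dict[str, np.ndarray]) -> dict[str, float]:
    """Turn distances between the detected face feature and each attribute's
    average-face feature into a probability per attribute (smaller distance
    means higher probability)."""
    d = np.array([np.linalg.norm(feature - average_faces[a]) for a in ATTRIBUTES])
    w = np.exp(-d)               # closer average face -> larger weight
    p = w / w.sum()              # normalize into a probability vector
    return dict(zip(ATTRIBUTES, p))
```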
  • The voice feature detection unit 105 detects a human voice feature from the input voice data, using a known technique. In this embodiment, the frequency content of the voice is analyzed, the formant frequencies and the pitch frequency are identified, and these are detected as the voice feature. The voice feature detection unit 105 then outputs the detected voice feature to the voice attribute determination unit 107.
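For the pitch part of the voice feature, a common estimator is autocorrelation; this sketch assumes a voiced mono frame and a plausible search range, and is not the patent's prescribed method.

```python
import numpy as np

def pitch_frequency(frame: np.ndarray, sr: int,
                    fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Estimate the pitch (fundamental) frequency of a voiced frame by
    autocorrelation: find the lag with maximal self-similarity within
    the lag range corresponding to [fmin, fmax]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```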
  • The attribute-specific voice feature DB 106 is a database that stores data on voice features classified by attribute.
  • Here, "attribute" again means, for example, sex and age. In this embodiment, as with the facial features, the data is classified into six attributes, namely males under 20, males from 20 to 49, males 50 and over, females under 20, females from 20 to 49, and females 50 and over, and data on voice features is stored for each attribute.
  • The data is classified into these six attributes here, but the attribute classification is not limited to this.
  • In this embodiment, the video/audio output device 1 includes the attribute-specific voice feature DB 106, but it does not have to; the device may instead access the attribute-specific voice feature DB 106 via a communication network and refer to the data stored there.
  • The voice attribute determination unit 107 determines the voice attribute by comparing the voice feature output from the voice feature detection unit 105 with the attribute-specific voice feature data stored in the attribute-specific voice feature DB 106.
  • The determined result (the voice attribute determination result) is data giving the probability of belonging to each attribute; in the present embodiment it is, for example, data as shown in FIG. 3. The voice attribute determination unit 107 then outputs the voice attribute determination result to the fitness determination unit 108.
  • The method of determining attributes (age, sex) from voice features in the voice feature detection unit 105, the attribute-specific voice feature DB 106, and the voice attribute determination unit 107 uses a known technique. For example, according to the method described in the Journal of the Acoustical Society of Japan, Vol. 24, No. 6 (1968), "Changes in Pitch Frequency and Formant Frequency of Japanese 5 Vowels by Age and Gender," the attribute may be discriminated from the measured pitch and formant frequencies.
  • The fitness determination unit 108 determines the degree of fitness (matching) between each face and the voice. More specifically, it performs a fitness determination process to determine whether or not the speaker's face is on the screen and, if so, which face is the speaker's.
  • The fitness determination unit 108 outputs speaker information, the result of the fitness determination process, to the sound localization processing unit 109.
  • FIG. 3 shows the face attribute determination results and the voice attribute determination result when three face features are detected in the screen, and FIG. 4 shows them when two face features are detected.
  • The degree of fitness between a face and the voice is calculated by multiplying the values belonging to the same attribute for all the attributes and summing the products.
  • The threshold T1 for determining whether the speaker's face is in the screen is, as an example, 0.5; when the fitness is at or above T1, the speaker's face is determined to be in the screen. In FIG. 3, the largest fitness value is A1 and A1 ≥ T1, so it is determined that the speaker is in the screen and that the speaker is the person with face 1. In this case, position information indicating the position of face 1 is output to the sound localization processing unit 109 as the speaker information.
  • In FIG. 4, the highest fitness value is B1, but since B1 < T1, it is determined that there is no speaker in the screen, and information indicating that the speaker is not on the screen is output to the sound localization processing unit 109 as the speaker information.
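In code, the fitness computation and the T1 decision described above reduce to a dot product of the two attribute-probability vectors followed by a threshold test. A minimal sketch, assuming the six-attribute probability vectors from the earlier sketches; the helper names are hypothetical.

```python
import numpy as np

T1 = 0.5  # example threshold from the description above

def fitness(face_probs: np.ndarray, voice_probs: np.ndarray) -> float:
    """Degree of fitness between one face and the voice: per-attribute
    products summed over all attributes (a dot product of the two
    probability vectors)."""
    return float(np.dot(face_probs, voice_probs))

def find_speaker(face_prob_list: list[np.ndarray],
                 voice_probs: np.ndarray) -> int | None:
    """Return the index of the speaker's face, or None when no on-screen
    face fits the voice well enough (best fitness below T1)."""
    scores = [fitness(f, voice_probs) for f in face_prob_list]
    best = int(np.argmax(scores))
    return best if scores[best] >= T1 else None
```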
  • The sound localization processing unit 109 performs localization change processing on the input voice data based on the speaker information output from the fitness determination unit 108; that is, the volume is adjusted so that the sound is localized at the speaker position on the screen. For example, as shown in FIG. 2, when speaker A is on the left side of the screen, the volume output from loudspeaker SP1 on the left side of screen d10 is increased and the volume output from loudspeaker SP2 on the right side of screen d10 is decreased.
  • The sound localization processing unit 109 then outputs the localization-changed sound to the audio output unit 111.
  • The video display unit 110 outputs the input video data to a display or the like. The video data to be displayed is delayed as necessary to keep it synchronized with the audio data being output.
  • The audio output unit 111 outputs the audio data that has undergone the localization change processing to the loudspeakers.
  • FIG. 5 is a flowchart showing the flow of the video / audio output process of the video / audio output device 1.
  • First, the video/audio output device 1 analyzes the input video data and performs the face attribute determination process to determine the face attributes (step S10). Specifically, the face area detection unit 101 detects human face areas from the input video data; the face feature detection unit 102 then detects the facial features based on the detected face areas; finally, the face attribute determination unit 104 determines the attribute of each detected face by comparing the detected face features with the data in the attribute-specific face feature DB 103.
  • Next, the video/audio output device 1 analyzes the input audio data and performs the voice attribute determination process to determine the voice attribute (step S20). Specifically, the voice feature detection unit 105 detects a human voice feature from the input voice data, and the voice attribute determination unit 107 then determines the attribute of the detected voice by comparing it with the data in the attribute-specific voice feature DB 106.
  • Next, the fitness determination unit 108 of the video/audio output device 1 performs the fitness determination process, determining the fitness of each face and the voice based on the face attribute determination results and the voice attribute determination result (step S30).
  • If, as a result of the fitness determination process, the speaker's face is determined to be within the screen (step S40: YES), the sound localization process is performed to localize the sound to the position of that face (step S50). For example, the output level of the loudspeaker close to the speaker position is raised, and the output level of the loudspeaker far from the speaker position is lowered.
  • If the speaker's face is determined not to be on the screen (step S40: NO), the sound localization process is not performed.
  • Finally, the video display unit 110 of the video/audio output device 1 outputs the video data, and the audio output unit 111 outputs the audio data: localization-changed audio when the speaker's face is on the screen, and unmodified audio otherwise (step S60). A compact sketch of this flow follows.
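Putting steps S10 through S60 together, a per-frame sketch reusing find_speaker and localize from the earlier sketches. detect_faces, extract_voice_feature, face_attr_probs_of, and voice_attr_probs_of are hypothetical callables standing in for units 101/102, 105, 104, and 107; detected face objects are assumed to expose .feature and an on-screen position .x.

```python
import numpy as np

def process_frame(video_frame, audio_frame, sr, screen_width,
                  detect_faces, extract_voice_feature,
                  face_attr_probs_of, voice_attr_probs_of):
    """One pass of the FIG. 5 flow for a video frame and its audio block."""
    faces = detect_faces(video_frame)                        # S10: positions + features
    face_probs = [face_attr_probs_of(f.feature) for f in faces]
    voice_probs = voice_attr_probs_of(extract_voice_feature(audio_frame, sr))  # S20
    idx = find_speaker(face_probs, voice_probs) if faces else None  # S30/S40
    if idx is None:                                          # S40 NO: leave panning alone
        return np.stack([audio_frame, audio_frame], axis=1)
    return localize(audio_frame, faces[idx].x, screen_width)  # S50/S60
```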
  • As described above, the video/audio output device 1 of the present embodiment can determine whether or not the speaker's face is in the screen. If, as a result, the speaker's face is on the screen, the sound is panned to the identified speaker position; if it is not, the panning is left unchanged. The viewer therefore feels no discomfort, and more natural, realistic viewing is possible.
  • FIG. 6 is a schematic configuration diagram of the video / audio output device 2 according to the second embodiment of the present invention.
  • The video/audio output device 2 is a device that outputs audio whose localization matches the speaker position; it adds to the video/audio output device 1 a function for selecting attribute-specific facial feature data based on the content's program information, and a function for selecting attribute-specific voice feature data based on the content's program information.
  • The same portions as in the first embodiment are denoted by the same reference numerals, and their description is omitted.
  • The attribute-specific face feature DB 201 holds data on attribute-specific facial features divided by race; for example, data on Western facial feature quantities and data on Eastern facial feature quantities are provided.
  • The attribute-specific face feature selection unit 202 selects the most suitable attribute-specific facial feature data based on the program information of the content. For example, if the video content to be reproduced is an American movie, the data on Western facial features is selected based on the program information.
  • The face attribute determination unit 104 then determines the face attribute of each detected face feature based on the attribute-specific facial feature data selected by the attribute-specific face feature selection unit 202.
  • Likewise, the attribute-specific voice feature DB 203 holds data on attribute-specific voice features divided by race; for example, data on Western voice feature quantities and data on Eastern voice feature quantities are provided.
  • The attribute-specific voice feature selection unit 204 selects the most suitable attribute-specific voice feature data based on the program information of the content. For example, if the video content to be reproduced is an American movie, the data on Western voice features is selected based on the program information.
  • The voice attribute determination unit 107 then determines the voice attribute of the detected voice feature based on the attribute-specific voice feature data selected by the attribute-specific voice feature selection unit 204.
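A small sketch of the selection performed by units 202 and 204: pick the feature database whose tag best matches the content's program information. The program-information fields, country codes, and region tags are assumptions for illustration; the patent does not specify a metadata format.

```python
FACE_DBS = {"western": "face_db_western", "eastern": "face_db_eastern"}

def select_db(program_info: dict, dbs: dict[str, str]) -> str:
    """Choose the feature DB for the content based on its program information."""
    country = program_info.get("country_of_origin", "")
    region = "western" if country in {"US", "GB", "FR", "DE"} else "eastern"
    return dbs[region]

# e.g. select_db({"title": "...", "country_of_origin": "US"}, FACE_DBS)
# -> "face_db_western"
```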
  • As described above, according to the video/audio output device 2 of the present embodiment, since data on attribute-specific facial features and data on attribute-specific voice features are provided for each race, the accuracy of the face/voice fitness determination can be further improved. As a result, the speaker position can be identified more accurately, so the viewer can watch with a more natural and realistic feeling.
  • FIG. 7 is a schematic configuration diagram of the video / audio output device 3 according to the third embodiment of the present invention.
  • The video/audio output device 3 is a device that outputs audio whose localization matches the speaker position; it adds to the video/audio output device 1 a function for performing the sound localization processing in consideration of viewing environment information.
  • The sound localization processing unit 301 performs localization change processing on the input voice data based on the speaker information output from the fitness determination unit 108, and takes the viewing environment information into account when doing so.
  • The viewing environment information is, for example, the size of the display screen and the positions of the loudspeakers. When loudspeakers are also provided above and below the screen, not only the volumes of the left and right loudspeakers shown in FIG. 2 but also the volumes of the upper and lower loudspeakers are adjusted.
  • The viewing environment information may be set by the user operating the video/audio output device 3, or, if the video/audio output device 3 can automatically determine the screen size, the loudspeaker positions, and so on, it may be set automatically by the device. A sketch of such environment-aware gain computation follows.
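Generalizing the two-loudspeaker D:C ratio to arbitrary loudspeaker positions known from the viewing environment information, a sketch with an inverse-distance gain law; the exact gain law is an assumption, since the patent only says the upper/lower and left/right volumes are adjusted.

```python
import numpy as np

def env_gains(speaker_xy: tuple[float, float],
              loudspeaker_xys: list[tuple[float, float]]) -> np.ndarray:
    """Per-loudspeaker gains for an on-screen speaker position, weighting
    each loudspeaker by inverse distance so closer ones play louder."""
    pos = np.asarray(speaker_xy)
    d = np.array([np.linalg.norm(pos - np.asarray(p)) for p in loudspeaker_xys])
    w = 1.0 / np.maximum(d, 1e-6)    # guard against a zero distance
    return w / w.sum()               # normalized gains summing to 1
```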
  • As described above, according to the video/audio output device 3, the sound can be localized at a more accurate position using the viewing environment information, so the viewer can watch with an even more natural and realistic feeling.
  • FIG. 8 is a schematic configuration diagram of a video / audio output device 4 according to the fourth embodiment of the present invention.
  • The video/audio output device 4 is a device that outputs audio whose localization matches the speaker position; it adds to the video/audio output device 1 a function for updating the attribute-specific facial feature data stored in the attribute-specific face feature DB 103 and the attribute-specific voice feature data stored in the attribute-specific voice feature DB 106.
  • When the degree of fitness determined by the fitness determination unit 108 is a high value, at or above a threshold T2 (T2 > T1), the attribute-specific face feature update unit 401 reflects the face feature being determined at that time in the attribute-specific facial feature data of the attribute-specific face feature DB 103. As a result, when the same person's face is input again as video data, the face/voice fitness becomes even higher, and the accuracy of the sound localization can be improved.
  • Likewise, when the degree of fitness determined by the fitness determination unit 108 is a high value at or above the threshold, the attribute-specific voice feature update unit 402 reflects the voice feature at that time in the attribute-specific voice feature data of the attribute-specific voice feature DB 106. Thereby, when the same person's voice is input again as voice data, the face/voice fitness becomes even higher, and the accuracy of the sound localization can be improved.
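One plausible reading of "reflecting" a high-fitness observation in the DB is an exponential moving average of the stored feature, gated by T2. This is a sketch under that assumption; the patent does not specify the update rule, and the T2 value here is illustrative.

```python
import numpy as np

T2 = 0.8  # illustrative; the patent only requires T2 > T1

def update_db(avg_feature: np.ndarray, new_feature: np.ndarray,
              fit: float, alpha: float = 0.1) -> np.ndarray:
    """Fold a high-confidence observation into the attribute's stored
    feature data; leave the DB untouched when fitness is below T2."""
    if fit >= T2:
        return (1 - alpha) * avg_feature + alpha * new_feature
    return avg_feature
```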
  • As described above, according to the video/audio output device 4, based on the face/voice fitness determination, the attribute-specific facial feature data stored in the attribute-specific face feature DB 103 and the attribute-specific voice feature data stored in the attribute-specific voice feature DB 106 are updated as needed, so the attribute-specific facial feature and voice feature data become more accurate simply by reproducing video content.
  • In the present embodiment, the attribute-specific facial feature and voice feature data are updated as needed; in addition, data on individual facial features and voice features may be created and updated as needed. This makes it possible to identify individuals, such as actors and TV personalities who often appear in video content, so the speaker position can be grasped more accurately and the sound localization can be changed precisely.
  • Reference signs: 1 video/audio output device; 101 face area detection unit; 102 face feature detection unit; 103, 201 attribute-specific face feature DB; 104 face attribute determination unit; 105 voice feature detection unit; 106, 203 attribute-specific voice feature DB; 107 voice attribute determination unit; 108 fitness determination unit; 109, 301 sound localization processing unit; 110 video display unit; 111 audio output unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

An image/audio output device (1) comprises: a face area detection unit (101) that analyzes an image and detects one or more face positions; a face feature detection unit (102) that extracts a feature for each detected face position; a face attribute determination unit (104) that determines the attribute of each extracted face feature by comparing it against an attribute-specific face feature DB (103) storing face feature information classified by attribute; a voice feature detection unit (105) that analyzes the sound and extracts a voice feature; a voice attribute determination unit (107) that determines the attribute of the extracted voice feature by comparing it against an attribute-specific voice feature DB (106) storing voice feature information classified by attribute; a fitness determination unit (108) that determines the fitness between the determined face feature attributes and the determined voice feature attribute; and a sound localization unit (109) that localizes the sound at the position of the speaker's face when, based on the fitness determination, the speaker is determined to be shown on the screen.
PCT/JP2009/060362 2009-06-05 2009-06-05 Image/audio output device and sound localization method WO2010140254A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/060362 WO2010140254A1 (fr) 2009-06-05 2009-06-05 Image/audio output device and sound localization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/060362 WO2010140254A1 (fr) 2009-06-05 2009-06-05 Image/audio output device and sound localization method

Publications (1)

Publication Number Publication Date
WO2010140254A1 (fr)

Family

ID=43297400

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/060362 WO2010140254A1 (fr) 2009-06-05 2009-06-05 Dispositif de sortie d'image/de son et procédé de localisation de son

Country Status (1)

Country Link
WO (1) WO2010140254A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11313272A (ja) * 1998-04-27 1999-11-09 Sharp Corp Video and audio output device
JP2000295700A (ja) * 1999-04-02 2000-10-20 Nippon Telegr & Teleph Corp <Ntt> Sound source localization method and device using image information, and storage medium storing a program implementing the method
JP2004056286A (ja) * 2002-07-17 2004-02-19 Fuji Photo Film Co Ltd Image display method
JP2007201818A (ja) * 2006-01-26 2007-08-09 Sony Corp Audio signal processing device, audio signal processing method, and audio signal processing program

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014127019A1 (fr) * 2013-02-15 2014-08-21 Qualcomm Incorporated Video analysis assisted generation of multi-channel audio data
CN104995681A (zh) 2013-02-15 2015-10-21 高通股份有限公司 Video analysis assisted generation of multi-channel audio data
US9338420B2 (en) 2013-02-15 2016-05-10 Qualcomm Incorporated Video analysis assisted generation of multi-channel audio data
JP2016513410A (ja) 2013-02-15 2016-05-12 Qualcomm Incorporated Video analysis assisted generation of multi-channel audio data
CN104995681B (zh) 2013-02-15 2017-10-31 高通股份有限公司 Video analysis assisted generation of multi-channel audio data
JP2019152737A (ja) 2018-03-02 2019-09-12 株式会社日立製作所 Speaker estimation method and speaker estimation device
EP3706442A1 (fr) 2019-03-08 2020-09-09 LG Electronics Inc. Method and apparatus for sound object following
US11277702B2 (en) 2019-03-08 2022-03-15 Lg Electronics Inc. Method and apparatus for sound object following
CN112929739A (zh) 2021-01-27 2021-06-08 维沃移动通信有限公司 Sound production control method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
US8935169B2 (en) Electronic apparatus and display process
US20210249012A1 (en) Systems and methods for operating an output device
JP4736511B2 (ja) Information providing method and information providing apparatus
JP5057918B2 (ja) Electronic device and scene type display method
US7467088B2 (en) Closed caption control apparatus and method therefor
US9251805B2 (en) Method for processing speech of particular speaker, electronic system for the same, and program for electronic system
US8983846B2 (en) Information processing apparatus, information processing method, and program for providing feedback on a user request
CN102111601B (zh) Content-adaptive multimedia processing system and processing method
JP5830672B2 (ja) Hearing aid fitting device
KR101378493B1 (ko) Method and apparatus for setting text data synchronized with video data
KR101958664B1 (ko) Apparatus and method for providing various audio environments in a multimedia content playback system
KR20150093425A (ko) Content recommendation method and apparatus
JP2011250100A (ja) Image processing device and method, and program
CN108055592A (zh) Subtitle display method and apparatus, mobile terminal, and storage medium
US11211074B2 (en) Presentation of audio and visual content at live events based on user accessibility
US11122341B1 (en) Contextual event summary annotations for video streams
US20140064517A1 (en) Multimedia processing system and audio signal processing method
WO2010140254A1 (fr) Image/audio output device and sound localization method
CN112601120A (zh) Subtitle display method and device
JP2012512424A (ja) Method and device for speech synthesis
JP5330551B2 (ja) Electronic device and display processing method
WO2020234939A1 (fr) Information processing device, information processing method, and program
JP2010134507A (ja) Playback device
US20230014995A1 (en) Content recommendations for users with disabilities
US11857877B2 (en) Automatic in-game subtitles and closed captions

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09845537

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09845537

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP