JP2021007216A - Audio source enhancement facilitated using video data - Google Patents
- Publication number
- JP2021007216A (application number JP2020096190A)
- Authority
- JP
- Japan
- Prior art keywords
- signal
- audio
- sound source
- video
- target sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L21/034 — Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude; automatic adjustment
- H04N7/147 — Videophone systems; communication arrangements, e.g. identifying the communication as a video-communication
- H04L65/40 — Support for services or applications in real-time data packet communication
- G06F3/012 — Head tracking input arrangements
- G06F3/013 — Eye tracking input arrangements
- G06F3/165 — Management of the audio stream, e.g. setting of volume, audio stream path
- G06F3/167 — Audio in a user interface, e.g. using voice commands for navigating, audio feedback
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural network learning methods
- G06T5/73 — Image enhancement or restoration; deblurring, sharpening
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06V20/40 — Scene-specific elements in video content
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06V40/165 — Face detection using facial parts and geometric relationships
- G06V40/171 — Feature extraction; local features and components; facial parts; geometrical relationships
- G06V40/172 — Face classification, e.g. identification
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G10L21/0272 — Voice signal separating
- G10L25/57 — Speech or voice analysis specially adapted for processing of video signals
- H04L65/1066 — Session management
- H04L65/1083 — In-session procedures
- H04L65/403 — Arrangements for multi-party communication, e.g. for conferences
- H04L65/765 — Media network packet handling intermediate
- H04N7/15 — Conference systems
- H04R1/406 — Desired directional characteristics obtained by combining a number of identical microphones
- H04R3/005 — Circuits for combining the signals of two or more microphones
- H04S7/302 — Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303 — Tracking of listener position or orientation
- G06T2207/10016 — Image acquisition modality: video; image sequence
- G06T2207/30201 — Subject of image: face
- G10L25/78 — Detection of presence or absence of voice signals
- H04S2400/15 — Aspects of sound capture and related signal processing for recording or reproduction
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Otolaryngology (AREA)
- General Business, Economics & Management (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Social Psychology (AREA)
- Psychiatry (AREA)
- Geometry (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Circuit For Audible Band Transducer (AREA)
- Image Analysis (AREA)
Claims (20)
1. A method comprising:
   receiving a multi-channel audio signal comprising audio signals detected by a plurality of audio input devices;
   receiving an image captured by a video input device;
   defining, based at least in part on the image, a first signal indicative of a likelihood associated with a target audio source;
   defining, based at least in part on the multi-channel audio signal and the first signal, a second signal indicative of a likelihood associated with an audio component originating from the target audio source; and
   processing the multi-channel audio signal to generate an output audio signal based at least in part on the second signal.

2. The method of claim 1, wherein the processing enhances the audio component originating from the target audio source, and wherein the plurality of audio input devices comprises an array of microphones.

3. The method of claim 1, further comprising:
   receiving a plurality of images;
   identifying a source in the plurality of images as the target audio source; and
   performing lip movement detection on the source based at least in part on the plurality of images;
   wherein the second signal is further based on the lip movement detection.

4. The method of claim 1, wherein processing the multi-channel audio signal comprises processing the multi-channel audio signal to generate muted audio based at least in part on whether the target audio source is determined to be in the image, a position of the target audio source relative to the video input device, a gaze direction of the target audio source, and/or whether lip movement of the target audio source is detected.

5. The method of claim 1, wherein the first signal is a binary signal, the binary signal being in a first state based at least in part on the target audio source being determined to be in the image.

6. The method of claim 1, further comprising:
   detecting at least one face in the image; and
   identifying one of the at least one face as the target audio source based at least in part on a predefined face identifier.

7. The method of claim 1, further comprising performing voice activity detection (VAD) on the multi-channel audio signal to generate a VAD signal, wherein the second signal is defined based at least in part on the VAD signal.

8. The method of claim 1, further comprising:
   determining a position of the target audio source in the image; and
   processing the image to generate an output video signal based at least in part on the position.

9. The method of claim 8, further comprising transmitting the output audio signal and the output video signal to an external device over a network.

10. The method of claim 8, wherein processing the image comprises blurring a portion of the image based at least in part on the position to generate the output video signal.

11. The method of claim 8, wherein, if the target audio source is determined not to be in the image, the output video signal comprises a fully blurred image or a fully blanked image.

12. The method of claim 1, further comprising determining a gaze direction of the target audio source based at least in part on the image, wherein the first signal and/or the second signal is further based on the gaze direction.

13. The method of claim 1, further comprising transmitting the output audio signal for use in a Voice over Internet Protocol (VoIP) application.

14. The method of claim 13, further comprising setting a session of the VoIP application to a sleep mode based at least in part on a position of the target audio source relative to the video input device.

15. A system comprising:
   a video subsystem configured to receive an image captured by a video input device, the video subsystem comprising an identification component configured to define, based at least in part on the image, a first signal indicative of a likelihood associated with a target audio source; and
   an audio subsystem configured to receive a multi-channel audio signal comprising audio inputs detected by a plurality of audio input devices, the audio subsystem comprising:
   a logic component configured to define, based at least in part on the multi-channel audio signal and the first signal, a second signal indicative of a likelihood associated with an audio component originating from the target audio source; and
   an audio processing component configured to process the multi-channel audio signal based at least in part on the second signal to generate an output audio signal.

16. The system of claim 15, wherein the video subsystem further comprises a video processing component configured to process the image based at least in part on a position of the target audio source in the image to generate an output video signal.

17. The system of claim 16, wherein the video processing component comprises a background blur component configured to blur a portion of the image based at least in part on the position to generate the output video signal.

18. The system of claim 15, wherein the identification component is further configured to identify a source in a plurality of images as the target audio source, the video subsystem further comprises a lip movement detection component configured to perform lip movement detection on the source based at least in part on the plurality of images, and the second signal is further based on the lip movement detection.

19. The system of claim 15, wherein the audio subsystem further comprises a VAD component configured to perform voice activity detection (VAD) on the multi-channel audio signal to generate a VAD signal, and the second signal is defined based at least in part on the VAD signal.

20. The system of claim 15, wherein the audio processing component is configured to process the multi-channel audio signal to generate muted audio based at least in part on whether the target audio source is determined to be in the image, a position of the target audio source relative to the video input device, a gaze direction of the target audio source, and/or whether lip movement of the target audio source is detected.
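The two-signal gating described in the claims (a binary video-presence signal combined with voice activity detection to decide whether the multi-channel audio is passed through or muted) can be sketched as follows. This is a minimal illustration of the idea only, not the patented implementation; all function names, the energy-based VAD, and the hard binary gating are assumptions made for the example.

```python
import numpy as np

def first_signal(face_detected: bool) -> int:
    """Binary first signal: 1 when the target audio source is judged
    to be present in the camera image (cf. claim 5). Illustrative only."""
    return 1 if face_detected else 0

def vad_signal(frame: np.ndarray, threshold: float = 1e-3) -> int:
    """Toy energy-based voice activity detection over one audio frame
    of shape (channels, samples) (cf. claim 7). Real systems would use
    a trained VAD, not a fixed energy threshold."""
    energy = float(np.mean(frame ** 2))
    return 1 if energy > threshold else 0

def second_signal(s1: int, vad: int, lip_moving: bool = True) -> float:
    """Likelihood that the captured audio originates from the target
    source: here simply visual presence AND voice activity, optionally
    gated by lip movement detection (cf. claim 3)."""
    return float(s1 and vad and lip_moving)

def process_frame(frame: np.ndarray, s2: float) -> np.ndarray:
    """Pass the multi-channel frame through when the second signal is
    high; output muted audio otherwise (cf. claims 1 and 4)."""
    return frame * s2

# Example: a speech-like frame with a face in view passes through;
# the same frame with no face detected is muted.
rng = np.random.default_rng(0)
frame = 0.1 * rng.standard_normal((2, 160))   # 2 channels, 160 samples
s2_present = second_signal(first_signal(True), vad_signal(frame))
s2_absent = second_signal(first_signal(False), vad_signal(frame))
out_present = process_frame(frame, s2_present)
out_absent = process_frame(frame, s2_absent)
```

A production system would replace the hard 0/1 gate with a continuous likelihood and a smoothed gain to avoid audible switching artifacts, but the control flow (video evidence conditioning the audio decision) is the same.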
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/455,668 (US11082460B2) (en) | 2019-06-27 | 2019-06-27 | Audio source enhancement facilitated using video data |
| US16/455,668 | 2019-06-27 | | |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| JP2021007216A (ja) | 2021-01-21 |
| JP2021007216A5 (ja) | 2023-05-31 |
| JP7525304B2 (ja) | 2024-07-30 |
Family
ID=73887691
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| JP2020096190A (Active, granted as JP7525304B2) (ja) | Audio source enhancement facilitated using video data | 2019-06-27 | 2020-06-02 |
Country Status (3)
| Country | Link |
|---|---|
| US | US11082460B2 (en) |
| JP | JP7525304B2 (ja) |
| CN | CN112151063A (zh) |
Families Citing this family (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2565315B (en) * | 2017-08-09 | 2022-05-04 | Emotech Ltd | Robots, methods, computer programs, computer-readable media, arrays of microphones and controllers |
| CN110364161A (zh) * | 2019-08-22 | 2019-10-22 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Method, electronic device, medium, and system for responding to a voice signal |
| FR3103955A1 (fr) * | 2019-11-29 | 2021-06-04 | Orange | Environmental analysis device and method, and voice assistance device and method implementing them |
| EP4073792A1 * | 2019-12-09 | 2022-10-19 | Dolby Laboratories Licensing Corp. | Adjusting audio and non-audio features based on noise metrics and speech intelligibility metrics |
| TWI740339B (zh) * | 2019-12-31 | 2021-09-21 | Acer Incorporated | Method for automatically adjusting a specific sound source and electronic device applying the same |
| US11234090B2 * | 2020-01-06 | 2022-01-25 | Facebook Technologies, Llc | Using audio visual correspondence for sound source identification |
| US11087777B1 | 2020-02-11 | 2021-08-10 | Facebook Technologies, Llc | Audio visual correspondence based signal augmentation |
| US11460927B2 * | 2020-03-19 | 2022-10-04 | DTEN, Inc. | Auto-framing through speech and video localizations |
| KR20210128074A (ko) * | 2020-04-16 | 2021-10-26 | LG Electronics Inc. | Audio zoom based on lip-reading-based speaker detection |
| US11915716B2 | 2020-07-16 | 2024-02-27 | International Business Machines Corporation | Audio modifying conferencing system |
| US11303465B2 | 2020-07-16 | 2022-04-12 | International Business Machines Corporation | Contextually aware conferencing system |
| US11190735B1 * | 2020-07-16 | 2021-11-30 | International Business Machines Corporation | Video modifying conferencing system |
| US11082465B1 * | 2020-08-20 | 2021-08-03 | Avaya Management L.P. | Intelligent detection and automatic correction of erroneous audio settings in a video conference |
| WO2022146169A1 * | 2020-12-30 | 2022-07-07 | Ringcentral, Inc. (A Delaware Corporation) | System and method for noise cancellation |
| EP4385204A1 * | 2021-08-14 | 2024-06-19 | ClearOne, Inc. | Muting specific talkers using a beamforming microphone array |
| WO2023234939A1 * | 2022-06-02 | 2023-12-07 | Innopeak Technology, Inc. | Methods and systems for audio processing using visual information |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7590941B2 * | 2003-10-09 | 2009-09-15 | Hewlett-Packard Development Company, L.P. | Communication and collaboration system using rich media environments |
| KR100754385B1 (ko) | 2004-09-30 | 2007-08-31 | Samsung Electronics Co., Ltd. | Apparatus and method for localization, tracking, and separation using audio/video sensors |
| US20110099017A1 * | 2009-10-26 | 2011-04-28 | Ure Michael J | System and method for interactive communication with a media device user such as a television viewer |
| US20120013620A1 * | 2010-07-13 | 2012-01-19 | International Business Machines Corporation | Animating Speech Of An Avatar Representing A Participant In A Mobile Communications With Background Media |
| KR101971697B1 (ko) * | 2012-02-24 | 2019-04-23 | Samsung Electronics Co., Ltd. | Method and apparatus for user authentication using composite biometric information on a user device |
| US9609273B2 * | 2013-11-20 | 2017-03-28 | Avaya Inc. | System and method for not displaying duplicate images in a video conference |
| KR102217191B1 (ko) * | 2014-11-05 | 2021-02-18 | Samsung Electronics Co., Ltd. | Terminal device and information providing method thereof |
| US9445050B2 * | 2014-11-17 | 2016-09-13 | Freescale Semiconductor, Inc. | Teleconferencing environment having auditory and visual cues |
| EP3579551B1 * | 2014-11-18 | 2022-10-26 | Caavo Inc | Automatic identification and mapping of consumer electronic devices to ports on an HDMI switch |
| EP3101838A1 * | 2015-06-03 | 2016-12-07 | Thomson Licensing | Method and apparatus for isolating an active participant in a group of participants |
| EP3488439B1 | 2016-07-22 | 2021-08-11 | Dolby Laboratories Licensing Corporation | Network-based processing and distribution of multimedia content of a live musical performance |
| JP6410769B2 (ja) | 2016-07-28 | 2018-10-24 | Canon Inc. | Information processing system, control method therefor, and computer program |
| US10867610B2 * | 2018-05-04 | 2020-12-15 | Microsoft Technology Licensing, Llc | Computerized intelligent assistant for conferences |
- 2019-06-27: US application US16/455,668 filed; granted as US11082460B2 (Active)
- 2020-06-02: JP application JP2020096190A filed; granted as JP7525304B2 (Active)
- 2020-06-24: CN application CN202010587240.1A filed; published as CN112151063A (Pending)
Also Published As
Publication number | Publication date |
---|---|
JP7525304B2 (ja) | 2024-07-30 |
US20200412772A1 (en) | 2020-12-31 |
CN112151063A (zh) | 2020-12-29 |
US11082460B2 (en) | 2021-08-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7525304B2 (ja) | | Audio source enhancement facilitated using video data |
| EP3791390B1 | | Voice identification enrollment |
| JP5323770B2 (ja) | | User instruction acquisition device, user instruction acquisition program, and television receiver |
| EP3855731B1 | | Context based target framing in a teleconferencing environment |
| US11343446B2 | | Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote |
| US20190311718A1 | | Context-aware control for smart devices |
| US10083710B2 | | Voice control system, voice control method, and computer readable medium |
| US8416998B2 | | Information processing device, information processing method, and program |
| US20230045237A1 | | Wearable apparatus for active substitution |
| CN104170374A (zh) | | Modifying the appearance of a participant during a video conference |
| JP6562790B2 (ja) | | Dialogue device and dialogue program |
| KR102077887B1 (ko) | | Video conference enhancement |
| US10325600B2 | | Locating individuals using microphone arrays and voice pattern matching |
| US20210092514A1 | | Methods and systems for recording mixed audio signal and reproducing directional audio |
| Huang et al. | | Audio-visual speech recognition using an infrared headset |
| US11164341B2 | | Identifying objects of interest in augmented reality |
| CN114911449A (zh) | | Volume control method and apparatus, storage medium, and electronic device |
| CN114187166A (zh) | | Image processing method, intelligent terminal, and storage medium |
| JP6874437B2 (ja) | | Communication robot, program, and system |
| JP2021056499A (ja) | | Method, program, and apparatus |
| KR20130054131A (ko) | | Display apparatus and control method thereof |
| KR20140093459A (ko) | | Automatic interpretation method |
| US11743588B1 | | Object selection in computer vision |
| KR101171047B1 (ko) | | Robot system with voice and image recognition and recognition method thereof |
| Anderson et al. | | Robust tri-modal automatic speech recognition for consumer applications |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 2023-05-23 | A521 | Request for written amendment filed | JAPANESE INTERMEDIATE CODE: A523 |
| 2023-05-23 | A621 | Written request for application examination | JAPANESE INTERMEDIATE CODE: A621 |
| | TRDD | Decision of grant or rejection written | |
| 2024-07-02 | A01 | Written decision to grant a patent or to grant a registration (utility model) | JAPANESE INTERMEDIATE CODE: A01 |
| 2024-07-18 | A61 | First payment of annual fees (during grant procedure) | JAPANESE INTERMEDIATE CODE: A61 |
| | R150 | Certificate of patent or registration of utility model | Ref document number: 7525304; Country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150 |