CN111932619A - Microphone tracking system and method combining image recognition and voice positioning - Google Patents
- Publication number
- CN111932619A (application CN202010718515.0A)
- Authority
- CN
- China
- Prior art keywords
- microphone
- sound
- scene
- distance
- camera
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/326—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for microphones
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Otolaryngology (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a microphone tracking system and method combining image recognition and voice positioning, and relates to the technical field of voice positioning. The system comprises a camera, a microphone and a background server. The microphone comprises a sound acquisition module and a sound processing module: the sound acquisition module acquires the sound of the current scene, and the sound processing module enhances the sound according to the scene in which the microphone is currently located. The background server calculates the distance between the microphone and the mouth and adjusts the distance and the elevation angle of the microphone. According to the invention, the camera acquires two scene views, a top view and a side view; the microphone in the scene is taken as the origin of the three-dimensional coordinate system; the distance and the elevation angle between the microphone and the person's mouth are calculated by a spatial filtering algorithm; and the current scene is judged to be a near-field scene or a far-field scene. The sound processing module of the microphone then adjusts the strength of the sound, so that the optimal angle and distance are adjusted intelligently, the method suits complex environments, and the user experience is improved.
Description
Technical Field
The invention belongs to the technical field of sound positioning, and particularly relates to a microphone tracking system and method combining image recognition and voice positioning.
Background
Existing voice positioning systems and methods rely on a microphone array alone to complete positioning and cannot track in real time: the microphone array repositions only after the positioning system has been woken by voice, so real-time tracking and monitoring are not possible and the user experience is poor.
Meanwhile, existing voice positioning systems and methods place high demands on the operating environment because of their inherent limitations. On the one hand, their anti-interference capability is poor, for example against echo: in a voice positioning system integrated into equipment such as a television or a sound system, the content played by the equipment itself also interferes with positioning. On the other hand, they adapt poorly to complex environments: positioning accuracy drops in noisy environments and under non-stationary noise, such as several people speaking at once, and it is also affected by room reverberation, for example in a highly reverberant environment surrounded by hard reflective surfaces such as glass.
In addition, existing voice positioning systems and methods are limited by the microphone array: for example, a two-microphone array can only provide 180° planar positioning, and a four-microphone array can only provide 360° planar positioning. Spatial positioning usually requires a microphone array with a complex geometry, and three-dimensional spatial positioning is difficult to achieve with simpler equipment.
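The 180° limit of a two-microphone array can be illustrated with a small sketch. Under a far-field assumption, two microphones yield only one time difference of arrival, and that value is identical for a source at azimuth +θ and for its mirror image at -θ across the array axis. The function name, the 0.1 m spacing and the 343 m/s speed of sound are illustrative assumptions, not values taken from this patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def tdoa_seconds(spacing_m: float, azimuth_deg: float) -> float:
    """Far-field time difference of arrival between two microphones.

    azimuth_deg is measured in the horizontal plane from the array axis
    (the line through both microphones).
    """
    return spacing_m * math.cos(math.radians(azimuth_deg)) / SPEED_OF_SOUND


# A source at +60 degrees and its mirror image at -60 degrees produce the
# same delay, so the pair cannot tell them apart: only 0..180 degrees
# is resolvable.
front = tdoa_seconds(0.1, 60.0)
mirrored = tdoa_seconds(0.1, -60.0)
other = tdoa_seconds(0.1, 120.0)
print(front == mirrored, front != other)
```

Resolving the mirrored direction, or the elevation, needs more information than the pair provides, which is the gap the camera views are meant to fill here.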
Disclosure of Invention
The invention aims to provide a microphone tracking system and a microphone tracking method combining image recognition and voice positioning.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a microphone tracking system combining image recognition and voice positioning, which comprises a camera, a microphone and a background server, wherein the camera is connected with the microphone;
the camera is used for acquiring an image sequence of a current scene and sending the acquired image sequence to the background server for processing;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing sound according to a near-field scene or a far-field scene where the current microphone is located;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for recognizing the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
Preferably, the camera position and the microphone position are unified into one three-dimensional coordinate system.
Preferably, after calculating the distance between the person's mouth and the microphone, the image recognition unit judges the current scene, locates the direction of the mouth relative to the microphone, and feeds the scene and the direction back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and suppresses the sound signals from other directions.
Preferably, the microphone is a two-microphone array fixed directly in front of the lectern; the camera is a set of cameras, one located directly above the microphone and another fixed at one side of the lectern at the same height as the microphone.
The invention relates to a microphone tracking method combining image recognition and voice positioning, which comprises the following steps:
step S1: acquiring an image sequence of the current scene;
step S2: recognizing a human face and the microphone in the image sequence, and caching the recognized three-dimensional coordinates with the microphone as the origin;
step S3: calculating the distance and angle between the microphone and the mouth according to a spatial filtering algorithm;
step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the human mouth;
step S6: the sound processing module of the microphone attenuates or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
The invention has the following beneficial effects:
According to the invention, the camera acquires two scene views, a top view and a side view; the microphone in the scene is taken as the origin of the three-dimensional coordinate system; the distance and the elevation angle between the microphone and the person's mouth are calculated by a spatial filtering algorithm; and the current scene is judged to be a near-field scene or a far-field scene. The sound processing module of the microphone then adjusts the strength of the sound, so that the optimal angle and distance are adjusted intelligently, the method suits complex environments, and the user experience is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a microphone tracking system incorporating image recognition and voice localization according to the present invention;
FIG. 2 is a diagram of the steps of a microphone tracking method combining image recognition and voice localization according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention is a microphone tracking system combining image recognition and voice positioning, including a camera, a microphone and a background server;
the camera is used for acquiring an image sequence of the current scene and sending the acquired image sequence to the background server for processing; the cameras capture a plurality of images, from which the two best pictures are finally selected: a top view captured by the camera directly above and a side view captured by the camera at the side;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing sound according to a near-field scene or a far-field scene where the current microphone is located;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for recognizing the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
The camera position and the microphone position are unified into one three-dimensional coordinate system. With the microphone as the origin, the X-axis and Z-axis coordinates of the human mouth relative to the microphone are obtained from the side view, and the X-axis and Y-axis coordinates are obtained from the top view. The specific position of the mouth in the microphone-centered coordinate system is thus obtained accurately, which makes it convenient to calculate the true distance and angle between the microphone and the mouth.
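The geometry described above can be sketched as follows. This is a minimal illustration using plain Euclidean geometry, not the patent's unspecified spatial filtering algorithm. It assumes that both views have already been converted to metres in the microphone-centered frame, that Z points upward, and that the two X estimates can simply be averaged; the names `fuse_views` and `distance_and_angles` are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class MouthPosition(NamedTuple):
    x: float  # metres, microphone at the origin
    y: float
    z: float  # assumed to point upward


def fuse_views(side_xz: Tuple[float, float],
               top_xy: Tuple[float, float]) -> MouthPosition:
    """Combine the side view (x, z) and the top view (x, y) into one point.

    Both views observe x, so the two estimates are averaged (an assumption).
    """
    x = (side_xz[0] + top_xy[0]) / 2.0
    return MouthPosition(x, top_xy[1], side_xz[1])


def distance_and_angles(p: MouthPosition) -> Tuple[float, float, float]:
    """Return (distance in m, elevation in degrees, azimuth in degrees)."""
    horizontal = math.hypot(p.x, p.y)        # range in the X-Y plane
    distance = math.hypot(horizontal, p.z)   # straight-line distance
    elevation = math.degrees(math.atan2(p.z, horizontal))
    azimuth = math.degrees(math.atan2(p.y, p.x))
    return distance, elevation, azimuth


mouth = fuse_views(side_xz=(0.3, 0.4), top_xy=(0.3, 0.0))
print(distance_and_angles(mouth))
```

Averaging the shared X coordinate is only one way to reconcile the two views; a real system would also need camera calibration to map pixels to metres, which the patent text does not detail.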
After calculating the distance between the person's mouth and the microphone, the image recognition unit judges the current scene, locates the direction of the mouth relative to the microphone, and feeds the scene and the direction back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and suppresses the sound signals from other directions.
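The patent does not name the algorithm used to reinforce one direction and suppress the others. One common technique for a two-microphone array is delay-and-sum beamforming, and the sketch below shows that swapped-in technique under a far-field assumption with whole-sample delays. The function names, the 343 m/s speed of sound and the example spacing are assumptions made for illustration.

```python
import math
from typing import List

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_delay_samples(spacing_m: float, angle_deg: float,
                           sample_rate: int) -> int:
    """Samples by which mic 2 lags mic 1 for a plane wave arriving at
    angle_deg, measured from broadside (perpendicular to the array axis)."""
    delay_s = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(delay_s * sample_rate)


def shift(signal: List[float], n: int) -> List[float]:
    """Delay the signal by n samples (advance it if n < 0), zero-padded."""
    length = len(signal)
    if n >= 0:
        return ([0.0] * n + signal)[:length]
    return (signal[-n:] + [0.0] * (-n))[:length]


def delay_and_sum(mic1: List[float], mic2: List[float],
                  delay_samples: int) -> List[float]:
    """Advance mic 2 by the steering delay, then average both channels.

    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and partly cancels.
    """
    aligned = shift(mic2, -delay_samples)
    return [(a + b) / 2.0 for a, b in zip(mic1, aligned)]


pulse = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
lagged = shift(pulse, 2)                    # same pulse, 2 samples later
print(delay_and_sum(pulse, lagged, 2))      # steered correctly: full pulse
print(delay_and_sum(pulse, lagged, 0))      # steered wrongly: smeared pulse
```

In this setting the steering angle would come from the image recognition result rather than from an acoustic search, which is the point of combining the camera with the array.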
The microphone is a two-microphone array fixed directly in front of the lectern. The camera is a set of cameras: one is located directly above the microphone, and another is fixed at one side of the lectern at the same height as the microphone, so that the captured images make it easy to establish the three-dimensional coordinate system.
Referring to fig. 2, the present invention is a microphone tracking method combining image recognition and voice localization, including the following steps:
step S1: acquiring an image sequence of the current scene;
step S2: recognizing a human face and the microphone in the image sequence, and caching the recognized three-dimensional coordinates with the microphone as the origin;
step S3: calculating the distance and angle between the microphone and the mouth according to a spatial filtering algorithm;
step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the human mouth;
step S6: the sound processing module of the microphone attenuates or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
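Steps S4 and S6 can be sketched as a threshold test followed by a gain change. The patent gives neither the threshold nor the gains, and it does not state which scene is attenuated and which is enhanced; the sketch assumes a 0.5 m threshold, attenuation in the near field and enhancement in the far field. All constants and function names are hypothetical.

```python
from typing import List

NEAR_FIELD_THRESHOLD_M = 0.5  # hypothetical preset distance threshold
NEAR_FIELD_GAIN = 0.5         # hypothetical: attenuate close speech
FAR_FIELD_GAIN = 2.0          # hypothetical: enhance distant speech


def classify_scene(distance_m: float,
                   threshold_m: float = NEAR_FIELD_THRESHOLD_M) -> str:
    """Step S4: compare the mouth-to-microphone distance with the threshold."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return "near-field" if distance_m <= threshold_m else "far-field"


def process_sound(samples: List[float], scene: str) -> List[float]:
    """Step S6: scale samples by the scene gain, clamped to [-1.0, 1.0]."""
    gain = NEAR_FIELD_GAIN if scene == "near-field" else FAR_FIELD_GAIN
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


scene = classify_scene(1.5)
print(scene, process_sound([0.2, -0.8], scene))
```

A fixed gain per scene is the simplest reading of "weakening or enhancing"; a gain that varies smoothly with distance would avoid an audible jump at the threshold.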
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (5)
1. A microphone tracking system combining image recognition and voice positioning, comprising a camera, a microphone and a background server, characterized in that:
the camera is used for acquiring an image sequence of a current scene and sending the acquired image sequence to the background server for processing;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing sound according to a near-field scene or a far-field scene where the current microphone is located;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for recognizing the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
2. The system of claim 1, wherein the camera position and the microphone position are unified into one three-dimensional coordinate system.
3. The microphone tracking system combining image recognition and voice positioning as claimed in claim 1, wherein, after calculating the distance between the person's mouth and the microphone, the image recognition unit judges the current scene, locates the direction of the mouth relative to the microphone, and feeds the scene and the direction back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and suppresses the sound signals from other directions.
4. The system of claim 1, wherein the microphone is a two-microphone array fixed directly in front of the lectern; the camera is a set of cameras, one located directly above the microphone and another fixed at one side of the lectern at the same height as the microphone.
5. A microphone tracking method combining image recognition and voice localization, comprising the steps of:
step S1: acquiring an image sequence of the current scene;
step S2: recognizing a human face and the microphone in the image sequence, and caching the recognized three-dimensional coordinates with the microphone as the origin;
step S3: calculating the distance and angle between the microphone and the mouth according to a spatial filtering algorithm;
step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the human mouth;
step S6: the sound processing module of the microphone attenuates or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010718515.0A CN111932619A (en) | 2020-07-23 | 2020-07-23 | Microphone tracking system and method combining image recognition and voice positioning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111932619A (en) | 2020-11-13 |
Family
ID=73314555
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010718515.0A (CN111932619A, Pending) | Microphone tracking system and method combining image recognition and voice positioning | 2020-07-23 | 2020-07-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111932619A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614508A (en) * | 2020-12-11 | 2021-04-06 | 北京华捷艾米科技有限公司 | Audio and video combined positioning method and device, electronic equipment and storage medium |
WO2023193803A1 (en) * | 2022-04-08 | 2023-10-12 | 南京地平线机器人技术有限公司 | Volume control method and apparatus, storage medium, and electronic device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102160398A (en) * | 2008-07-31 | 2011-08-17 | 诺基亚公司 | Electronic device directional audio-video capture |
CN102223594A (en) * | 2010-04-19 | 2011-10-19 | 鸿富锦精密工业(深圳)有限公司 | Microphone control device and method |
CN104123950A (en) * | 2014-07-17 | 2014-10-29 | 深圳市中兴移动通信有限公司 | Sound recording method and device |
US20150022636A1 (en) * | 2013-07-19 | 2015-01-22 | Nvidia Corporation | Method and system for voice capture using face detection in noisy environments |
CN106024003A (en) * | 2016-05-10 | 2016-10-12 | 北京地平线信息技术有限公司 | Voice positioning and enhancement system and method combining images |
CN106233384A (en) * | 2014-04-17 | 2016-12-14 | 微软技术许可有限责任公司 | Dialog detection |
CN107534725A (en) * | 2015-05-19 | 2018-01-02 | 华为技术有限公司 | A kind of audio signal processing method and device |
CN110691196A (en) * | 2019-10-30 | 2020-01-14 | 歌尔股份有限公司 | Sound source positioning method of audio equipment and audio equipment |
CN111048104A (en) * | 2020-01-16 | 2020-04-21 | 北京声智科技有限公司 | Speech enhancement processing method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106653041B (en) | Audio signal processing apparatus, method and electronic apparatus | |
CN106328156B (en) | Audio and video information fusion microphone array voice enhancement system and method | |
CN106782584B (en) | Audio signal processing device, method and electronic device | |
CN107346661B (en) | Microphone array-based remote iris tracking and collecting method | |
US10582117B1 (en) | Automatic camera control in a video conference system | |
CN110517705B (en) | Binaural sound source positioning method and system based on deep neural network and convolutional neural network | |
Aarabi et al. | Robust sound localization using multi-source audiovisual information fusion | |
CN107534725B (en) | Voice signal processing method and device | |
CN111833899B (en) | Voice detection method based on polyphonic regions, related device and storage medium | |
CN206349145U (en) | Audio signal processing apparatus | |
CN106024003A (en) | Voice positioning and enhancement system and method combining images | |
CN109640224A (en) | A kind of sound pick-up method and device | |
CN112069863B (en) | Face feature validity determination method and electronic equipment | |
CN111932619A (en) | Microphone tracking system and method combining image recognition and voice positioning | |
CN113676592B (en) | Recording method, recording device, electronic equipment and computer readable medium | |
CN111863020B (en) | Voice signal processing method, device, equipment and storage medium | |
Nakadai et al. | Real-time speaker localization and speech separation by audio-visual integration | |
CN110188179B (en) | Voice directional recognition interaction method, device, equipment and medium | |
CN109147787A (en) | A kind of smart television acoustic control identifying system and its recognition methods | |
CN110718227A (en) | Multi-mode interaction based distributed Internet of things equipment cooperation method and system | |
CN111551921A (en) | Sound source orientation system and method based on sound image linkage | |
JP2022062875A (en) | Audio signal processing method and audio signal processing apparatus | |
CN106409306A (en) | Intelligent system obtaining human voice and obtaining method based on the system | |
CN113432276B (en) | Method and equipment for automatically adjusting air conditioner and air conditioner | |
US20230254639A1 (en) | Sound Pickup Method and Apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201113 |