CN111932619A - Microphone tracking system and method combining image recognition and voice positioning - Google Patents

Info

Publication number
CN111932619A
Authority
CN
China
Prior art keywords
microphone
sound
scene
distance
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010718515.0A
Other languages
Chinese (zh)
Inventor
虞焰兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Semxum Information Technology Co ltd
Original Assignee
Anhui Semxum Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-07-23
Filing date
2020-07-23
Publication date
2020-11-13
Application filed by Anhui Semxum Information Technology Co ltd
Priority to CN202010718515.0A
Publication of CN111932619A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/326 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Otolaryngology (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a microphone tracking system and method combining image recognition and voice positioning, and relates to the technical field of voice positioning. The system comprises a camera, a microphone and a background server. The microphone comprises a sound acquisition module and a sound processing module: the sound acquisition module acquires sound from the current scene, and the sound processing module enhances the sound according to the scene in which the microphone is currently located. The background server calculates the distance between the microphone and the mouth and adjusts the distance and the elevation angle of the microphone. According to the invention, the cameras capture two views of the scene, a top view and a side view; with the microphone in the scene taken as the origin of a three-dimensional coordinate system, the distance and the elevation angle between the microphone and the person's mouth are calculated using a spatial filtering algorithm, and the current scene is judged to be a near-field scene or a far-field scene. The sound processing module of the microphone then adjusts the strength of the sound, so that the optimal angle and distance are adjusted intelligently, the method suits complex environments, and the user experience is improved.

Description

Microphone tracking system and method combining image recognition and voice positioning
Technical Field
The invention belongs to the technical field of sound positioning, and particularly relates to a microphone tracking system and method combining image recognition and voice positioning.
Background
Existing voice positioning systems and methods rely on a microphone array to complete positioning and cannot track in real time: the microphone array can only re-position after the positioning system has been woken up by voice. Real-time tracking and monitoring are therefore not possible, and the user experience is poor.
Moreover, because of their inherent limitations, existing voice positioning systems and methods place high demands on the operating environment. On the one hand, their anti-interference capability is poor, for example against echo: in a voice positioning system integrated into equipment such as a television or a sound system, the sound emitted by the equipment itself also interferes with positioning. On the other hand, they adapt poorly to complex environments: positioning accuracy drops in noisy environments and under non-stationary noise, such as several people speaking at the same time, and it is also affected by room reverberation, for example in a highly reverberant environment surrounded by hard reflective materials such as glass.
In addition, existing voice positioning systems and methods are limited by the microphone array. For example, a two-microphone array can only provide 180° planar positioning and a four-microphone array can only provide 360° planar positioning; spatial positioning usually requires a microphone array with a complex geometry, so three-dimensional spatial positioning is difficult to achieve with simpler equipment.
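As an illustration of the two-microphone limitation described above (this sketch is not part of the patent), the following Python snippet computes the far-field time difference of arrival for a two-microphone array. The microphone spacing, the speed of sound and the function name `tdoa_seconds` are assumptions chosen for illustration. Two sources mirrored across the array axis produce the same delay, which is why such an array can only resolve a 180° half-plane.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def tdoa_seconds(azimuth_deg: float, mic_spacing_m: float = 0.1) -> float:
    """Far-field time difference of arrival between the two microphones.

    azimuth_deg is measured from the array axis (the line through both mics).
    The 0.1 m default spacing is an assumed value, not taken from the patent.
    """
    return mic_spacing_m * math.cos(math.radians(azimuth_deg)) / SPEED_OF_SOUND


# A source at 60 degrees and its mirror image at 300 degrees (that is, -60 degrees)
# give the same delay, so the array cannot tell the two directions apart.
front = tdoa_seconds(60.0)
back = tdoa_seconds(300.0)
print(math.isclose(front, back))
```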
Disclosure of Invention
The invention aims to provide a microphone tracking system and a microphone tracking method combining image recognition and voice positioning.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention relates to a microphone tracking system combining image recognition and voice positioning, which comprises a camera, a microphone and a background server, wherein the camera is connected with the microphone;
the camera is used for acquiring an image sequence of the current scene and sending the acquired image sequence to the background server for processing;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing the sound according to whether the microphone is currently in a near-field scene or a far-field scene;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for identifying the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
Preferably, the camera position and the microphone position are unified in one three-dimensional coordinate system.
Preferably, the image recognition unit calculates the distance between the person's mouth and the microphone, then judges the current scene and locates the direction of the mouth relative to the microphone, and feeds the scene and the direction back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and simultaneously suppresses sound signals from other directions.
Preferably, the microphone is a two-microphone array fixed directly in front of the lectern; the camera is a group of cameras, one of which is located directly above the microphone and another of which is fixed at one side of the lectern at the same height as the microphone.
The invention relates to a microphone tracking method combining image recognition and voice positioning, which comprises the following steps:
Step S1: acquiring an image sequence of the current scene;
Step S2: recognizing a human face and the microphone in the image sequence, and identifying and caching three-dimensional coordinates with the microphone as the origin;
Step S3: calculating the distance and the angle between the microphone and the mouth according to a spatial filtering algorithm;
Step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
Step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the person's mouth;
Step S6: the sound processing module of the microphone weakens or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
The invention has the following beneficial effects:
According to the invention, the cameras capture two views of the scene, a top view and a side view; with the microphone in the scene taken as the origin of a three-dimensional coordinate system, the distance and the elevation angle between the microphone and the person's mouth are calculated using a spatial filtering algorithm, and the current scene is judged to be a near-field scene or a far-field scene. The sound processing module of the microphone then adjusts the strength of the sound, so that the optimal angle and distance are adjusted intelligently, the method suits complex environments, and the user experience is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a microphone tracking system incorporating image recognition and voice localization according to the present invention;
FIG. 2 is a diagram of the steps of a microphone tracking method combining image recognition and voice localization according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention is a microphone tracking system combining image recognition and voice positioning, including a camera, a microphone and a background server;
the camera is used for acquiring an image sequence of the current scene and sending the acquired image sequence to the background server for processing; the cameras collect a plurality of images, from which the two best pictures are finally selected: a top view captured by the camera directly above and a side view captured by the camera directly to the side;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing the sound according to whether the microphone is currently in a near-field scene or a far-field scene;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for identifying the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
The camera positions and the microphone position are unified in a single three-dimensional coordinate system whose origin is the microphone. The X-axis and Z-axis coordinates of the person's mouth relative to the microphone origin are obtained from the side view, and the X-axis and Y-axis coordinates of the mouth relative to the microphone origin are obtained from the top view. This gives the exact position of the mouth in the three-dimensional coordinate system with the microphone as the origin, from which the true distance and angle between the microphone and the mouth can conveniently be calculated.
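The coordinate fusion described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the offsets from both views have already been converted from pixels to metres, and the function names and the averaging of the two X estimates are hypothetical choices.

```python
import math


def mouth_position(side_xz, top_xy):
    """Fuse the side view (x, z) and the top view (x, y) into one 3-D point.

    Both inputs are offsets of the mouth from the microphone origin, in metres.
    Averaging the two X estimates is an assumption made for this sketch.
    """
    x_side, z = side_xz
    x_top, y = top_xy
    x = 0.5 * (x_side + x_top)
    return (x, y, z)


def distance_and_elevation(point):
    """Return (distance in metres, elevation angle in degrees) from the origin."""
    x, y, z = point
    distance = math.sqrt(x * x + y * y + z * z)
    horizontal = math.hypot(x, y)  # projection onto the X-Y plane
    elevation_deg = math.degrees(math.atan2(z, horizontal))
    return distance, elevation_deg


# Example: mouth 0.3 m in front of and 0.4 m above the microphone.
pos = mouth_position(side_xz=(0.30, 0.40), top_xy=(0.30, 0.0))
dist, elev = distance_and_elevation(pos)
print(f"distance = {dist:.2f} m, elevation = {elev:.1f} deg")
```

The elevation angle is measured here from the horizontal X-Y plane; a real system would also need a camera calibration step, which the source does not describe.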
The image recognition unit calculates the distance between the person's mouth and the microphone, then judges the current scene and locates the direction of the mouth relative to the microphone, and feeds the scene and the direction back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and simultaneously suppresses sound signals from other directions.
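The source does not name the algorithm used to reinforce the located direction. One common technique that fits the description is a delay-and-sum beamformer; the sketch below applies it to a two-microphone array under a far-field approximation with whole-sample delays. The spacing, sample rate, angle convention and function names are all assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delay_samples(angle_deg, mic_spacing_m, sample_rate_hz):
    """Whole-sample lag of microphone 2 behind microphone 1.

    angle_deg is measured from broadside (perpendicular to the array axis),
    positive towards microphone 1. Far-field propagation is assumed.
    """
    tau = mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(tau * sample_rate_hz)


def delay_and_sum(ch1, ch2, delay):
    """Advance ch2 by `delay` samples, then average it with ch1.

    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and partly cancels.
    """
    n = len(ch1)
    out = []
    for i in range(n):
        j = i + delay
        aligned = ch2[j] if 0 <= j < n else 0.0
        out.append(0.5 * (ch1[i] + aligned))
    return out


# Example: a click reaches microphone 1 at sample 10 and microphone 2 at sample 15.
ch1 = [0.0] * 32
ch2 = [0.0] * 32
ch1[10] = 1.0
ch2[15] = 1.0
delay = steering_delay_samples(90.0, mic_spacing_m=0.1, sample_rate_hz=16000)
steered = delay_and_sum(ch1, ch2, delay)
print(delay, max(steered))
```

In the system described here, the steering angle would come from the image recognition unit rather than from acoustic estimation.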
The microphone is a two-microphone array fixed directly in front of the lectern. The camera is a group of cameras: one is located directly above the microphone, and another is fixed at one side of the lectern at the same height as the microphone, so that the captured images make it easy to establish the three-dimensional coordinate system.
Referring to FIG. 2, the present invention is a microphone tracking method combining image recognition and voice positioning, including the following steps:
Step S1: acquiring an image sequence of the current scene;
Step S2: recognizing a human face and the microphone in the image sequence, and identifying and caching three-dimensional coordinates with the microphone as the origin;
Step S3: calculating the distance and the angle between the microphone and the mouth according to a spatial filtering algorithm;
Step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
Step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the person's mouth;
Step S6: the sound processing module of the microphone weakens or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
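Steps S4 and S6 can be sketched as a threshold test followed by a gain decision. The source gives neither the threshold value nor the gain rule, so the 0.5 m threshold, the inverse-distance (6 dB per doubling) compensation, the 12 dB clamp and all names below are assumptions for illustration only.

```python
import math
from dataclasses import dataclass

NEAR_FIELD_THRESHOLD_M = 0.5  # hypothetical preset distance threshold, in metres


@dataclass
class SceneDecision:
    scene: str      # "near-field" or "far-field"
    gain_db: float  # negative weakens the sound, positive enhances it


def decide_scene(distance_m, threshold_m=NEAR_FIELD_THRESHOLD_M, max_gain_db=12.0):
    """Classify the scene (step S4) and choose a compensating gain (step S6)."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    scene = "near-field" if distance_m < threshold_m else "far-field"
    # Assumed rule: compensate inverse-distance level loss relative to the threshold.
    gain_db = 20.0 * math.log10(distance_m / threshold_m)
    gain_db = max(-max_gain_db, min(max_gain_db, gain_db))
    return SceneDecision(scene, gain_db)


for d in (0.25, 0.5, 1.0):
    decision = decide_scene(d)
    print(f"{d:.2f} m -> {decision.scene}, gain {decision.gain_db:+.1f} dB")
```

With this rule a speaker closer than the threshold is attenuated and a speaker farther away is amplified, which matches the "weakening or enhancing" behaviour attributed to the sound processing module.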
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. A microphone tracking system combining image recognition and voice positioning, comprising a camera, a microphone and a background server, characterized in that:
the camera is used for acquiring an image sequence of the current scene and sending the acquired image sequence to the background server for processing;
the microphone comprises a sound acquisition module and a sound processing module; the sound acquisition module is used for acquiring sound of the current scene; the sound processing module is used for weakening or enhancing the sound according to whether the microphone is currently in a near-field scene or a far-field scene;
the background server comprises an image recognition unit and a microphone tracking unit; the image recognition unit is used for identifying the mouth position of a person and the microphone position in the image sequence, taking the microphone as the origin of a three-dimensional coordinate system, calculating the distance between the microphone and the mouth according to a spatial filtering algorithm, and judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold; the microphone tracking unit is used for adjusting the distance and the elevation angle of the microphone according to the calculated distance between the microphone and the mouth.
2. The microphone tracking system combining image recognition and voice positioning as claimed in claim 1, wherein the camera position and the microphone position are unified in one three-dimensional coordinate system.
3. The microphone tracking system combining image recognition and voice positioning as claimed in claim 1, wherein the image recognition unit calculates the distance from the person's mouth to the microphone, determines the current scene and locates the direction of the mouth relative to the microphone, and feeds the determined distance back to the sound processing module of the microphone; the sound processing module reinforces the sound signal from the located direction and simultaneously suppresses sound signals from other directions.
4. The microphone tracking system combining image recognition and voice positioning as claimed in claim 1, wherein the microphone is a two-microphone array fixed directly in front of the lectern; the camera is a group of cameras, one of which is located directly above the microphone and another of which is fixed at one side of the lectern at the same height as the microphone.
5. A microphone tracking method combining image recognition and voice positioning, comprising the steps of:
Step S1: acquiring an image sequence of the current scene;
Step S2: recognizing a human face and the microphone in the image sequence, and identifying and caching three-dimensional coordinates with the microphone as the origin;
Step S3: calculating the distance and the angle between the microphone and the mouth according to a spatial filtering algorithm;
Step S4: judging whether the current scene is a near-field scene or a far-field scene by using a preset distance threshold;
Step S5: the microphone tracking unit adjusts the distance and the elevation angle between the microphone and the person's mouth;
Step S6: the sound processing module of the microphone weakens or enhances the sound depending on whether the current scene is a near-field scene or a far-field scene.
CN202010718515.0A 2020-07-23 2020-07-23 Microphone tracking system and method combining image recognition and voice positioning Pending CN111932619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010718515.0A CN111932619A (en) 2020-07-23 2020-07-23 Microphone tracking system and method combining image recognition and voice positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010718515.0A CN111932619A (en) 2020-07-23 2020-07-23 Microphone tracking system and method combining image recognition and voice positioning

Publications (1)

Publication Number Publication Date
CN111932619A true CN111932619A (en) 2020-11-13

Family

ID=73314555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010718515.0A Pending CN111932619A (en) 2020-07-23 2020-07-23 Microphone tracking system and method combining image recognition and voice positioning

Country Status (1)

Country Link
CN (1) CN111932619A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614508A (en) * 2020-12-11 2021-04-06 北京华捷艾米科技有限公司 Audio and video combined positioning method and device, electronic equipment and storage medium
WO2023193803A1 (en) * 2022-04-08 2023-10-12 南京地平线机器人技术有限公司 Volume control method and apparatus, storage medium, and electronic device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102160398A (en) * 2008-07-31 2011-08-17 诺基亚公司 Electronic device directional audio-video capture
CN102223594A (en) * 2010-04-19 2011-10-19 鸿富锦精密工业(深圳)有限公司 Microphone control device and method
CN104123950A (en) * 2014-07-17 2014-10-29 深圳市中兴移动通信有限公司 Sound recording method and device
US20150022636A1 (en) * 2013-07-19 2015-01-22 Nvidia Corporation Method and system for voice capture using face detection in noisy environments
CN106024003A (en) * 2016-05-10 2016-10-12 北京地平线信息技术有限公司 Voice positioning and enhancement system and method combining images
CN106233384A (en) * 2014-04-17 2016-12-14 微软技术许可有限责任公司 Dialog detection
CN107534725A (en) * 2015-05-19 2018-01-02 华为技术有限公司 A kind of audio signal processing method and device
CN110691196A (en) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source positioning method of audio equipment and audio equipment
CN111048104A (en) * 2020-01-16 2020-04-21 北京声智科技有限公司 Speech enhancement processing method, device and storage medium

Similar Documents

Publication Publication Date Title
CN106653041B (en) Audio signal processing apparatus, method and electronic apparatus
CN106328156B (en) Audio and video information fusion microphone array voice enhancement system and method
CN106782584B (en) Audio signal processing device, method and electronic device
CN107346661B (en) Microphone array-based remote iris tracking and collecting method
US10582117B1 (en) Automatic camera control in a video conference system
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
CN107534725B (en) Voice signal processing method and device
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
CN206349145U (en) Audio signal processing apparatus
CN106024003A (en) Voice positioning and enhancement system and method combining images
CN109640224A (en) A kind of sound pick-up method and device
CN112069863B (en) Face feature validity determination method and electronic equipment
CN111932619A (en) Microphone tracking system and method combining image recognition and voice positioning
CN113676592B (en) Recording method, recording device, electronic equipment and computer readable medium
CN111863020B (en) Voice signal processing method, device, equipment and storage medium
Nakadai et al. Real-time speaker localization and speech separation by audio-visual integration
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN109147787A (en) A kind of smart television acoustic control identifying system and its recognition methods
CN110718227A (en) Multi-mode interaction based distributed Internet of things equipment cooperation method and system
CN111551921A (en) Sound source orientation system and method based on sound image linkage
JP2022062875A (en) Audio signal processing method and audio signal processing apparatus
CN106409306A (en) Intelligent system obtaining human voice and obtaining method based on the system
CN113432276B (en) Method and equipment for automatically adjusting air conditioner and air conditioner
US20230254639A1 (en) Sound Pickup Method and Apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201113