CN110767226A - Sound source positioning method and device with high accuracy, voice recognition method and system, storage device and terminal - Google Patents

Sound source positioning method and device with high accuracy, voice recognition method and system, storage device and terminal

Info

Publication number
CN110767226A
Authority
CN
China
Prior art keywords
sound source
voice
voiceprint
voice signal
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911048283.6A
Other languages
Chinese (zh)
Other versions
CN110767226B (en)
Inventor
周辉
高鑫
任亚敏
邓朋朋
王之帅
王宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Jiansheng Technology Co Ltd
Original Assignee
Shanxi Jiansheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Jiansheng Technology Co Ltd
Priority to CN201911048283.6A
Publication of CN110767226A
Application granted
Publication of CN110767226B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22 - Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a high-accuracy sound source positioning method and device. The method comprises the following steps: collecting sound signals; judging whether a voice signal exists in the sound signals; extracting all voice signals and acquiring the sound source position of each voice signal; performing voiceprint recognition on each voice signal one by one; judging whether the recognized voiceprint features are stored in a voiceprint database; acquiring image information of the sound source position where the voice signal corresponding to the voiceprint features is located; performing model training by a machine self-learning method, determining the speaker corresponding to the voiceprint features and the speaker's identity information, and storing the corresponding voiceprint features and speaker identity information in the voiceprint database; and displaying the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker. The invention can accurately locate the speaker's position and match the speaker's identity with the spoken content; the method is suitable for the field of voice recognition.

Description

Sound source positioning method and device with high accuracy, voice recognition method and system, storage device and terminal
Technical Field
The invention relates to the technical field of voice recognition, and in particular to a high-accuracy sound source positioning method and device, a high-pertinence voice recognition method and system, a storage device, and a terminal.
Background
As one of the most important technologies in the field of information technology, speech recognition has been widely applied in fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. When speech recognition is performed, a microphone-array-based algorithm is usually used for sound source localization.
Generally, microphone-array-based sound source localization algorithms fall into three categories: methods based on beamforming, methods based on high-resolution spectral estimation, and methods based on the time difference of arrival (TDOA). (1) Beamforming based on maximum output power weights and sums the signals collected by each array element to form a beam, steers the beam over candidate sound source positions, and adjusts the weights so that the output power of the microphone array is maximized. The method can be applied in either the time domain or the frequency domain; a time shift in the time domain is equivalent to a phase delay in the frequency domain. In frequency-domain processing, a matrix containing the auto-spectra and cross-spectra of the channels, called the cross-spectral matrix (CSM), is formed first. At each frequency of interest, processing the array signals yields the energy level at each given spatial scanning grid point, or for each direction of arrival (DOA), so that the array output represents a summed response associated with the sound source distribution. This approach suits large microphone arrays and adapts well to different test environments. (2) High-resolution spectral estimation methods include autoregressive (AR) models, minimum-variance (MV) spectral estimation, and eigenvalue-decomposition methods such as the MUSIC algorithm; all of them compute the spatial correlation matrix from the signals acquired by the microphone array. In theory these methods can estimate the sound source direction effectively, but in practice achieving the desired precision requires heavy computation and many assumptions; when the array is large, spectral estimation becomes computationally expensive and sensitive to environmental noise, which easily leads to inaccurate localization, so it is rarely used in modern large sound source localization systems. (3) TDOA-based localization generally proceeds in two steps: first, the time differences of arrival, i.e. the delays between the array elements of the microphone array, are estimated; then the sound source position is determined from the estimated delays combined with the known spatial geometry of the microphone array. Its computational load is generally smaller than that of the previous two methods, which favours real-time processing, but its localization accuracy and interference resistance are weaker; it suits near-field, single-source, non-repetitive signals such as speech signals. The Kinect microphone array of the Microsoft Xbox 360 (a one-dimensional array of 4 unequally spaced microphones) is a typical application of the TDOA algorithm.
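As a concrete illustration of the third class of algorithms above, the following sketch estimates the inter-microphone delay with the generalized cross-correlation with phase transform (GCC-PHAT), one common way to obtain a TDOA estimate, and converts it to a direction of arrival for a two-microphone pair. This is a minimal, hedged example and not the algorithm claimed in this patent; the gcc_phat helper, the 16 kHz sampling rate, the 0.10 m microphone spacing, and the synthetic signals are assumptions made purely for illustration (NumPy only).

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                # phase transform: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))   # centre zero lag
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Assumed two-microphone geometry: 0.10 m spacing, 16 kHz sampling, 343 m/s sound speed.
fs, d, c = 16000, 0.10, 343.0
mic1 = np.random.randn(fs)                # stand-ins for one second of real recordings
mic2 = np.roll(mic1, 3)                   # mic2 lags mic1 by 3 samples
tau = gcc_phat(mic2, mic1, fs, max_tau=d / c)
angle = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(f"estimated delay {tau * 1e6:.1f} us, DOA ~ {angle:.1f} deg")
```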
At present, traditional sound source localization algorithms depend heavily on the number of microphones in the array and cope poorly with sound signals of low signal-to-noise ratio. In scenes such as multi-person conferences and cocktail parties, traditional sound source localization barely works at all: it cannot accurately determine the speaker's position, let alone the speaker's identity and the content of the speech.
Disclosure of Invention
In view of the above deficiencies in the related art, the technical problem to be solved by the invention is to provide a sound source localization method and apparatus, a speech recognition method and system, a storage device, and a terminal that can accurately locate the speaker's position and match the speaker's identity with the content of the utterance.
In order to solve the above technical problem, the present invention provides a sound source localization method with high accuracy, which comprises the following steps: S101, collecting sound signals in the environment in real time; S102, judging whether a voice signal exists in the collected sound signals, and if so, executing step S103, otherwise returning to step S101; S103, extracting all voice signals and acquiring the sound source position information of each voice signal; S104, performing voiceprint recognition on each extracted voice signal one by one; S105, judging whether the currently recognized voiceprint features are stored in a voiceprint database, and if so, executing step S108, otherwise executing step S106, the voiceprint database storing a plurality of groups of voiceprint features and the identity information of the speaker uniquely corresponding to each group of voiceprint features; S106, acquiring the image information of the sound source position where the voice signal corresponding to the voiceprint features is located; S107, performing model training by a machine self-learning method, determining the speaker corresponding to the voiceprint features and the speaker's identity information, and storing the corresponding voiceprint features and speaker identity information in the voiceprint database; and S108, displaying the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker.
Preferably, after step S103 is completed, steps S103-1 to S103-3 are executed first, and then step S104 is executed: S103-1, collecting the image information of the sound source position where each voice signal is located; S103-2, judging one by one whether a person is present at each corresponding sound source position, and if so, executing step S103-3, otherwise returning to step S101; S103-3, determining that the voice signal corresponding to a sound source position where a person is present is an effective voice signal, and then executing step S104; the voice signal described in step S104 is an effective voice signal.
Preferably, the image information in step S106 includes face feature information and lip movement feature information of the person; after step S106 is executed, steps S106-1 to S106-4 are executed, and then step S107 is executed: S106-1, performing face recognition and lip movement recognition on the acquired image information, and determining the number of people currently speaking at the sound source position; S106-2, judging whether only one person is currently speaking, and if so, executing step S107, otherwise executing step S106-3; S106-3, continuing to collect sound signals at the sound source position; S106-4, judging whether another voice signal exists at the sound source position, and if so, returning to step S104, otherwise repeatedly executing step S106-2.
The present invention also provides a sound source localization apparatus with high accuracy, comprising: a first sound collection unit, configured to collect sound signals in the environment in real time; a first judgment unit, configured to judge whether a voice signal exists in the collected sound signals; a voice extraction and localization unit, configured to extract all voice signals and acquire the sound source position information of each voice signal when voice signals exist in the collected sound signals; a voiceprint recognition unit, configured to perform voiceprint recognition on each extracted voice signal one by one; a second judgment unit, configured to judge whether the currently recognized voiceprint features are stored in a voiceprint database, the voiceprint database storing a plurality of groups of voiceprint features and the identity information of the speaker uniquely corresponding to each group of voiceprint features; an image acquisition unit, configured to acquire, when the currently recognized voiceprint features are not stored in the voiceprint database, the image information of the sound source position where the voice signal corresponding to the voiceprint features is located; a machine learning unit, configured to perform model training by a machine self-learning method, determine the speaker corresponding to the voiceprint features and the speaker's identity information, and store the corresponding voiceprint features and speaker identity information in the voiceprint database; and a display unit, configured to display, when the currently recognized voiceprint features are stored in the voiceprint database, the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker.
Preferably, the apparatus further comprises: an image collection unit, configured to collect, after all voice signals are extracted and the sound source position information of each voice signal is acquired, the image information of the sound source position where each voice signal is located; a third judgment unit, configured to judge, one by one, whether a person is present at each corresponding sound source position; and an effective speech determination unit, configured to determine, when a person is present at the corresponding sound source position, that the voice signal from that position is an effective voice signal, after which voiceprint recognition is performed on all effective voice signals one by one.
Preferably, the image information in the image acquisition unit includes face feature information and lip movement feature information of the person, and the sound source localization apparatus with high accuracy further comprises: a speaker counting unit, configured to perform, after the image information of the sound source position where the voice signal corresponding to the voiceprint features is located has been acquired because the currently recognized voiceprint features are not stored in the voiceprint database, face recognition and lip movement recognition on the acquired image information and determine the number of people currently speaking at the sound source position; a fourth judgment unit, configured to judge whether only one person is currently speaking, and if so, to perform model training by a machine self-learning method, determine the speaker corresponding to the voiceprint features and the speaker's identity information, and store the corresponding voiceprint features and speaker identity information in the voiceprint database; a second sound collection unit, configured to continue collecting sound signals at the sound source position when more than one person is currently speaking; and a fifth judgment unit, configured to judge whether another voice signal exists at the sound source position, and if so, to perform voiceprint recognition on each extracted voice signal one by one, otherwise to repeatedly judge whether only one person is currently speaking.
The invention also provides a high-pertinence speech recognition method, which comprises the following steps: S10, determining the sound source position information of the voice signal and the identity information of the corresponding speaker by a sound source localization method; S20, in response to a voice recognition command, converting the voice content of the designated speaker into text content and displaying it; the sound source localization method is the high-accuracy sound source localization method described above.
The invention also provides a high-pertinence speech recognition system, comprising: a sound source localization apparatus, configured to determine the sound source position information of the voice signal and the identity information of the corresponding speaker by a sound source localization method; and a voice conversion module, configured to convert, in response to a voice recognition command, the voice content of the designated speaker into text content and display it; the sound source localization apparatus is the high-accuracy sound source localization apparatus described above.
The invention also provides a storage device in which a plurality of instructions are stored, the instructions being adapted to be loaded by a processor and to perform the high-pertinence speech recognition method described above.
The present invention also provides a terminal, comprising: a processor adapted to implement instructions; and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the high-pertinence speech recognition method described above.
The invention has the beneficial technical effects that:
1. In the invention, the first sound collection unit first collects the various sound signals in the environment in real time, and it is then judged whether the collected sound signals contain voice signals; if not, the sound signals are not the intended target signals and are not processed; if voice signals exist, the sound source position of each voice signal is acquired on the one hand, and voiceprint recognition is performed on the voice signals one by one on the other hand. Because the size and shape of the vocal organs used in speaking (e.g. tongue, teeth, larynx, lungs, nasal cavity) differ from person to person, the voiceprint features of any two speakers are different, so in the invention each voice signal uniquely corresponds to a group of voiceprint features as long as the signals come from different speakers. After a voiceprint feature corresponding to a voice signal is obtained, it is first judged whether the voiceprint feature is stored in the voiceprint database; if so, the voiceprint feature was collected in advance and stored in the voiceprint database and the identity of the speaker with this voiceprint feature has already been confirmed, so the current position information of the corresponding speaker and the speaker's identity information are displayed directly through the display unit; if not, the speaker's position and identity can only be confirmed and matched by on-site collection: the image information of the sound source position where the voice signal is located is first obtained through the image acquisition unit, model training is then performed on the speaker by a machine self-learning method, the speaker corresponding to the voiceprint feature and the speaker's identity information are finally determined, the corresponding voiceprint feature and speaker identity information are stored in the voiceprint database, and the current position information of the corresponding speaker and the speaker's identity information are then displayed through the display unit. In this way it is possible to know accurately who is currently speaking and where. If a user wants to know what a given speaker has said, the user only needs to send a request to the voice conversion module, which converts the voice content of the designated speaker into text content and displays it to the requester. The invention adds voiceprint recognition and image recognition to traditional sound source localization, can accurately locate the speaker's position and match the speaker's identity with the spoken content, depends little on the number of microphones, can accurately localize sound sources in low signal-to-noise-ratio environments, and is fully suited to environments with complex sound sources such as multi-person conferences and cocktail parties.
2. In the invention, after it is determined that voice signals exist in the collected sound signals, the image information of the sound source position of each voice signal can be collected through the image collection unit, and it is then judged whether a person is present at each sound source position; only if a person is present are the voice signals processed further. This eliminates interference from sound sources that do not come from a human body (such as the sound of a radio or a television) and improves accuracy and recognition efficiency.
3. In the invention, when the speaker's position and identity need to be collected on site, the image information obtained on site by the image acquisition unit includes the speaker's face feature information and lip movement feature information, so face recognition and lip movement recognition can be performed on the acquired information to determine the number of people currently speaking at the sound source position. It is then judged whether only one person is currently speaking; if so, machine self-learning is performed directly; if several people are speaking, it is first judged whether another voice signal exists at the sound source position, and if so, voiceprint recognition is performed directly, otherwise the number of speakers is judged in a loop until only one person is speaking, and machine self-learning is then performed. In this way the identity and position of the current speaker can finally be uniquely determined, avoiding the interference that arises when several people speak at the same time and it is impossible to tell to whom the currently collected voiceprint belongs.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a schematic flow chart of a sound source positioning method with high accuracy according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a sound source positioning device with high accuracy according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for speech recognition with high pertinence according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition system with high pertinence according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of a sound source positioning method with high accuracy according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of a sound source positioning device with high accuracy according to a second embodiment of the present invention;
fig. 7 is a schematic flowchart of a sound source positioning method with high accuracy according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a sound source positioning device with high accuracy according to a third embodiment of the present invention;
fig. 9 is a schematic flowchart of a sound source positioning method with high accuracy according to a fourth embodiment of the present invention;
fig. 10 is a schematic structural diagram of a sound source positioning device with high accuracy according to a fourth embodiment of the present invention;
in the figures: 101 is a first sound collection unit, 102 is a first judgment unit, 103 is a voice extraction and localization unit, 104 is a voiceprint recognition unit, 105 is a second judgment unit, 106 is an image acquisition unit, 107 is a machine learning unit, 108 is a display unit, 109 is an identity marking unit, 103-1 is an image collection unit, 103-2 is a third judgment unit, 103-3 is an effective speech determination unit, 106-1 is a speaker counting unit, 106-2 is a fourth judgment unit, 106-3 is a second sound collection unit, 106-4 is a fifth judgment unit, 10 is a sound source localization apparatus, and 20 is a voice conversion module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Next, the present invention is described in detail with reference to the schematic drawings, and when the embodiments of the present invention are described in detail, the schematic drawings are only examples for convenience of description, and should not limit the scope of the present invention.
An embodiment of a sound source positioning method, a sound source positioning apparatus, a speech recognition method, a speech recognition system, a storage device, and a terminal with high accuracy is described in detail below with reference to the accompanying drawings.
Example one
Fig. 1 is a schematic flowchart of a sound source positioning method with high accuracy according to an embodiment of the present invention, and as shown in fig. 1, the sound source positioning method with high accuracy may include the following steps:
s101, sound signals in the environment are collected in real time.
S102, judging whether the collected sound signals contain voice signals or not, if yes, executing a step S103, and if not, returning to the step S101.
S103, extracting all voice signals and acquiring the position information of the sound source of each voice signal.
And S104, performing voiceprint recognition on each extracted voice signal one by one.
S105, judging whether the currently identified voiceprint features are stored in a voiceprint database, if so, executing a step S108, otherwise, executing a step S106; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics.
And S106, acquiring image information of the sound source position where the voice signal corresponding to the voiceprint feature is located.
And S107, performing model training by using a machine self-learning method, determining the speaker corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database.
And S108, displaying the sound source position information of the voice signal corresponding to the voiceprint characteristic and the identity information of the corresponding speaker.
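To make the control flow of steps S101 to S108 concrete, here is a minimal Python sketch of one pass through the loop. Every helper (collect_sound, detect_speech, localize, extract_voiceprint, capture_image, train_and_register, display) and the Speaker/voiceprint_db structures are hypothetical stand-ins invented for illustration; they are not an implementation disclosed by this patent, which would instead place a microphone array, a voice activity detector, a voiceprint model, and a camera behind these interfaces.

```python
import random
from dataclasses import dataclass

@dataclass
class Speaker:
    identity: str       # e.g. a name or an avatar id
    voiceprint: tuple   # feature vector registered in the voiceprint database (S105/S107)

voiceprint_db: dict = {}   # registered voiceprint features -> Speaker

# Toy stand-ins; a real system would back each of these with actual hardware and models.
def collect_sound():            return [random.random() for _ in range(160)]     # S101
def detect_speech(frame):       return [frame] if max(frame) > 0.5 else []       # S102
def localize(segment):          return (1.5, 0.0)                                # S103 (x, y in metres)
def extract_voiceprint(seg):    return (round(sum(seg), 1), len(seg))            # S104
def capture_image(position):    return f"image@{position}"                       # S106
def train_and_register(feature, image, db):                                      # S107
    speaker = Speaker(identity=f"speaker-{len(db) + 1}", voiceprint=feature)
    db[feature] = speaker
    return speaker
def display(position, identity):                                                 # S108
    print(f"{identity} is speaking at {position}")

def localization_step():
    frame = collect_sound()                          # S101: real-time capture
    for segment in detect_speech(frame):             # S102: proceed only if speech is present
        position = localize(segment)                 # S103: source position of each voice signal
        feature = extract_voiceprint(segment)        # S104: voiceprint recognition
        speaker = voiceprint_db.get(feature)         # S105: already in the voiceprint database?
        if speaker is None:
            image = capture_image(position)          # S106: image of the source position
            speaker = train_and_register(feature, image, voiceprint_db)   # S107
        display(position, speaker.identity)          # S108: show position and identity

for _ in range(3):
    localization_step()
```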
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 2 is a schematic structural diagram of the sound source positioning device with high accuracy according to an embodiment of the present invention, and as shown in fig. 2, the sound source positioning device with high accuracy may include:
the first sound collection unit 101: for collecting sound signals in real time in an environment.
The first judgment unit 102: and the method is used for judging whether the voice signal exists in the collected sound signal.
Speech extraction and localization unit 103: the method is used for extracting all voice signals and acquiring the position information of a sound source where each voice signal is located when the voice signals exist in the collected voice signals.
Voiceprint recognition unit 104: for performing voiceprint recognition on each extracted voice signal one by one.
Second determination unit 105: the voice print database is used for judging whether the currently identified voice print characteristics are stored in the voice print database; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics.
The image acquisition unit 106: and the voice recognition module is used for acquiring the image information of the sound source position where the voice signal corresponding to the voiceprint feature is located when the currently recognized voiceprint feature is not stored in the voiceprint database.
The machine learning unit 107: the method is used for carrying out model training by utilizing a machine self-learning method, determining the speaker corresponding to the voiceprint characteristics and the identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database.
The display unit 108: and the voice recognition module is used for displaying the sound source position information of the voice signal corresponding to the voiceprint characteristics and the identity information of the corresponding speaker when the currently recognized voiceprint characteristics are stored in the voiceprint database.
In addition, the present embodiment further provides a speech recognition method with high pertinence, fig. 3 is a schematic flow chart of the speech recognition method with high pertinence according to an embodiment of the present invention, and as shown in fig. 3, the speech recognition method with high pertinence may include the following steps:
S10, determining the sound source position information of the voice signal and the identity information of the corresponding speaker by a sound source localization method.
S20, in response to a voice recognition command, converting the voice content of the designated speaker into text content and displaying it.
The sound source positioning method is the sound source positioning method with high accuracy.
Correspondingly, the present embodiment further provides a speech recognition system with high pertinence, fig. 4 is a schematic structural diagram of a speech recognition system with high pertinence provided in the first embodiment of the present invention, and as shown in fig. 4, the speech recognition system with high pertinence may include:
sound source localization apparatus 10: the method is used for determining the sound source position information of the voice signal and the identity information of the corresponding speaker by adopting a sound source positioning method.
The voice conversion module 20: and the voice recognition device is used for responding to the voice recognition command, converting the voice content corresponding to the appointed speaker into the character content and displaying the character content.
The sound source localization apparatus 10 is the sound source localization apparatus with high accuracy described above.
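A hedged sketch of how the voice conversion module 20 might serve step S20: recognition is dispatched only for the designated speaker's buffered speech. The segments_by_speaker layout, the example names, and the transcribe stub are assumptions made for illustration; a real system would call an actual speech-to-text engine here.

```python
# Speech segments buffered per speaker by the sound source localization stage (S10).
segments_by_speaker = {
    "Zhang San": [b"...pcm frames...", b"..."],
    "Li Si":     [b"..."],
}

def transcribe(pcm_frames: list) -> str:
    """Stand-in for a real speech-to-text engine."""
    return f"<{len(pcm_frames)} segment(s) of transcribed text>"

def handle_recognition_command(designated_speaker: str) -> str:
    """S20: convert only the designated speaker's speech to text and display it."""
    frames = segments_by_speaker.get(designated_speaker, [])
    text = transcribe(frames)
    print(f"{designated_speaker}: {text}")
    return text

handle_recognition_command("Zhang San")
```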
In this embodiment, the first sound collection unit 101 first collects the various sound signals in the environment in real time, and it is then judged whether the collected sound signals contain voice signals; if not, the sound signals are not the intended target signals and are not processed; if voice signals exist, the sound source position of each voice signal is acquired on the one hand, and voiceprint recognition is performed on the voice signals one by one on the other hand. Because the size and shape of the vocal organs used in speaking (e.g. tongue, teeth, larynx, lungs, nasal cavity) differ from person to person, the voiceprint features of any two speakers are different, so each voice signal uniquely corresponds to a group of voiceprint features as long as the signals come from different speakers. After a voiceprint feature corresponding to a voice signal is obtained, it is first judged whether the voiceprint feature is stored in the voiceprint database; if so, the voiceprint feature was collected in advance and stored in the voiceprint database and the identity of the speaker with this voiceprint feature has already been confirmed, so the current position information of the corresponding speaker and the speaker's identity information are displayed directly through the display unit 108; if not, the speaker's position and identity can only be confirmed and matched by on-site collection: the image information of the sound source position where the voice signal is located is first acquired through the image acquisition unit 106, model training is then performed on the speaker by a machine self-learning method, the speaker corresponding to the voiceprint feature and the speaker's identity information are finally determined, the corresponding voiceprint feature and speaker identity information are stored in the voiceprint database, and the current position information of the corresponding speaker and the speaker's identity information are then displayed through the display unit 108. In this way it is possible to know accurately who is currently speaking and where. If a user wants to know who said what, the user only needs to send a request to the voice conversion module 20, which converts the voice content of the designated speaker into text content and displays it to the requester. The invention adds voiceprint recognition and image recognition to traditional sound source localization, can accurately locate the speaker's position and match the speaker's identity with the spoken content, depends little on the number of microphones, can accurately localize sound sources in low signal-to-noise-ratio environments, and is fully suited to environments with complex sound sources such as multi-person conferences and cocktail parties.
In a specific implementation, the speaker's position information can be displayed in the form of a map; the geographic centre of the map can be the position of the user wearing the device, and the positions of all speakers are marked one by one around the user's position on the map and displayed in real time. Specifically, the identity information of the speaker can be the speaker's avatar or name. Before the device is used, the voiceprint information and identity information of speakers can be collected in advance, so that they can be matched directly during use, saving a large amount of time otherwise spent on on-site collection and matching.
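For the map-style display described above, the following minimal sketch assumes the device reports each speaker as a bearing and distance relative to the wearer, and converts those readings into marker offsets centred on the wearer's position; the readings and the speaker_marker helper are illustrative assumptions, not part of the patent disclosure.

```python
import math

def speaker_marker(bearing_deg: float, distance_m: float):
    """Convert a bearing/distance relative to the wearer into (east, north) map offsets in metres."""
    rad = math.radians(bearing_deg)
    return (distance_m * math.sin(rad), distance_m * math.cos(rad))

# Hypothetical readings: (speaker identity, bearing in degrees clockwise from north, distance in metres)
readings = [("speaker-1", 30.0, 2.0), ("speaker-2", 250.0, 3.5)]
for identity, bearing, dist in readings:
    east, north = speaker_marker(bearing, dist)
    print(f"{identity}: {east:+.2f} m east, {north:+.2f} m north of the wearer")
```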
Example two
Fig. 5 is a schematic flowchart of a sound source positioning method with high accuracy according to a second embodiment of the present invention, and as shown in fig. 5, after the step S103 is completed, the present invention may first perform steps S103-1 to S103-3, and then perform step S104.
S103-1, collecting image information of the sound source position where each voice signal is located.
S103-2, judging whether people exist at the corresponding sound source positions one by one, if so, executing the step S103-3, otherwise, returning to the step S101.
S103-3, determining that the voice signal corresponding to a sound source position where a person is present is an effective voice signal, and then executing step S104;
the voice signal described in step S104 is an effective voice signal.
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 6 is a schematic structural diagram of a sound source positioning device with high accuracy according to a second embodiment of the present invention, as shown in fig. 6, the sound source positioning device with high accuracy may further include:
image acquisition unit 103-1: the method comprises the steps of extracting all voice signals and acquiring sound source position information of each voice signal, and then acquiring image information of the sound source position of each voice signal.
Third judging unit 103-2: and the method is used for judging whether people exist at the corresponding sound source positions one by one.
Valid speech determination unit 103-3: and when a person exists at the corresponding sound source position, judging the voice signal corresponding to the sound source position where the person exists as an effective voice signal, and then carrying out voiceprint recognition on all the effective voice signals one by one.
In this embodiment, after it is determined that voice signals exist in the collected sound signals, the image collection unit 103-1 may collect the image information of the sound source position where each voice signal is located, and it is then judged whether a person is present at the sound source position; if a person is present, the voice signals are processed further. This eliminates interference from sound sources that do not come from a human body (such as the sound of a radio or a television) and improves accuracy and recognition efficiency.
EXAMPLE III
Fig. 7 is a schematic flowchart of a sound source localization method with high accuracy according to a third embodiment of the present invention, and as shown in fig. 7, the image information in step S106 may include: face feature information and lip movement feature information of the person.
After the step S106 is completed, the steps S106-1 to S106-4 are executed, and then the step S107 is executed:
S106-1, performing face recognition and lip movement recognition on the acquired image information, and determining the number of people currently speaking at the sound source position.
S106-2, judging whether only one person is currently speaking, and if so, executing step S107, otherwise executing step S106-3.
And S106-3, continuing to collect the sound signal at the sound source position.
S106-4, judging whether other voice signals exist at the sound source position, if so, returning to the step S104, otherwise, repeatedly executing the step S106-2.
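The wait-until-single-speaker logic of steps S106-1 to S106-4 can be sketched as the loop below. The count_active_speakers (face plus lip-movement recognition) and new_speech_detected callbacks and the max_rounds cap are assumptions made for illustration, not part of the patented method.

```python
def wait_for_single_speaker(position, count_active_speakers, new_speech_detected,
                            max_rounds=100):
    """S106-1 to S106-4: proceed to registration (S107) only once exactly one person at the
    source position is speaking; if new speech arrives meanwhile, go back to S104."""
    for _ in range(max_rounds):
        n = count_active_speakers(position)     # S106-1: face + lip-movement recognition
        if n == 1:
            return "register"                   # S106-2: safe to run S107
        if new_speech_detected(position):       # S106-3/S106-4: keep listening at this position
            return "re-identify"                # another voice signal -> back to S104
    return "timeout"

# Toy stand-ins: two people talk for two rounds, then only one keeps speaking.
counts = iter([2, 2, 1])
result = wait_for_single_speaker((1.0, 2.0),
                                 count_active_speakers=lambda pos: next(counts),
                                 new_speech_detected=lambda pos: False)
print(result)   # register
```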
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 8 is a schematic structural diagram of a sound source positioning device with high accuracy according to a third embodiment of the present invention, as shown in fig. 8, on the basis of the first embodiment, the image information in the image acquiring unit 106 may include: face feature information and lip movement feature information of the person; the sound source localization apparatus having high accuracy may include:
the uttered people counting unit 106-1: and the voice recognition system is used for acquiring image information of a sound source position where a voice signal corresponding to the voiceprint features is located when the currently recognized voiceprint features are not stored in the voiceprint database, then carrying out face recognition and lip movement recognition according to the acquired image information, and determining the number of the current voice-producing people at the sound source position.
Fourth judging unit 106-2: and the method is used for judging whether the number of the current voice speakers is only 1, if so, performing model training by using a machine self-learning method, determining the voice speakers corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the voice speakers in a voiceprint database.
Second sound pickup unit 106-3: and the method is used for continuously collecting the sound signals at the sound source position when the number of the current sounders is not more than 1.
Fifth judging unit 106-4: and the voice recognition module is used for judging whether other voice signals exist at the position of the sound source, if so, performing voiceprint recognition on each extracted voice signal one by one, and otherwise, repeatedly judging whether the number of the current uttered people is only 1.
In this embodiment, when the speaker's position and identity need to be collected on site, the image information obtained on site by the image acquisition unit 106 includes the speaker's face feature information and lip movement feature information, so face recognition and lip movement recognition can be performed on the acquired information to determine the number of people currently speaking at the sound source position. It is then judged whether only one person is currently speaking; if so, machine self-learning is performed directly; if several people are speaking, it is first judged whether another voice signal exists at the sound source position, and if so, voiceprint recognition is performed directly, otherwise the number of speakers is judged in a loop until only one person is speaking, and machine self-learning is then performed. In this way the identity and position of the current speaker can finally be uniquely determined, avoiding the interference that arises when several people speak at the same time and it is impossible to tell to whom the currently collected voiceprint belongs.
Example four
Fig. 9 is a schematic flowchart of a sound source positioning method with high accuracy according to a fourth embodiment of the present invention, as shown in fig. 9, on the basis of the first embodiment, the sound source positioning method with high accuracy may further include the following steps:
S109, in response to a speaker identity marking command, marking the identity information of the designated speaker and displaying it.
Accordingly, this embodiment also provides a sound source positioning device with high accuracy, and fig. 10 is a schematic structural diagram of a sound source positioning device with high accuracy according to a fourth embodiment of the present invention, as shown in fig. 10, on the basis of the first embodiment,
the sound source localization apparatus having high accuracy may further include:
identity marking unit 109: and the voice generator is used for responding to the voice generator identity marking command, marking the identity information of the specified voice generator and displaying the identity information.
After the position and the identity of the speaker are determined, the identity of the speaker can be marked according to actual requirements for convenience of identification.
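A small, hedged sketch of how such an identity marking command (S109) might be handled; the speaker_tags store and the function names are assumptions made purely for illustration.

```python
speaker_tags: dict = {}

def mark_speaker(identity: str, tag: str) -> None:
    """S109: respond to an identity marking command by attaching a tag to a speaker."""
    speaker_tags[identity] = tag

def display_name(identity: str) -> str:
    """Show the tag alongside the identity wherever the speaker is displayed."""
    tag = speaker_tags.get(identity)
    return f"{identity} [{tag}]" if tag else identity

mark_speaker("speaker-1", "chairperson")
print(display_name("speaker-1"))   # speaker-1 [chairperson]
print(display_name("speaker-2"))   # speaker-2
```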
Accordingly, the present invention also provides a storage device having stored therein a plurality of instructions adapted to be loaded by a processor and to perform a speech recognition method with high pertinence as described above.
The storage device may be a computer-readable storage medium, and may include: ROM, RAM, magnetic or optical disks, and the like.
Correspondingly, the invention also provides a terminal, which can comprise:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions adapted to be loaded by a processor and to perform a speech recognition method with high pertinence as described above.
The terminal can be any device capable of sound source localization and speech recognition, and can be any of various wearable terminal devices, such as glasses or a wristband, implemented in software and/or hardware. The processor can be a multi-core processor with a GPU or an NPU, such as a Qualcomm, Rockchip, or MediaTek processor.
In a specific implementation, the first sound collection unit 101 and the second sound collection unit 106-3 may be linear, circular, cardioid, spiral, or irregular arrays based on electret microphones, or linear, circular, cardioid, spiral, or irregular arrays based on MEMS microphones. The image acquisition unit 106 and the image collection unit 103-1 may be a monocular camera, a binocular camera, a multi-view camera, a depth camera, etc. The display unit 108 may be a display screen such as an OLED or LCD, and may adopt a prism module, a free-form surface, an optical waveguide, an LED lamp, or another display module.
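The hardware options listed above could be captured in a configuration object along the following lines; this is only a sketch, and the enum values simply mirror the options enumerated in this paragraph rather than any configuration mandated by the patent.

```python
from dataclasses import dataclass
from enum import Enum

class MicType(Enum):
    ELECTRET = "electret"
    MEMS = "MEMS"

class ArrayGeometry(Enum):
    LINEAR = "linear"
    CIRCULAR = "circular"
    CARDIOID = "cardioid"
    SPIRAL = "spiral"
    IRREGULAR = "irregular"

class CameraType(Enum):
    MONOCULAR = "monocular"
    BINOCULAR = "binocular"
    MULTI_VIEW = "multi-view"
    DEPTH = "depth"

class DisplayType(Enum):
    OLED = "OLED"
    LCD = "LCD"
    PRISM_MODULE = "prism module"
    FREEFORM_SURFACE = "free-form surface"
    OPTICAL_WAVEGUIDE = "optical waveguide"
    LED = "LED"

@dataclass
class DeviceConfig:
    mic_type: MicType
    mic_geometry: ArrayGeometry
    camera: CameraType
    display: DisplayType

# One possible wearable configuration; the specific combination is illustrative only.
config = DeviceConfig(MicType.MEMS, ArrayGeometry.CIRCULAR,
                      CameraType.DEPTH, DisplayType.OPTICAL_WAVEGUIDE)
print(config)
```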
Compared with traditional sound source localization, the invention has greater advantages in noisy environments where several people speak simultaneously or where the speaker is far from the listener, and its advantages are especially prominent for people with hearing impairments. Based on machine vision and voiceprint recognition technology, the invention combines the audio data and image data collected on site to perform sound source localization, voiceprint recognition, face recognition, and lip movement recognition, determines whether people are present at the corresponding sound source position and how many of them are speaking, and thereby accurately obtains the position information of multiple sound sources and the identity information of the speakers. The invention solves the problem that traditional sound source localization algorithms localize inaccurately and cannot pick out the desired human voice in application environments such as multi-person conferences and cocktail parties; by combining machine vision and voiceprint recognition, it achieves accurate sound source localization in complex application environments, and it has outstanding substantive features and represents notable progress.
In the description of the present invention, it is understood that the relevant features of the above method, apparatus and system are mutually referenced. In addition, "first", "second", and the like in the above-described embodiments are for distinguishing between the embodiments or for descriptive purposes and do not represent advantages or disadvantages of the embodiments, nor are they to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and other divisions may be realized in practice, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A sound source localization method with high accuracy, characterized in that the method comprises the following steps:
S101, collecting sound signals in the environment in real time;
S102, judging whether a voice signal exists in the collected sound signals; if so, executing step S103; otherwise, returning to step S101;
S103, extracting all voice signals and acquiring the sound source position information of each voice signal;
S104, performing voiceprint recognition on each extracted voice signal one by one;
S105, judging whether the currently recognized voiceprint features are stored in a voiceprint database; if so, executing step S108; otherwise, executing step S106; the voiceprint database stores multiple groups of voiceprint features and the identity information of the speaker uniquely corresponding to each group of voiceprint features;
S106, acquiring image information of the sound source position of the voice signal corresponding to the voiceprint features;
S107, performing model training by a machine self-learning method, determining the speaker corresponding to the voiceprint features and the speaker's identity information, and storing the corresponding voiceprint features and identity information in the voiceprint database;
S108, displaying the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker.
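For readers tracing claim 1, the following is a minimal, non-authoritative Python sketch of steps S101-S108. The helper callables (detect_speech, localize, embed_voiceprint, capture_image, enroll_speaker, display) are hypothetical stand-ins for the microphone array, voice activity detection, direction-of-arrival estimation, voiceprint model, camera, and display assumed by the claim; only the voiceprint-database lookup of step S105 is made concrete, using cosine similarity.

# Minimal sketch of claim 1 (steps S101-S108); the helper callables are
# hypothetical stand-ins, not interfaces defined by this application.
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional
import numpy as np

@dataclass
class VoiceprintDB:
    """One stored embedding per enrolled speaker (step S105)."""
    entries: Dict[str, np.ndarray] = field(default_factory=dict)
    threshold: float = 0.75  # assumed cosine-similarity acceptance threshold

    def match(self, embedding: np.ndarray) -> Optional[str]:
        best_id, best_score = None, self.threshold
        for speaker_id, ref in self.entries.items():
            score = float(np.dot(embedding, ref) /
                          (np.linalg.norm(embedding) * np.linalg.norm(ref) + 1e-9))
            if score > best_score:
                best_id, best_score = speaker_id, score
        return best_id

    def enroll(self, speaker_id: str, embedding: np.ndarray) -> None:
        self.entries[speaker_id] = embedding

def process_frame(frame, db: VoiceprintDB, detect_speech: Callable, localize: Callable,
                  embed_voiceprint: Callable, capture_image: Callable,
                  enroll_speaker: Callable, display: Callable) -> None:
    """One pass over an audio frame captured in step S101."""
    for segment in detect_speech(frame):            # S102/S103: detect and extract speech
        position = localize(segment)                # S103: sound source position
        embedding = embed_voiceprint(segment)       # S104: voiceprint features
        speaker_id = db.match(embedding)            # S105: database lookup
        if speaker_id is None:
            image = capture_image(position)         # S106: image at the source position
            speaker_id = enroll_speaker(image, embedding)  # S107: self-learning enrolment
            db.enroll(speaker_id, embedding)
        display(position, speaker_id)               # S108: show position and identity

The similarity threshold (0.75) and the cosine measure are placeholders; in practice both would be tuned to whatever voiceprint model the implementation actually uses.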
2. The sound source localization method with high accuracy according to claim 1, characterized in that after step S103 is completed, steps S103-1 to S103-3 are executed, and then step S104 is executed:
S103-1, collecting image information of the sound source position of each voice signal;
S103-2, judging, position by position, whether a person is present at the corresponding sound source position; if so, executing step S103-3; otherwise, returning to step S101;
S103-3, determining that the voice signal corresponding to a sound source position where a person is present is an effective voice signal, and then executing step S104;
the voice signal described in step S104 is an effective voice signal.
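Claim 2's validity filter (steps S103-1 to S103-3) can be pictured as a simple gate, sketched below under assumptions: capture_image and person_present stand in for the camera and a person/face detector, neither of which is specified by the claim.

# Illustrative sketch of claim 2: keep a voice segment only if a person is
# detected in the image captured at its estimated source position.
from typing import Callable, Iterable, List, Tuple

def filter_effective_speech(segments: Iterable[Tuple[bytes, tuple]],
                            capture_image: Callable[[tuple], object],
                            person_present: Callable[[object], bool]) -> List[Tuple[bytes, tuple]]:
    effective = []
    for audio, position in segments:
        image = capture_image(position)          # S103-1: image at the source position
        if person_present(image):                # S103-2: someone is actually there
            effective.append((audio, position))  # S103-3: keep as effective speech
    return effective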
3. The sound source localization method with high accuracy according to claim 1, characterized in that the image information in step S106 includes face feature information and lip movement feature information of the person;
after step S106 is executed, steps S106-1 to S106-4 are executed, and then step S107 is executed:
S106-1, performing face recognition and lip movement recognition on the acquired image information, and determining the current number of speakers at the sound source position;
S106-2, judging whether the current number of speakers is exactly one; if so, executing step S107; otherwise, executing step S106-3;
S106-3, continuing to collect the sound signals at the sound source position;
S106-4, judging whether another voice signal exists at the sound source position; if so, returning to step S104; otherwise, repeating step S106-2.
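Claim 3 gates enrolment on there being exactly one active speaker at the source position (steps S106-1 to S106-4). The sketch below is one plausible reading under stated assumptions: an upstream face and lip tracker supplies, per detected face, a short series of mouth-opening measurements, and a simple variance test stands in for lip movement recognition; neither the tracker interface nor the threshold value comes from this application.

# Illustrative speaker-count check for claim 3; the variance test is an
# assumed stand-in for lip movement recognition, not the patented method.
from typing import Dict, Sequence
import statistics

def count_active_speakers(mouth_openings: Dict[str, Sequence[float]],
                          movement_threshold: float = 4.0) -> int:
    """Count tracked faces whose lip movement suggests they are speaking."""
    active = 0
    for series in mouth_openings.values():
        if len(series) >= 2 and statistics.pvariance(series) > movement_threshold:
            active += 1
    return active

def ready_to_enroll(mouth_openings: Dict[str, Sequence[float]]) -> bool:
    """S106-2: proceed to enrolment only when exactly one person is speaking."""
    return count_active_speakers(mouth_openings) == 1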
4. A sound source localization apparatus with high accuracy, characterized in that the apparatus comprises:
a first sound collection unit (101), configured to collect sound signals in the environment in real time;
a first judgment unit (102), configured to judge whether a voice signal exists in the collected sound signals;
a speech extraction and localization unit (103), configured to, when a voice signal exists in the collected sound signals, extract all voice signals and acquire the sound source position information of each voice signal;
a voiceprint recognition unit (104), configured to perform voiceprint recognition on each extracted voice signal one by one;
a second judgment unit (105), configured to judge whether the currently recognized voiceprint features are stored in a voiceprint database; the voiceprint database stores multiple groups of voiceprint features and the identity information of the speaker uniquely corresponding to each group of voiceprint features;
an image acquisition unit (106), configured to, when the currently recognized voiceprint features are not stored in the voiceprint database, acquire image information of the sound source position of the voice signal corresponding to the voiceprint features;
a machine learning unit (107), configured to perform model training by a machine self-learning method, determine the speaker corresponding to the voiceprint features and the speaker's identity information, and store the corresponding voiceprint features and identity information in the voiceprint database;
a display unit (108), configured to, when the currently recognized voiceprint features are stored in the voiceprint database, display the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker.
5. The sound source localization apparatus with high accuracy according to claim 4, characterized in that the apparatus further comprises:
an image collection unit (103-1), configured to collect image information of the sound source position of each voice signal after all voice signals have been extracted and the sound source position information of each voice signal has been acquired;
a third judgment unit (103-2), configured to judge, position by position, whether a person is present at the corresponding sound source position;
an effective speech determination unit (103-3), configured to, when a person is present at the corresponding sound source position, determine that the voice signal corresponding to that sound source position is an effective voice signal, so that voiceprint recognition is then performed on all effective voice signals one by one.
6. The sound source localization apparatus with high accuracy according to claim 4, characterized in that the image information in the image acquisition unit (106) includes face feature information and lip movement feature information of the person; the apparatus further comprises:
a speaker counting unit (106-1), configured to, when the currently recognized voiceprint features are not stored in the voiceprint database and the image information of the sound source position of the corresponding voice signal has been acquired, perform face recognition and lip movement recognition on the acquired image information and determine the current number of speakers at the sound source position;
a fourth judgment unit (106-2), configured to judge whether the current number of speakers is exactly one; if so, model training is performed by the machine self-learning method, the speaker corresponding to the voiceprint features and the speaker's identity information are determined, and the corresponding voiceprint features and identity information are stored in the voiceprint database;
a second sound collection unit (106-3), configured to continue collecting the sound signals at the sound source position when the current number of speakers is not exactly one;
a fifth judgment unit (106-4), configured to judge whether another voice signal exists at the sound source position; if so, voiceprint recognition is performed on each extracted voice signal one by one; otherwise, whether the current number of speakers is exactly one is judged again.
7. A speech recognition method with high pertinence, characterized in that the method comprises the following steps:
determining the sound source position information of a voice signal and the identity information of the corresponding speaker by using a sound source positioning method;
in response to a voice recognition command, converting the voice content corresponding to the designated speaker into text content and displaying the text content;
wherein the sound source positioning method is the sound source localization method with high accuracy according to any one of claims 1 to 3.
8. A speech recognition system with high pertinence, characterized in that the system comprises:
a sound source localization device (10), configured to determine the sound source position information of a voice signal and the identity information of the corresponding speaker by using a sound source positioning method;
a speech conversion module (20), configured to, in response to a voice recognition command, convert the voice content corresponding to the designated speaker into text content and display the text content;
wherein the sound source localization device (10) is the sound source localization apparatus with high accuracy according to any one of claims 4 to 6.
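As a usage illustration of claims 7 and 8, the sketch below shows the targeting step in isolation: the localization stage is assumed to supply (position, speaker identity, audio) tuples, and only utterances attributed to the designated speaker are transcribed and shown. transcribe and display are hypothetical speech-to-text and UI callables, not interfaces defined by this application.

# Illustrative targeted recognition for claims 7-8: transcribe only the
# utterances attributed to the designated speaker.
from typing import Callable, Iterable, Tuple

def transcribe_designated_speaker(utterances: Iterable[Tuple[tuple, str, bytes]],
                                  designated_id: str,
                                  transcribe: Callable[[bytes], str],
                                  display: Callable[[str], None]) -> None:
    for position, speaker_id, audio in utterances:
        if speaker_id == designated_id:          # respond only to the chosen speaker
            display(f"{speaker_id} @ {position}: {transcribe(audio)}")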
9. A storage device having a plurality of instructions stored therein, characterized in that the instructions are adapted to be loaded by a processor and to perform the sound source localization method with high accuracy according to any one of claims 1 to 3.
10. A terminal, characterized in that the terminal comprises:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the sound source localization method with high accuracy according to any one of claims 1 to 3.
CN201911048283.6A 2019-10-30 2019-10-30 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal Active CN110767226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911048283.6A CN110767226B (en) 2019-10-30 2019-10-30 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal


Publications (2)

Publication Number Publication Date
CN110767226A (en) 2020-02-07
CN110767226B (en) 2022-08-16

Family

ID=69334510


Country Status (1)

Country Link
CN (1) CN110767226B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124209A1 (en) * 2011-11-11 2013-05-16 Sony Corporation Information processing apparatus, information processing method, and program
CN102968991A (en) * 2012-11-29 2013-03-13 华为技术有限公司 Method, device and system for sorting voice conference minutes
CN104269172A (en) * 2014-07-31 2015-01-07 广东美的制冷设备有限公司 Voice control method and system based on video positioning
WO2018210219A1 (en) * 2017-05-18 2018-11-22 刘国华 Device-facing human-computer interaction method and system
CN108305615A (en) * 2017-10-23 2018-07-20 腾讯科技(深圳)有限公司 A kind of object identifying method and its equipment, storage medium, terminal
CN108540660A (en) * 2018-03-30 2018-09-14 广东欧珀移动通信有限公司 Audio signal processing method and device, readable storage medium storing program for executing, terminal
CN108920640A (en) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 Context acquisition methods and equipment based on interactive voice
CN109754811A (en) * 2018-12-10 2019-05-14 平安科技(深圳)有限公司 Sound-source follow-up method, apparatus, equipment and storage medium based on biological characteristic
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110196914A (en) * 2019-07-29 2019-09-03 上海肇观电子科技有限公司 A kind of method and apparatus by face information input database

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476126A (en) * 2020-03-27 2020-07-31 海信集团有限公司 Indoor positioning method and system and intelligent equipment
CN111476126B (en) * 2020-03-27 2024-02-23 海信集团有限公司 Indoor positioning method, system and intelligent device
CN111816174A (en) * 2020-06-24 2020-10-23 北京小米松果电子有限公司 Speech recognition method, device and computer readable storage medium
CN111836062A (en) * 2020-06-30 2020-10-27 北京小米松果电子有限公司 Video playing method and device and computer readable storage medium
CN111787609A (en) * 2020-07-09 2020-10-16 北京中超伟业信息安全技术股份有限公司 Personnel positioning system and method based on human body voiceprint characteristics and microphone base station
CN111895991A (en) * 2020-08-03 2020-11-06 杭州十域科技有限公司 Indoor positioning navigation method combined with voice recognition
CN114281182A (en) * 2020-09-17 2022-04-05 华为技术有限公司 Man-machine interaction method, device and system
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112584225A (en) * 2020-12-03 2021-03-30 维沃移动通信有限公司 Video recording processing method, video playing control method and electronic equipment
US11830154B2 (en) 2020-12-25 2023-11-28 Beijing Boe Optoelectronics Technology Co., Ltd. AR-based information displaying method and device, AR apparatus, electronic device and medium
CN112738499A (en) * 2020-12-25 2021-04-30 京东方科技集团股份有限公司 Information display method and device based on AR, AR equipment, electronic equipment and medium
WO2022142610A1 (en) * 2020-12-28 2022-07-07 深圳壹账通智能科技有限公司 Speech recording method and apparatus, computer device, and readable storage medium
CN113406567A (en) * 2021-06-25 2021-09-17 安徽淘云科技股份有限公司 Sound source positioning method, device, equipment and storage medium
CN113406567B (en) * 2021-06-25 2024-05-14 安徽淘云科技股份有限公司 Sound source positioning method, device, equipment and storage medium
CN115240698A (en) * 2021-06-30 2022-10-25 达闼机器人股份有限公司 Model training method, voice detection positioning method, electronic device and storage medium
CN113593572A (en) * 2021-08-03 2021-11-02 深圳地平线机器人科技有限公司 Method and apparatus for performing sound zone localization in spatial region, device and medium
CN113576527A (en) * 2021-08-27 2021-11-02 复旦大学 Method for judging ultrasonic input by using voice control
CN113611308A (en) * 2021-09-08 2021-11-05 杭州海康威视数字技术股份有限公司 Voice recognition method, device, system, server and storage medium
CN113611308B (en) * 2021-09-08 2024-05-07 杭州海康威视数字技术股份有限公司 Voice recognition method, device, system, server and storage medium
CN115101078A (en) * 2022-06-23 2022-09-23 浙江吉利控股集团有限公司 Voiceprint capturing and displaying system, vehicle with voiceprint capturing and displaying system, control method and storage medium
CN114863943B (en) * 2022-07-04 2022-11-04 杭州兆华电子股份有限公司 Self-adaptive positioning method and device for environmental noise source based on beam forming
CN114863943A (en) * 2022-07-04 2022-08-05 杭州兆华电子股份有限公司 Self-adaptive positioning method and device for environmental noise source based on beam forming
CN116384879B (en) * 2023-04-07 2023-11-21 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment
CN116384879A (en) * 2023-04-07 2023-07-04 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment
CN116299179B (en) * 2023-05-22 2023-09-12 北京边锋信息技术有限公司 Sound source positioning method, sound source positioning device and readable storage medium
CN116299179A (en) * 2023-05-22 2023-06-23 北京边锋信息技术有限公司 Sound source positioning method, sound source positioning device and readable storage medium

Also Published As

Publication number Publication date
CN110767226B (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN110767226B (en) Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal
US11601775B2 (en) Method for generating a customized/personalized head related transfer function
EP2847763B1 (en) Audio user interaction recognition and context refinement
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
CN110875060A (en) Voice signal processing method, device, system, equipment and storage medium
CN102447697B (en) Method and system of semi-private communication in open environments
CN106603878A (en) Voice positioning method, device and system
CN110111808B (en) Audio signal processing method and related product
CN111429939B (en) Sound signal separation method of double sound sources and pickup
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
JP5123595B2 (en) Near-field sound source separation program, computer-readable recording medium recording this program, and near-field sound source separation method
CN113099031B (en) Sound recording method and related equipment
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN110443371A (en) A kind of artificial intelligence device and method
Hao et al. Spectral flux-based convolutional neural network architecture for speech source localization and its real-time implementation
CN111048067A (en) Microphone response method and device
CN112423191A (en) Video call device and audio gain method
Cho et al. Sound source localization for robot auditory systems
CN113744731B (en) Multi-modal voice recognition method, system and computer readable storage medium
CN108038291B (en) Personalized head-related transfer function generation system and method based on human body parameter adaptation algorithm
CN112328676A (en) Method for estimating personalized head-related transfer function and related equipment
CN118541734A (en) Mapping of environmental audio responses on mixed reality devices
CN111492668B (en) Method and system for locating the origin of an audio signal within a defined space
CN116266874A (en) Method and communication system for cooperatively playing audio in video playing
Lee et al. DNN-Based Feature Enhancement Using Joint Training Framework for Robust Multichannel Speech Recognition.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant