CN110646763A - Sound source positioning method and device based on semantics and storage medium - Google Patents


Info

Publication number
CN110646763A
CN110646763A (application CN201910957856.0A)
Authority
CN
China
Prior art keywords
target
results
semantic
audio signal
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910957856.0A
Other languages
Chinese (zh)
Inventor
刘立杰
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Chumen Wenwen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chumen Wenwen Information Technology Co Ltd filed Critical Chumen Wenwen Information Technology Co Ltd
Priority to CN201910957856.0A priority Critical patent/CN110646763A/en
Publication of CN110646763A publication Critical patent/CN110646763A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G01 — MEASURING; TESTING
    • G01S — RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00 — Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18 — Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S 5/24 — Position of single direction-finder fixed by determining direction of a plurality of spaced sources of known location
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/33 — Querying
    • G06F 16/3331 — Query processing
    • G06F 16/334 — Query execution
    • G06F 16/3343 — Query execution using phonetics
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/33 — Querying
    • G06F 16/3331 — Query processing
    • G06F 16/334 — Query execution
    • G06F 16/3344 — Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)

Abstract

The invention discloses a semantic-based sound source localization method, apparatus, and storage medium. The method comprises the following steps: first, audio signals in N directions are enhanced using a beamforming technique to obtain N corresponding enhanced audio results, where N is a positive integer greater than 1; next, the N enhanced audio results are each compared with the target semantics to obtain N corresponding matching degree values; the audio signal with the highest of the N matching degree values is then selected as the target audio signal; finally, the direction in which the target audio signal is located is determined as the localization direction. Thus, in the embodiments of the invention, on the basis of direction-of-arrival (DOA) estimation, beamforming is used to obtain direction-enhanced audio signals and semantic relevance is taken into account, so that the target sound source can be located among multiple sound sources with similar audio characteristics, the influence of noise is removed, and resistance to interference is greatly improved.

Description

Sound source positioning method and device based on semantics and storage medium
Technical Field
The present invention relates to the technical field of Artificial Intelligence (AI), and in particular, to a semantic-based sound source localization method, apparatus, and computer storage medium.
Background
Currently, sound source localization methods generally utilize the following three techniques: controllable beamforming techniques based on maximum output power, high resolution spectral estimation techniques, and localization techniques based on time difference of arrival.
Direction of Arrival (DOA) estimation, a localization technique based on the time difference of arrival of sound, is widely applied. It solves for the direction of a sound source mainly from the time differences with which sound waves arrive at the individual microphones. The method works well in the following scenarios: 1) the source is a far-field narrowband signal; 2) the number of sources is smaller than the number of array elements; 3) the channel noises are additive, mutually independent, and independent of the signal. The technique has therefore mainly been applied to radar passive localization, sonar array direction finding, electronic or communication interference reconnaissance, and mobile communications. In recent years, with the development and application of intelligent voice systems, DOA has also been applied to localizing voice sound sources through a microphone array.
However, the present inventors found that when DOA technology is applied to sound source localization in an intelligent speech system, the following problems exist: 1) when multiple sound sources with similar audio characteristics are present in the sound collection area, the target sound source cannot be accurately identified and localized; 2) when loud non-voice noise is present in non-target directions in the sound collection area, the target sound source is difficult to localize accurately. These problems are particularly prominent in noisy public environments.
Disclosure of Invention
To solve the above problems, embodiments of the present invention provide a semantic-based sound source localization method and apparatus, and a computer storage medium.
According to a first aspect of the embodiments of the present invention, there is provided a semantic-based sound source localization method, including: enhancing the audio signals in the N directions by using a beam forming technology to respectively obtain corresponding N enhanced audio results, wherein the value of N is a positive integer greater than 1; respectively comparing the N enhanced audio results with the target semantics to obtain corresponding N matching degree values; selecting the audio signal with the highest matching degree value from the N matching degree values as a target audio signal; the direction in which the target audio signal is located is determined as the localization direction.
According to an embodiment of the present invention, the value of N is 6 or more.
According to an embodiment of the present invention, the N directions are obtained by dividing the full 360-degree plane into N directions at equal angular intervals.
According to an embodiment of the present invention, enhancing the audio signals in the N directions using a beamforming technique includes: enhancing the audio signals in the N directions with a multi-channel speech enhancement algorithm using microphone array beamforming.
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: comparing the N enhanced audio results with the target keyword using keyword spotting (KWS) technology to obtain N corresponding confidence values.
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: matching the N enhanced audio results with the target text using speech recognition technology to obtain N corresponding matching results.
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: comparing the N enhanced audio results with the target semantics using speech recognition and natural language understanding technology to obtain N corresponding semantic similarities.
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics using speech recognition and natural language understanding technology to obtain N corresponding semantic similarities includes: performing the comparison through a neural network model.
According to a second aspect of the embodiments of the present invention, there is also provided a semantic-based sound source localization apparatus, including: the audio signal enhancement module is used for enhancing the audio signals in the N directions by utilizing a beam forming technology to respectively obtain corresponding N enhanced audio results, wherein the value of N is a positive integer greater than 1; the semantic comparison module is used for comparing the N enhanced audio results with the target semantic respectively to obtain corresponding N matching degree values; the target audio signal selection module is used for selecting the audio signal with the highest matching degree value from the N matching degree values as a target audio signal; and the positioning direction determining module is used for determining the direction of the target audio signal as the positioning direction.
According to an embodiment of the present invention, the audio signal enhancement module is specifically configured to enhance the audio signals in the N directions by a multi-channel speech enhancement algorithm using a microphone array beamforming technology.
According to an embodiment of the present invention, the semantic comparison module is specifically configured to compare the N enhanced audio results with the target keyword using a keyword detection technique to obtain N corresponding confidence values.
According to an embodiment of the present invention, the semantic comparison module is specifically configured to match the N enhanced audio results with the target text, respectively, by using a speech recognition technology, so as to obtain N corresponding matching results.
According to an embodiment of the present invention, the semantic comparison module is specifically configured to compare the N enhanced audio results with the target semantic to obtain N corresponding semantic similarities, respectively, by using a speech recognition and natural language understanding technique.
According to an embodiment of the present invention, the semantic comparison module is specifically configured to compare the N enhanced audio results with the target semantic by using a speech recognition and natural language understanding technology and using a neural network model, respectively, to obtain corresponding N semantic similarities.
According to a third aspect of embodiments of the present invention, there is provided a computer storage medium comprising a set of computer-executable instructions for performing any of the above sound source localization methods when the instructions are executed.
First, audio signals in N directions are enhanced using a beamforming technique to obtain N corresponding enhanced audio results, where N is a positive integer greater than 1; next, the N enhanced audio results are each compared with the target semantics to obtain N corresponding matching degree values; the audio signal with the highest of the N matching degree values is then selected as the target audio signal; finally, the direction in which the target audio signal is located is determined as the localization direction. Thus, in the embodiments of the present invention, beamforming is used on the basis of DOA to obtain direction-enhanced audio signals, with semantic relevance taken into account. After the audio signals are filtered, denoised, and enhanced, speech recognition with semantic analysis allows the target sound source to be located among multiple sound sources with similar audio features, the influence of noise is removed, and resistance to interference is greatly improved; the method is particularly suitable for sound source localization scenarios with strong semantic relevance.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a flow chart illustrating an implementation of a semantic-based sound source localization method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating a component structure of a semantic-based sound source localization apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Fig. 1 shows a schematic flow chart of implementing the semantic-based sound source localization method according to the embodiment of the present invention. Referring to fig. 1, a semantic-based sound source localization method according to an embodiment of the present invention includes: operation 110, enhancing the audio signals in the N directions by using a beamforming technique to obtain N corresponding enhanced audio results, respectively, where a value of N is a positive integer greater than 1; operation 120, comparing the N enhanced audio results with the target semantics respectively to obtain N corresponding matching degree values; operation 130, selecting an audio signal with the highest matching degree value from the N matching degree values as a target audio signal; in operation 140, a direction in which the target audio signal is located is determined as a localization direction.
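Operations 110–140 above can be sketched as a single pipeline. This is a minimal illustration, not the patent's implementation: `enhance` and `match_score` are hypothetical stand-ins for the beamformer (operation 110) and the semantic-comparison module (operation 120), which the patent leaves as pluggable techniques.

```python
def localize_by_semantics(raw_channels, target_semantics, enhance, match_score, n_directions=6):
    """Sketch of operations 110-140 of the method shown in Fig. 1.

    `enhance(raw_channels, angle)` and `match_score(enhanced, target)` are
    hypothetical callables standing in for the beamformer and the semantic
    comparison module respectively.
    """
    # Operation 110: beamform toward N equally spaced directions.
    angles = [i * 360.0 / n_directions for i in range(n_directions)]
    enhanced = [enhance(raw_channels, a) for a in angles]
    # Operation 120: score each enhanced result against the target semantics.
    scores = [match_score(e, target_semantics) for e in enhanced]
    # Operation 130: the beam with the highest matching degree is the target.
    best = scores.index(max(scores))
    # Operation 140: that beam's steering angle is the localization direction.
    return angles[best]
```

With N = 4 and a scorer that favors one beam, the function returns that beam's angle.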
In operation 110, the beamforming technique is essentially a spatial filtering technique: at a given moment, a Field Programmable Gate Array (FPGA) samples N array elements simultaneously to obtain the audio signal of each channel in the N directions at the same time. The array elements here are generally those of a microphone array; the number of elements, i.e. the number of microphones, can range from two to thousands. Due to cost constraints, consumer microphone arrays rarely exceed 12 elements, so the most common type on the market is the 6-microphone array. Enhancing the audio signal includes amplifying it with an amplifier and then denoising and dereverberating it, so that the audio is clearer and easier for subsequent speech recognition.
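The simplest form of the spatial filtering described above is a delay-and-sum beamformer: channels are time-aligned for a plane wave arriving from the steering direction, so sound from that direction adds coherently while sound from other directions partially cancels. The sketch below is an assumption-laden illustration (far-field plane wave, integer-sample delays), not the patent's algorithm.

```python
import numpy as np

def delay_and_sum(channels, mic_positions, steer_deg, fs, c=343.0):
    """Crude delay-and-sum beamformer (illustrative only).

    channels:      (n_mics, n_samples) array of synchronized recordings
    mic_positions: (n_mics, 2) microphone coordinates in metres
    steer_deg:     steering direction on the horizontal plane
    fs:            sample rate in Hz; c: speed of sound in m/s
    """
    u = np.array([np.cos(np.radians(steer_deg)), np.sin(np.radians(steer_deg))])
    # A mic further along u hears the plane wavefront earlier; its signal
    # must be delayed to realign with the others.
    advance = mic_positions @ u / c * fs      # per-mic advance, in samples
    shift = advance - advance.min()           # make all shifts non-negative
    out = np.zeros(channels.shape[1])
    for ch, s in zip(channels, shift):
        out += np.roll(ch, int(round(s)))     # integer-sample alignment
    return out / len(channels)
```

Running this once per steering direction yields the N enhanced audio results of operation 110.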
In operation 120, the process of comparing the N enhanced audio results with the target semantics respectively, including the speech recognition and semantic analysis, may be implemented by a speech recognition module and a natural language understanding module in the intelligent speech dialog system.
If there is no sound in the direction of a given enhanced audio result, or the sound is particularly noisy, speech recognition in this operation returns no semantics, which does not affect subsequent processing.
The target semantics here are generally predefined target semantics according to an application scenario, that is, semantics that are closest to or related to the content of the desired interlocutor inquiry or response. In the intelligent voice conversation system, the target semantics can also be the target semantics given by the context prediction result of the natural language understanding module of the intelligent voice conversation system according to the current round of conversation and the previous rounds of conversation.
The matching degree value here is usually a real number between 0 and 1: the closer to 1, the higher the degree of matching; the closer to 0, the lower.
It is easy to see that the semantic comparison of this operation is an important decision basis for selecting a target sound source, so the embodiment of the present invention is also particularly suitable for application scenarios with strong semantic relevance, such as an intelligent voice conversation system, a conversation monitoring system, and the like.
In operation 130, the audio signal whose semantics are closest to the target semantics is selected as the target audio signal, so other audio signals unrelated to the target semantics can be filtered out even when their audio features resemble those of the target. This solves both problems identified above: the inability to identify and localize the target sound source when multiple sources with similar audio characteristics are present in the collection area, and the difficulty of accurate localization when loud non-voice noise is present in non-target directions.
In operation 140, the direction in which the target audio signal is located is the result returned by the embodiment of the present invention. The localization direction may be determined by calculating the Time Difference of Arrival (TDOA) of the target audio signal and substituting it into a direction angle formula, yielding the incidence angle of the target audio signal's sound source. This angle is referred to here as the localization direction. In some application scenarios, once the direction of the target audio signal is determined, the speech signal may be continuously received from that direction based on DOA. In other scenarios, the speaker or sound-generating device may be identified from the localization direction, and subsequent tasks directed at that speaker or device may then be performed.
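The patent does not spell out its direction angle formula; a commonly used far-field relation for a single pair of microphones is θ = arcsin(c·τ/d), where τ is the TDOA and d the microphone spacing. The following is a hedged sketch of that textbook formula, not the patent's specific computation.

```python
import math

def doa_angle(tdoa_s, mic_spacing_m, c=343.0):
    """Far-field incidence angle (degrees) for a two-mic pair.

    A hypothetical stand-in for the unspecified 'direction angle formula':
    theta = arcsin(c * tau / d), with the argument clamped to [-1, 1] to
    absorb measurement noise.
    """
    x = max(-1.0, min(1.0, c * tdoa_s / mic_spacing_m))
    return math.degrees(math.asin(x))
```

A zero TDOA gives broadside incidence (0°); a TDOA equal to the full propagation time across the pair gives endfire incidence (90°).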
According to an embodiment of the present invention, the value of N is 6 or more. The larger N is, the more sound sources can be handled and the higher the localization accuracy and precision. However, N cannot be too large: an excessive N increases computational complexity, consumes more resources and time, and reduces localization efficiency. N also depends on what the hardware configuration supports, and larger values raise hardware cost; a value between 6 and 12 is suggested. Of course, as technology and computing power continue to advance, the N that can be processed effectively will keep growing.
According to an embodiment of the present invention, the N directions are obtained by dividing the full 360-degree plane into N directions at equal angular intervals. For example, if the beamforming target direction is set every 30° on the plane, then N is 12.
According to an embodiment of the present invention, enhancing the audio signals in the N directions using a beamforming technique includes: enhancing the audio signals in the N directions with a multi-channel speech enhancement algorithm using microphone array beamforming.
In a common intelligent voice dialogue system, such as a smart speaker, audio signals in N directions can be collected by a built-in microphone array and enhanced with a multi-channel speech enhancement algorithm. Such an algorithm takes the position of the sound source into account, implements spatial filtering, and suppresses directional noise well. The microphone array beamforming here may use any algorithm suitable for the application environment: for example, a fixed beamforming algorithm in a relatively stable noise environment, or an adaptive beamforming algorithm in a changeable one.
Microphone arrays are generally linear, circular, or spherical; circular and spherical arrays work better in the embodiments of the present invention.
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: comparing the N enhanced audio results with the target keyword using keyword spotting to obtain N corresponding confidence values. The N enhanced audio results mainly refer to the denoised user voice, and keyword spotting detects whether the user's voice contains a given target keyword. Target keywords are often tied to a specific task, and detecting them quickly locates the target sound source needed to execute that task. Keyword spotting is typically trained on a large amount of speech data to obtain a prediction model; the confidence value is the prediction obtained by feeding the user's voice into that model. It is a real number between 0 and 1, with values closer to 1 indicating a greater likelihood that the target keyword is present.
This embodiment suits scenarios in which the target sound source can be located by keyword recognition. For example, in an intelligent voice dialogue system, a wake-up keyword such as "hello, question" can identify the direction of the target sound source starting a conversation, and the sound source in that direction can then be tracked for subsequent dialogue.
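Given per-beam KWS confidence values, the selection logic in this embodiment reduces to an argmax with a threshold. The sketch below assumes the confidences have already been produced by some trained keyword-spotting model; the function name and the 0.5 threshold are illustrative choices, not from the patent.

```python
def locate_by_keyword(confidences, threshold=0.5):
    """Pick the beam with the highest KWS confidence, if any exceeds threshold.

    confidences: list of per-direction confidence values in [0, 1], assumed
    to come from a keyword-spotting prediction model. Returns the index of
    the winning direction, or None if no beam clears the threshold.
    """
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return best if confidences[best] >= threshold else None
```

Returning None covers the case where the wake-up keyword was not heard in any direction, so no localization is attempted.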
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: matching the N enhanced audio results with the target text using speech recognition to obtain N corresponding matching results. In this embodiment, the N enhanced audio results are typically converted to N pieces of text using speech recognition, and the target text is then searched for in each piece. The target text may be a word, a sentence, or even a passage. Matching here usually means strict text matching: identical text must be found, with a match scoring 1 and a non-match scoring 0. The target text is specified in advance according to the application scenario and purpose.
This embodiment is better suited to scenarios that require locating a target sound source exactly matching the target text. For example, in a knowledge competition, designating the preset standard answer as the target text quickly locates the target sound sources whose answers match the standard answer, identifying the contestants who score.
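The strict 1/0 matching described above can be written in a few lines once the N enhanced audio results have been transcribed. The transcripts are assumed to come from some speech recognizer; substring search is used here as one plausible reading of "finding the target text in each piece of text".

```python
def match_exact(transcripts, target_text):
    """Strict text matching: 1 if the target text is found in a transcript, else 0.

    transcripts: list of N recognized-text strings (one per beam direction,
    assumed output of a speech recognition step).
    """
    return [1 if target_text in t else 0 for t in transcripts]
```

The beam scoring 1 (or the first such beam, if several tie) then gives the localization direction.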
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values includes: comparing the N enhanced audio results with the target semantics using speech recognition and natural language understanding to obtain N corresponding semantic similarities. Here, natural language understanding and semantic analysis are added on top of speech recognition of the N enhanced audio results. Even if the user uses different words or sentences, as long as the intended meaning matches the target semantics, the prediction result of the natural language understanding module can localize accurately. Natural language understanding here, including context understanding, can capture the user's intent more precisely.
This embodiment suits scenarios with strong semantic relevance where the target sound source is located through semantics. For example, in an intelligent voice dialogue system, the expected question or answer generated through natural language understanding serves as the target semantics, and the speaker's sound source direction can be located accordingly.
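One common way to realize the semantic similarity above is cosine similarity between sentence embeddings. The patent does not prescribe this measure; the sketch below assumes each beam's recognized text and the target semantics have already been mapped to vectors by some (unspecified) NLU embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_semantics(beam_embeddings, target_embedding):
    """Semantic similarity of each beam's recognized text to the target semantics.

    beam_embeddings: N vectors (assumed output of an NLU embedding model,
    one per beam direction); returns N similarity scores in [-1, 1].
    """
    return [cosine_similarity(e, target_embedding) for e in beam_embeddings]
```

The direction whose score is highest is then selected as the target audio signal in operation 130.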
According to an embodiment of the present invention, comparing the N enhanced audio results with the target semantics using speech recognition and natural language understanding to obtain N corresponding semantic similarities includes: performing the comparison through a neural network model. The neural network model here may be any suitable model, such as a convolutional neural network.
Further, based on the above-described semantic-based sound source localization method, an embodiment of the present invention further provides a semantic-based sound source localization apparatus. As shown in fig. 2, the apparatus 20 includes: the audio signal enhancement module 201 is configured to enhance audio signals in N directions by using a beamforming technology, and obtain N corresponding enhanced audio results, respectively, where a value of N is a positive integer greater than 1; a semantic comparison module 202, configured to compare the N enhanced audio results with the target semantic, respectively, to obtain N corresponding matching degree values; a target audio signal selecting module 203, configured to select an audio signal with the highest matching degree value from the N matching degree values as a target audio signal; and a positioning direction determining module 204, configured to determine a direction in which the target audio signal is located as a positioning direction.
According to an embodiment of the present invention, the audio signal enhancement module 201 is specifically configured to enhance the audio signals in N directions by a multi-channel speech enhancement algorithm using a microphone array beamforming technology.
According to an embodiment of the present invention, the semantic comparison module 202 is specifically configured to compare the N enhanced audio results with the target keyword by using a keyword detection technique to obtain N corresponding confidence values.
According to an embodiment of the present invention, the semantic comparison module 202 is specifically configured to match the N enhanced audio results with the target text respectively by using a speech recognition technology, so as to obtain N corresponding matching results.
According to an embodiment of the present invention, the semantic comparison module 202 is specifically configured to compare the N enhanced audio results with the target semantic to obtain N corresponding semantic similarities, respectively, by using a speech recognition and natural language understanding technology.
According to an embodiment of the present invention, the semantic comparison module 202 is specifically configured to compare the N enhanced audio results with the target semantic by using a speech recognition and natural language understanding technology and through a neural network model, respectively, to obtain corresponding N semantic similarities.
Likewise, based on the semantic-based sound source localization method described above, an embodiment of the present invention further provides a computer storage medium storing a program that, when executed by a processor, causes the processor to perform at least the following operations: operation 110, enhancing the audio signals in N directions using a beamforming technology to obtain N corresponding enhanced audio results, where N is a positive integer greater than 1; operation 120, comparing each of the N enhanced audio results with the target semantics to obtain N corresponding matching degree values; operation 130, selecting, from the N matching degree values, the audio signal with the highest matching degree value as the target audio signal; and operation 140, determining the direction of the target audio signal as the localization direction.
It should be noted here that the above descriptions of the apparatus embodiment and the computer storage medium embodiment are similar to the description of the method embodiment shown in fig. 1 and achieve similar beneficial effects, so they are not repeated. For technical details not disclosed in the descriptions of the semantic-based sound source localization apparatus and the computer storage medium, refer to the description of the method embodiment shown in fig. 1; for brevity, they are not described again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a logical functional division, and other divisions are possible in actual implementation, such as: multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions controlling relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage medium, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a removable storage medium, a ROM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and such changes or substitutions shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for semantic-based sound source localization, the method comprising:
enhancing the audio signals in the N directions by using a beam forming technology to respectively obtain corresponding N enhanced audio results, wherein the value of N is a positive integer greater than 1;
respectively comparing the N enhanced audio results with target semantics to obtain corresponding N matching degree values;
selecting the audio signal with the highest matching degree value from the N matching degree values as a target audio signal;
and determining the direction of the target audio signal as a positioning direction.
2. The method of claim 1, wherein the value of N is greater than or equal to 6.
3. The method of claim 1, wherein the N directions comprise: N directions obtained by dividing the 360 degrees of the omnidirectional plane at equal intervals.
4. The method of claim 1, wherein the enhancing the audio signals in the N directions by using the beamforming technique comprises:
and enhancing the audio signals in the N directions by a multi-channel speech enhancement algorithm by utilizing a microphone array beam forming technology.
5. The method of claim 1, wherein the comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values comprises:
and comparing the N enhanced audio results with the target keywords respectively by utilizing a keyword detection technology to obtain N corresponding confidence values.
6. The method of claim 1, wherein the comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values comprises:
and matching the N enhanced audio results with the target text respectively by utilizing a voice recognition technology to obtain corresponding N matching results.
7. The method of claim 1, wherein the comparing the N enhanced audio results with the target semantics to obtain N corresponding matching degree values comprises:
and respectively comparing the N enhanced audio results with the target semantics by utilizing a voice recognition and natural language understanding technology to obtain corresponding N semantic similarities.
8. The method of claim 7, wherein comparing the N enhanced audio results with the target semantics using speech recognition and natural language understanding techniques to obtain corresponding N semantic similarities comprises:
and respectively comparing the N enhanced audio results with the target semantics through a neural network model by utilizing a voice recognition and natural language understanding technology to obtain corresponding N semantic similarities.
9. A semantic-based sound source localization apparatus, the apparatus comprising:
the audio signal enhancement module is used for enhancing the audio signals in the N directions by utilizing a beam forming technology to respectively obtain corresponding N enhanced audio results, wherein the value of N is a positive integer greater than 1;
the semantic comparison module is used for comparing the N enhanced audio results with target semantics respectively to obtain corresponding N matching degree values;
the target audio signal selection module is used for selecting the audio signal with the highest matching degree value from the N matching degree values as a target audio signal;
and the positioning direction determining module is used for determining the direction of the target audio signal as the positioning direction.
10. A computer storage medium comprising a set of computer executable instructions for performing the method of any one of claims 1 to 8 when executed.
CN201910957856.0A 2019-10-10 2019-10-10 Sound source positioning method and device based on semantics and storage medium Pending CN110646763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957856.0A CN110646763A (en) 2019-10-10 2019-10-10 Sound source positioning method and device based on semantics and storage medium


Publications (1)

Publication Number Publication Date
CN110646763A true CN110646763A (en) 2020-01-03

Family

ID=69012509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957856.0A Pending CN110646763A (en) 2019-10-10 2019-10-10 Sound source positioning method and device based on semantics and storage medium

Country Status (1)

Country Link
CN (1) CN110646763A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053368A (en) * 2021-03-09 2021-06-29 锐迪科微电子(上海)有限公司 Speech enhancement method, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108287824A (en) * 2018-03-07 2018-07-17 北京云知声信息技术有限公司 Semantic similarity calculation method and device
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition
CN108986838A (en) * 2018-09-18 2018-12-11 东北大学 A kind of adaptive voice separation method based on auditory localization
CN109599124A (en) * 2018-11-23 2019-04-09 腾讯科技(深圳)有限公司 A kind of audio data processing method, device and storage medium
CN109785838A (en) * 2019-01-28 2019-05-21 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103