CN112509584B - Sound source position determining method and device and electronic equipment - Google Patents

Sound source position determining method and device and electronic equipment

Info

Publication number: CN112509584B (granted publication of application CN202011405877.0A)
Authority: CN (China)
Prior art keywords: sound, preset, signals, mixed, source position
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112509584A (application publication)
Inventors: 陈孝良, 冯大航, 常乐
Assignee (original and current): Beijing SoundAI Technology Co Ltd
Application filed by Beijing SoundAI Technology Co Ltd; priority to CN202011405877.0A
Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/26: Speech recognition; speech-to-text systems
    • G10L21/0208: Speech enhancement; noise filtering
    • G10L21/0272: Speech enhancement; voice signal separating
    • G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis characterised by the analysis technique
    • G10L25/51: Speech or voice analysis specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the disclosure provide a sound source position determining method and apparatus, an electronic device, and a computer-readable storage medium. The sound source position determining method comprises: acquiring a plurality of mixed sound signals collected by a plurality of microphones, each microphone corresponding to one mixed sound signal; performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, each corresponding to one sound source; performing preset-sound detection on the plurality of sound signals; and determining the sound source position according to at least one sound signal in which the preset sound is detected. By separating the mixed sound and determining the sound source position from the preset sound detected in the separated signals, the method solves the prior-art problem that the sound source position cannot be accurately determined.

Description

Sound source position determining method and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method, an apparatus, an electronic device, and a computer-readable storage medium for determining a sound source position.
Background
As a means of human-machine interaction, speech recognition plays a significant role in freeing people's hands. More and more smart devices now support voice wake-up, which has become a bridge between people and their devices, so keyword-spotting (KWS) based voice wake-up technology is becoming increasingly important.
The market for intelligent in-vehicle voice interaction keeps growing, as voice control while driving enables functions such as making calls, sending text messages, playing music, and navigating. In-vehicle voice interaction requires microphones distributed in the automobile to collect sound. Several microphone layouts are currently in use: a single microphone may pick up the speech signal, or a microphone array formed by a plurality of microphones may do so, and each layout calls for a different signal-processing approach.
Because a great deal of noise is produced while the automobile is running, such as tire noise, wind noise, in-car air-conditioning noise, engine noise, and other driving-environment noise, current signal-processing schemes generally apply noise suppression to the sound collected by the microphones to improve recognition accuracy. However, the prior art can only recognize whether the speech contains the wake-up word; it cannot accurately determine the position from which the wake-up word was uttered. In addition, because several people may speak in the vehicle at once, wake-up-word detection is not accurate enough. How to obtain, from mixed sound signals, the position of the source that uttered a preset sound is therefore an urgent problem to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a sound source position determining method, including:
Acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
performing sound separation on the mixed sound signals to obtain sound signals, wherein each sound signal corresponds to a sound source;
performing preset sound detection on the plurality of sound signals;
and determining the sound source position of the sound signal according to at least one sound signal in which the preset sound is detected.
Further, the acquiring the plurality of mixed sound signals acquired by the plurality of microphones includes:
acquiring N original sound signals acquired by N microphones, wherein each of the N sound signals is a mixed original sound signal of N sound sources;
And carrying out noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to the M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers which are larger than 1, and M is smaller than or equal to N.
Further, the performing the sound separation on the plurality of mixed voice signals to obtain a plurality of sound signals includes:
Multiplying the mixed sound signals with a preset unmixed matrix to obtain sound signals, wherein each sound signal is the sum of products of the mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the performing voice separation on the plurality of mixed voice signals to obtain a plurality of voice signals includes:
multiplying the M mixed sound signals with a preset unmixed matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the determining the sound source position of the sound signal according to at least one sound signal in which the preset sound is detected includes:
In response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
And taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the calculating, in response to detecting a preset sound in the plurality of sound signals, an energy value of a sound signal corresponding to the preset sound includes:
Calculating an energy value of each sound signal at each time point in the time domain and storing the energy values in a memory;
Acquiring a start time point and an end time point of the preset sound in response to detecting the preset sound in a plurality of the sound signals;
the energy value between the start time point and the end time point is retrieved from the memory.
Further, the step of using the position of the sound source corresponding to the sound signal with the energy value higher than the energy threshold as the sound source position of the sound signal includes:
Screening sound signals with confidence coefficient of preset sound larger than a preset sound threshold value from the sound signals with energy values higher than the energy threshold value;
and taking the sound source position corresponding to the sound signal with the confidence coefficient of the preset sound larger than the preset sound threshold value as the sound source position of the sound signal.
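The energy-and-confidence screening described in the steps above can be sketched as follows. The function name, data layout, and threshold values here are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def locate_wake_source(signals, detections, energy_thresh, conf_thresh):
    """Two-stage screening sketch: among channels in which the preset sound
    (e.g. a wake-up word) was detected, keep those whose energy over the
    word's time span exceeds an energy threshold, then pick the channel
    whose detection confidence exceeds the confidence threshold.
    `signals` is (channels, samples); `detections` maps a channel index
    to (start, end, confidence). All names here are hypothetical."""
    best, best_conf = None, conf_thresh
    for ch, (start, end, conf) in detections.items():
        # energy of the separated signal between the start and end points
        energy = float(np.sum(signals[ch, start:end] ** 2))
        if energy > energy_thresh and conf > best_conf:
            best, best_conf = ch, conf
    return best
```

A channel that passes the energy test but not the confidence test (or vice versa) is rejected, mirroring the two filtering steps in the claims.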
Further, the unmixed matrix is obtained in advance by calculating the following steps:
playing the test sound signals at each sound source position to obtain a plurality of mixed test sound signals collected by the microphone;
Converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform;
acquiring a calculation function of the unmixed matrix;
Adding direction constraint in the calculation function of the unmixed matrix and selecting a priori probability density function to obtain a cost function;
And performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a unmixed matrix with the minimum value of the cost function as the preset unmixed matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and the sound source position of the sound signal is a wake-up position.
Further, the method further comprises:
and executing the function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
In a second aspect, an embodiment of the present disclosure provides a sound source position determining apparatus including:
the mixed sound signal acquisition module is used for acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
the mixed sound signal separation module is used for carrying out sound separation on the mixed sound signals to obtain sound signals, wherein each sound signal corresponds to one sound source;
the preset sound detection module is used for carrying out preset sound detection on the plurality of sound signals;
And the sound source position determining module is used for determining the sound source position of the sound signal according to at least one sound signal in which the preset sound is detected.
Further, the mixed sound signal acquisition module is further configured to:
acquiring N original sound signals acquired by N microphones, wherein each of the N sound signals is a mixed original sound signal of N sound sources;
And carrying out noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to the M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers which are larger than 1, and M is smaller than or equal to N.
Further, the mixed sound signal separation module is further configured to:
Multiplying the mixed sound signals with a preset unmixed matrix to obtain sound signals, wherein each sound signal is the sum of products of the mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the mixed sound signal separation module is further configured to:
multiplying the M mixed sound signals with a preset unmixed matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the sound source position determining module is further configured to:
In response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
And taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the sound source position determining module is further configured to:
Calculating an energy value of each sound signal at each time point in the time domain and storing the energy values in a memory;
Acquiring a start time point and an end time point of the preset sound in response to detecting the preset sound in a plurality of the sound signals;
the energy value between the start time point and the end time point is retrieved from the memory.
Further, the sound source position determining module is further configured to:
Screening sound signals with confidence coefficient of preset sound larger than a preset sound threshold value from the sound signals with energy values higher than the energy threshold value;
and taking the sound source position corresponding to the sound signal with the confidence coefficient of the preset sound larger than the preset sound threshold value as the sound source position of the sound signal.
Further, the unmixed matrix is obtained in advance by calculating the following steps:
playing the test sound signals at each sound source position to obtain a plurality of mixed test sound signals collected by the microphone;
Converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform;
acquiring a calculation function of the unmixed matrix;
Adding direction constraint in the calculation function of the unmixed matrix and selecting a priori probability density function to obtain a cost function;
And performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a unmixed matrix with the minimum value of the cost function as the preset unmixed matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and the sound source position of the sound signal is a wake-up position.
Further, the sound source position determining apparatus further includes:
and the function execution module is used for executing the function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and
A memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform any one of the methods of the first aspect.
The foregoing is only an overview of the technical solutions of the present disclosure. In order that the above and other objects, features, and advantages of the present disclosure may be more clearly understood, and that the disclosure may be implemented as described, a detailed description of preferred embodiments is given below with reference to the accompanying drawings.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic view of an application scenario in an embodiment of the disclosure;
fig. 2 is a flowchart of a sound source position determining method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a specific implementation manner of step S201 of the sound source position determining method according to the embodiment of the present disclosure;
FIG. 4 is a flowchart of the steps for calculating the unmixing matrix in a sound source position determining method according to an embodiment of the disclosure;
fig. 5 is a schematic diagram of a specific implementation manner of step S204 of the sound source position determining method according to the embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of an embodiment of a voice wake-up device according to an embodiment of the disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a schematic diagram of an application scenario in an embodiment of the disclosure. Fig. 1 shows the layout of the microphones in a vehicle: in this exemplary layout, 4 microphones are installed in the vehicle, above the driver's seat, the front passenger seat, and the two rear seats respectively. The 4 microphones collect the sound signals in the vehicle simultaneously. When voice input is detected, the speech is sent to the speech recognition device for recognition; when the target speech (i.e., the wake-up word) is recognized, the wake-up position is determined from the sound source position of the recognized target speech, and the function corresponding to the wake-up word is executed, so that the place where the function is executed can be determined in combination with the position of the wake-up word.
Fig. 2 is a flowchart of an embodiment of a sound source position determining method according to an embodiment of the present disclosure. The method of this embodiment may be performed by a sound source position determining apparatus, which may be implemented in software or in a combination of software and hardware, and which may be integrated into a device of a sound source position determining system, such as a sound source position determining server or a sound source position determining terminal device. As shown in fig. 2, the method comprises the following steps:
step S201, a plurality of mixed sound signals collected by a plurality of microphones are obtained, wherein each microphone corresponds to one mixed sound signal;
The plurality of microphones is illustratively a plurality of microphones arranged in a specific space, such as a plurality of microphones laid out in a room, in a vehicle. The present disclosure embodiment is described by taking a microphone disposed in a vehicle as an example, but it is understood that the plurality of microphones may be a plurality of microphones disposed in any space, and is not limited herein.
In the embodiment of the present disclosure, the voice signal is used as the sound signal for illustration, but the sound signal in the technical solution in the present disclosure is not limited to the voice signal, and any other sound signal may use the solution in the present disclosure to determine the sound source position, which is not described herein.
In an embodiment of the present disclosure, the sound signal collected by each microphone is a mixture of the speech signals received from a plurality of sound sources. In the present disclosure, the number of microphones is equal to the number of sound sources. The mixed speech signal can be expressed in the time domain by the following equation (1):

x_j(t) = Σ_{i=1}^{M} a_{ji} · s_i(t)   (1)

where x_j(t) is the mixed speech signal received by the j-th microphone at time t; a_{ji} is a weighting coefficient determined by the impulse response function; s_i(t) is the i-th sound source signal; and M is the number of microphones and of sound sources.
As can be seen from the above formula, the mixed speech signal is determined in the time domain by each sound source signal and its weighting coefficients, wherein both the weighting coefficients and the sound source signals are unknown.
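As an illustration of this time-domain mixing model, the following NumPy sketch builds equation (1) with arbitrary made-up weighting coefficients and random stand-in sources; in a real deployment both are unknown, which is exactly why separation is needed:

```python
import numpy as np

# Equation (1): each microphone j observes x_j(t) = sum_i a_ji * s_i(t),
# a weighted sum of all source signals. The weights a_ji (set by the
# impulse responses) and the sources s_i(t) are unknown in practice;
# the values below are arbitrary stand-ins chosen only to show the shapes.
rng = np.random.default_rng(0)
M = 4                                      # microphones == sound sources
T = 16000                                  # one second of samples at 16 kHz
sources = rng.standard_normal((M, T))      # s_i(t), i = 1..M
A = rng.uniform(0.1, 1.0, size=(M, M))     # weighting coefficients a_ji
mixed = A @ sources                        # x_j(t): one mixture per microphone
```

Row j of `mixed` is the signal the j-th microphone would record: every source appears in every microphone, scaled by its coefficient.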
Optionally, the step S201 further includes:
Step S301, N original sound signals acquired by N microphones are acquired, wherein each of the N sound signals is a mixed original signal of N sound sources;
Step S302, performing noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to the M microphones, where each of the M mixed sound signals is a mixed sound signal of M sound sources, where M and N are positive integers greater than 1, and M is less than or equal to N.
In this embodiment, after the mixed original signals of the plurality of sound sources are acquired, noise reduction processing is applied first. With N microphones and N sound sources, M mixed speech signals, each a mixture of the signals of M sound sources, are obtained after noise reduction. Two cases arise. If sound zones are not distinguished, the mixed speech signals after noise reduction contain the signals of all the sound sources. If sound zones are distinguished, the noise reduction process filters out the quieter sounds as noise, so only the signals of some of the sources are retained, and the number of retained sources equals the number of microphones used in that zone. For example, in an automobile with 1 microphone at each of the 4 seats (4 microphones in total), each microphone receives the signals of the sound sources at all 4 seats. Without sound zones, the 4 source signals are noise-reduced to give mixtures of the 4 source signals, and the numbers of microphones and sound sources are both 4. With sound zones, the automobile can be divided into a front-row zone and a rear-row zone: the rear-row source signals received by the two front microphones are filtered out as noise, keeping only the two front-row sources, and likewise the front-row source signals are filtered out for the rear row, keeping the two rear-row sources. In that case, the mixed speech signals collected by the two front (or two rear) microphones contain only the front-row (or rear-row) source signals, and the numbers of microphones and sound sources are both 2.
It will be appreciated that the noise reduction process may use a pre-trained noise reduction model, for example a DNN (deep neural network) model trained on noise data measured inside the vehicle, which can effectively suppress in-vehicle environmental noise such as tire noise, wind noise, air-conditioning noise, engine noise, and faint background speech.
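The DNN noise-reduction model described above is trained on measured in-car noise and cannot be reproduced here. As a loose stand-in only, the sketch below applies simple per-frame spectral gating (zeroing low-magnitude spectrum bins); the frame size and threshold are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def spectral_gate(x, frame=512, floor_db=-20.0):
    """Crude per-frame magnitude gating: zero spectrum bins more than
    `floor_db` below the frame's peak magnitude. A hypothetical stand-in
    for the DNN noise-reduction model described in the text."""
    n_frames = len(x) // frame
    out = np.zeros(n_frames * frame)
    win = np.hanning(frame)
    for k in range(n_frames):
        seg = x[k * frame:(k + 1) * frame]
        spec = np.fft.rfft(seg * win)           # frame spectrum
        mag = np.abs(spec)
        gate = mag.max() * 10 ** (floor_db / 20.0)
        spec[mag < gate] = 0                    # drop low-energy bins as "noise"
        out[k * frame:(k + 1) * frame] = np.fft.irfft(spec, n=frame)
    return out
```

Since gating only removes spectral components, the output never carries more energy than the input; a learned model would instead estimate a per-bin mask from training data.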
Step S202, performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to one sound source;
In this step, a plurality of voice signals are separated from the plurality of mixed voice signals, each separated voice signal corresponding to one sound source.
Optionally, the step S202 includes:
Multiplying the plurality of mixed sound signals by a preset unmixing matrix to obtain the plurality of sound signals, wherein each sound signal is the sum of products of the mixed sound signals and the unmixing coefficients in the unmixing matrix. Formula (1) above expresses the mixed speech signal in the time domain; in actual processing it usually needs to be converted to the frequency domain. Assume the number of microphones and the number of sound sources are both K, let the time-frame index be n ∈ {1, …, N}, where N is the total number of frames of a speech segment, and let the frequency index be f ∈ {1, …, F}. The speech signals collected by the microphones can then be expressed in matrix form as formula (2):

x_{f,n} = A_f s_{f,n}   (2)

where x_{f,n} is the short-time Fourier transform of x, s_{f,n} is the short-time Fourier transform of s, and A_f ∈ C^{K×K} is the matrix of acoustic transfer functions at frequency f. Assuming A_f is invertible, there exists a matrix W_f such that:

y_{f,n} = W_f x_{f,n}   (3)

Then W_f is the unmixing matrix for x_{f,n}, and y_{f,n} is the vector of unmixed source signals.

Thus, when the unmixing matrix W_f has been calculated in advance, each sound source can be separated from the mixed speech signals using it. Denote the k-th of the K unmixing filters, i.e. the k-th row of W_f, by w_{f,k} ∈ C^{K}; the sum of the products of the plurality of mixed speech signals x_{f,n} with these unmixing coefficients, y_{f,n,k} = w_{f,k}^H x_{f,n}, is the separated speech signal of the k-th sound source.
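A minimal sketch of applying a precomputed unmixing matrix per frequency bin, as in equation (3). The arrays below are random placeholders for the STFT of the mixtures and for W_f; in the method, W_f comes from the offline calibration described later:

```python
import numpy as np

# Equation (3), y_{f,n} = W_f x_{f,n}: multiply each STFT frame of the
# mixtures by the unmixing matrix of its frequency bin. X and W are
# random stand-ins here, chosen only to show the shapes involved.
rng = np.random.default_rng(1)
F, K, N = 257, 4, 100                      # frequency bins, channels, frames
X = rng.standard_normal((F, K, N)) + 1j * rng.standard_normal((F, K, N))
W = rng.standard_normal((F, K, K)) + 1j * rng.standard_normal((F, K, K))
# Each separated channel k is the sum of products of the K mixed signals
# with row k of the unmixing coefficients, independently per frequency bin.
Y = np.einsum('fkj,fjn->fkn', W, X)
```

An inverse STFT of each channel of `Y` would then give the separated time-domain signals.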
Optionally, the step S202 includes:
multiplying the M mixed sound signals with a preset unmixed matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and unmixed coefficients in the unmixed matrix.
In this alternative embodiment, there are multiple preset unmixed matrices, each corresponding to one sound zone with M microphones and M sound sources. For example, with 4 microphones mounted above the 4 seats of an automobile and 4 sound sources at the 4 seats, the cabin may be divided into 2 sound zones: the front two microphones form one zone using unmixed matrix W1, and the rear two microphones form another zone using unmixed matrix W2, where W1 and W2 are each pre-calculated for their particular microphones and sound source positions.
Wherein, optionally, the step of pre-calculating the unmixed matrix is as follows:
Step S401, playing test sound signals at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone;
Step S402, converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform;
Step S403, obtaining a calculation function of the unmixed matrix;
step S404, adding a direction constraint in the calculation function of the unmixed matrix and selecting a priori probability density function to obtain a cost function;
step S405, performing iterative computation according to the plurality of frequency domain mixed test sound signals and a preset number of iterations, to obtain the unmixed matrix that minimizes the cost function as the preset unmixed matrix.
In step S401, a test speech signal, such as a clean speech signal, is played at each sound source position, and a plurality of mixed test speech signals collected by the plurality of microphones are obtained; these are time domain signals, as shown in the above formula (1). In step S402, the time domain signals obtained in step S401 are converted into the frequency domain through a short-time Fourier transform, producing a plurality of frequency domain mixed test speech signals, as shown in the above formula (2). In step S403, a calculation function of the unmixed matrix is obtained; for example, it may be derived from the maximum a posteriori probability optimization problem, as follows:
W = argmin_{W in Ω} Σ_{k=1}^{K} Σ_{n=1}^{N} G(y_{k,n}) - N Σ_{f=1}^{F} log|det W_f| (4)

where Ω is the set of unmixed matrices {W_f}, y_{k,n} = (y_{k,1,n}, ..., y_{k,F,n}) collects the k-th separated source over all frequencies, and the sound source model G(y_{k,n}) = -log p(y_{k,n}) is introduced.
In step S404, after adding a direction constraint to the above formula (4) and selecting the prior probability density function, a cost function of the unmixed matrix is obtained:
J(W) = Σ_{k=1}^{K} Σ_{n=1}^{N} G(y_{k,n}) - N Σ_{f=1}^{F} log|det W_f| + λ Σ_{f=1}^{F} Σ_{k=1}^{K} |w_{f,k}^H h_{f,k} - 1|^2 (5)

where h_{f,k} is the steering vector toward the k-th source, whose elements are of the form e^{-j 2π f d_k sin θ / c}, d_k is the distance between the microphones, c is the propagation velocity of sound in air, θ is the direction of the sound source, and λ is a preset value, typically 1.
In step S405, according to the plurality of frequency domain mixed test speech signals and the preset number of iterations, iterative computation is performed using a predetermined method to obtain the unmixed matrix that minimizes formula (5), which is used as the preset unmixed matrix. The predetermined method may typically be gradient descent or the like; alternatively, in the present disclosure, the following iterative method may be used to calculate the unmixed matrix:
Let the covariance matrix of the weighted microphone signals be:

V_{f,k} = (1/N) Σ_{n=1}^{N} φ(r_{k,n}) x_{f,n} x_{f,n}^H (6)

where the weight is φ(r_{k,n}) = G'(r_{k,n}) / r_{k,n} and r_{k,n} = (Σ_f |y_{k,f,n}|^2)^{1/2} is the norm of the k-th separated source at frame n.
Then:

w_{f,k} = (W_f V_{f,k})^{-1} e_k (7)

where e_k denotes the canonical unit vector whose k-th element is 1.
By initialising W_f^{(0)}, for example to the identity matrix, the value of w_{f,k} can be calculated, and the iteration proceeds by repeatedly applying equation (7), where l is the iteration index.
Finally, through the normalisation:

w_{f,k} ← w_{f,k} / (w_{f,k}^H V_{f,k} w_{f,k})^{1/2} (8)

w_{f,k} can be calculated, thus yielding W_f.
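The iterative procedure above (weighted covariance, iterative projection, and normalisation) can be sketched as the following loop. This is a simplified sketch under stated assumptions, not the patent's exact procedure: it assumes a spherical Laplacian source prior, so the weight is 1/r_{k,n}, and it omits the direction-constraint term for brevity.

```python
import numpy as np

def auxiva(X, n_iter=10, eps=1e-8):
    """Simplified iterative-projection estimate of the unmixed matrices.
    X: (F, K, N) complex mixture STFT -> returns W: (F, K, K)."""
    F, K, N = X.shape
    W = np.tile(np.eye(K, dtype=complex), (F, 1, 1))   # initialise W_f to identity
    e = np.eye(K)
    for _ in range(n_iter):
        for k in range(K):
            Y = np.einsum('fkj,fjn->fkn', W, X)        # current separated sources
            r = np.maximum(np.linalg.norm(Y[:, k, :], axis=0), eps)  # r_{k,n}
            # Weighted covariance (Laplacian prior -> weight 1/r).
            V = np.einsum('n,fin,fjn->fij', 1.0 / r, X, X.conj()) / N
            # Iterative projection: w_{f,k} = (W_f V_{f,k})^{-1} e_k.
            rhs = np.tile(e[:, k:k + 1], (F, 1, 1))    # shape (F, K, 1)
            w = np.linalg.solve(W @ V, rhs)[:, :, 0]
            # Normalise so that w^H V w = 1, then store as row k of W_f.
            scale = np.sqrt(np.einsum('fi,fij,fj->f', w.conj(), V, w).real)
            W[:, k, :] = w.conj() / np.maximum(scale, eps)[:, None]
    return W

rng = np.random.default_rng(1)
Xtoy = rng.standard_normal((4, 2, 32)) + 1j * rng.standard_normal((4, 2, 32))
Wout = auxiva(Xtoy, n_iter=3)
```

The direction constraint of formula (5) would add a penalty gradient to each row update; adding it changes the per-row update but not the overall loop structure.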
Step S203, carrying out preset sound detection on the plurality of sound signals;
in the embodiments of the present disclosure, the wake-up word is used as the preset sound for explanation, but the preset sound in the technical scheme of the present disclosure is not limited to wake-up words; any other preset sound may use the scheme of the present disclosure to determine the sound source position, which will not be repeated here.
After the voice signals corresponding to each sound source are separated in step S202, wake-up word detection is performed on each voice signal. The detection can be performed by a pre-trained wake-up word detection model, which is obtained by training on a large number of target wake-up words. Optionally, the output of the wake-up word detection model indicates whether a wake-up word is included. Optionally, the wake-up word detection model may be a binary classification model that detects whether a specific wake-up word is contained in the voice and outputs whether it is contained; or it may be a multi-class model that detects whether one of a plurality of wake-up words is contained in the voice and outputs either that a wake-up word is contained, together with its type, or that no wake-up word is contained.
In one embodiment, when the output result of the wake-up word detection model is that a wake-up word is included, the model also outputs the confidence of the wake-up word, that is, the probability that the wake-up word is contained in the voice. When the wake-up word detection model classifies, it does so according to the calculated confidence of the wake-up word: for example, if the model is configured such that a confidence greater than 70% for a certain wake-up word indicates that the wake-up word is contained in the voice, then the model outputs the recognition result that the wake-up word is recognized, together with the specific confidence value, such as 80%.
Step S204, determining the sound source position of the sound signal according to at least one sound signal of the detected preset sound.
The sound signal and the position of the sound source have a corresponding relationship, and the corresponding relationship can be preset, for example, the first separated voice signal corresponds to the driver's seat, the second separated voice signal corresponds to the co-driver's seat, and the like. Thus, the sound source position of the sound signal can be determined according to the sound signal of the detected preset sound, namely, the position of voice wake-up can be determined according to the detected voice signal.
However, in an in-vehicle environment the sound signals are relatively complex, so a preset sound may be erroneously recognized. To solve this problem, optionally, the step S204 includes:
step S501, in response to detecting a preset sound in at least one of the sound signals, calculating an energy value of a sound signal corresponding to the preset sound;
In step S502, the sound source position corresponding to the sound signal with the energy value higher than the energy threshold is used as the sound source position of the sound signal.
A speech signal carries energy: the louder the speech, the greater the energy. Therefore, to prevent misrecognition, whether a person is actually speaking can be judged by detecting the energy of the voice signal, and low-energy signals are filtered out as misrecognized, which increases recognition accuracy. The energy value can be calculated in the time domain of the signal: if x_t is the amplitude of the speech signal at time t, the energy at that time is x_t^2, and the energy over a time period is the sum E = Σ_t x_t^2. When the energy in the time period corresponding to the wake-up word is greater than the preset energy threshold, the wake-up word is determined to be valid, and the sound source position corresponding to that voice signal is taken as the voice wake-up position.
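The energy computation and thresholding described above can be sketched as follows; the threshold value is an assumed placeholder (in practice it would be tuned per deployment):

```python
import numpy as np

def segment_energy(x):
    """Energy of a speech segment: E = sum_t x_t^2, where x_t are amplitudes."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x * x))

# A loud utterance carries more energy than a quiet one, so a threshold
# can filter out low-energy (likely mis-recognised) wake-word hits.
ENERGY_THRESHOLD = 1.0   # assumed value, not from the patent
loud = segment_energy([0.8, -0.9, 0.7])     # 0.64 + 0.81 + 0.49 = 1.94
quiet = segment_energy([0.05, -0.04, 0.06]) # 0.0077
```

Here `loud` clears the threshold while `quiet` is filtered out, mirroring the validity check on the wake-up word's time period.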
Optionally, the step S501 further includes:
Calculating an energy value of each sound signal at each time point in the time domain and storing the energy values in a memory;
Acquiring a start time point and an end time point of the preset sound in response to detecting the preset sound in a plurality of the sound signals;
the energy value between the start time point and the end time point is retrieved from the memory.
In this embodiment, the energy value of each speech signal at each time point in the time domain is calculated in real time and stored in a memory, such as a cache. When a wake-up word is detected in a voice signal, the position of the wake-up word in that signal, namely its start time point and end time point, can be obtained. The energy values corresponding to each time point between these two points are then retrieved from the memory according to the start and end time points, and their sum is calculated as the energy value of the wake-up word.
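The buffering scheme in this embodiment can be sketched as follows; the class and method names are illustrative assumptions, not from the patent:

```python
class EnergyBuffer:
    """Stores per-time-point energies x_t^2 as samples arrive; when a
    wake word is detected, the energy between its start and end time
    points is retrieved from the buffer and summed."""

    def __init__(self):
        self.energies = []              # energy value at each time point

    def push(self, amplitude):
        self.energies.append(amplitude * amplitude)

    def wake_word_energy(self, start, end):
        # Sum of stored energies over the [start, end) time indices.
        return float(sum(self.energies[start:end]))

buf = EnergyBuffer()
for a in [0.0, 0.5, 0.5, 0.5, 0.0]:
    buf.push(a)
e = buf.wake_word_energy(1, 4)          # wake word spans time points 1..3
```

In a real system the buffer would be a fixed-size ring buffer so memory stays bounded, since wake-word start/end points always lie within the recent past.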
In another embodiment, even if the energy value is higher than the preset energy threshold, the wake-up word may still have been falsely detected due to misjudgment by the wake-up word detection model, so the step S502 further includes:
Screening sound signals with confidence coefficient of preset sound larger than a preset sound threshold value from the sound signals with energy values higher than the energy threshold value;
and taking the sound source position corresponding to the sound signal with the confidence coefficient of the preset sound larger than the preset sound threshold value as the sound source position of the sound signal.
For example, suppose the microphones collect a mixed voice signal from 4 sound sources, and the passenger sitting on the left side of the rear row speaks a wake-up word loudly. Because the voice is relatively loud, the wake-up word may also be detected in the voice signals corresponding to other positions, but with different confidence levels: say the confidence of the wake-up word detected at the driver's position is 70%, at the rear right 75%, and at the rear left 90%. A wake-up threshold for the wake-up word confidence, for example 85%, can then be set; the driver and rear-right positions are filtered out by this threshold, and the rear-left position is finally determined to be the voice wake-up position.
It is understood that when the wake-up words in multiple voice signals satisfy the energy condition, or satisfy both the energy and confidence conditions, multiple positions may be determined to be voice wake-up positions.
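The two-stage filtering described above (energy threshold first, then confidence threshold, possibly yielding several positions) can be sketched as follows, using the confidence values from the example; names and numeric values are illustrative:

```python
def wake_positions(detections, energy_threshold, conf_threshold):
    """Keep positions whose signal clears the energy threshold and whose
    wake-word confidence clears the confidence threshold."""
    return [pos for pos, energy, conf in detections
            if energy > energy_threshold and conf > conf_threshold]

detections = [                    # (position, energy, confidence)
    ("driver", 2.0, 0.70),
    ("rear-right", 2.0, 0.75),
    ("rear-left", 5.0, 0.90),
]
positions = wake_positions(detections, energy_threshold=1.0, conf_threshold=0.85)
```

With these values only the rear-left position survives both filters; if several positions cleared both thresholds, the list would contain all of them.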
Through the step S204, the emission position of the preset sound can be more accurately determined, so that the function corresponding to the preset sound can be more accurately executed.
Further, the method further comprises: executing the function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal. For example, in an automobile voice control scenario, when a user in the car says "open the window", the window corresponding to the wake-up position can be opened instead of all windows by determining the wake-up position of the wake-up voice; when the user says "turn down the air conditioner temperature", it is possible to judge whether the user is located in the front or rear row of the car and perform zoned control of the temperature.
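A minimal sketch of this zone-aware command execution; the position names, zone mapping, and returned strings are hypothetical illustrations, not taken from the patent:

```python
def execute_command(command, position):
    """Dispatch a spoken command so it acts only on the zone of the
    detected wake-up position (hypothetical positions/zones)."""
    if command == "open the window":
        return f"open window at {position}"
    if command == "turn down the air conditioner temperature":
        zone = "front" if position in ("driver", "co-driver") else "rear"
        return f"lower temperature in {zone} zone"
    return "unknown command"

r1 = execute_command("open the window", "rear-left")
r2 = execute_command("turn down the air conditioner temperature", "driver")
```

The key design point is that the wake-up position determined in step S204 becomes an extra argument to command execution, so the same utterance maps to different actuators depending on where it was spoken.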
The embodiment of the disclosure discloses a sound source position determining method, which comprises the following steps: acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal; performing sound separation on the mixed sound signals to obtain sound signals, wherein each sound signal corresponds to a sound source; performing preset sound detection on the plurality of sound signals; and determining the sound source position of the sound signal according to at least one sound signal of which the preset sound is detected. According to the method, the mixed sound is separated, and the sound source position is determined according to the preset sound detected by the separated sound, so that the technical problem that the sound source position cannot be accurately judged in the prior art is solved.
Although the steps in the above method embodiments are described in the order given, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in that order; they may also be performed in reverse order, in parallel, interleaved, or in other orders, and those skilled in the art may add other steps on the basis of the above steps. These obvious modifications or equivalent substitutions are also included in the protection scope of the present disclosure and will not be repeated here.
Fig. 6 is a schematic structural diagram of an embodiment of a sound source position determining apparatus according to an embodiment of the present disclosure, and as shown in fig. 6, the apparatus 600 includes: a mixed sound signal acquisition module 601, a mixed sound signal separation module 602, a preset sound detection module 603, and a sound source position determination module 604. Wherein,
A mixed sound signal obtaining module 601, configured to obtain a plurality of mixed sound signals collected by a plurality of microphones, where each microphone corresponds to one mixed sound signal;
A mixed sound signal separation module 602, configured to perform sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, where each sound signal corresponds to one sound source;
a preset sound detection module 603, configured to perform preset sound detection on the plurality of sound signals;
the sound source position determining module 604 is configured to determine a sound source position of the sound signal according to at least one sound signal that detects the preset sound.
Further, the mixed sound signal acquisition module 601 is further configured to:
acquiring N original sound signals acquired by N microphones, wherein each of the N original sound signals is a mixed original signal of N sound sources;
And carrying out noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to the M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers which are larger than 1, and M is smaller than or equal to N.
Further, the mixed sound signal separation module 602 is further configured to:
Multiplying the mixed sound signals with a preset unmixed matrix to obtain sound signals, wherein each sound signal is the sum of products of the mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the mixed sound signal separation module 602 is further configured to:
multiplying the M mixed sound signals with a preset unmixed matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and unmixed coefficients in the unmixed matrix.
Further, the sound source position determining module 604 is further configured to:
In response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
And taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the sound source position determining module 604 is further configured to:
Calculating an energy value of each sound signal at each time point in the time domain and storing the energy values in a memory;
Acquiring a start time point and an end time point of the preset sound in response to detecting the preset sound in a plurality of the sound signals;
the energy value between the start time point and the end time point is retrieved from the memory.
Further, the sound source position determining module 604 is further configured to:
Screening sound signals with confidence coefficient of preset sound larger than a preset sound threshold value from the sound signals with energy values higher than the energy threshold value;
and taking the sound source position corresponding to the sound signal with the confidence coefficient of the preset sound larger than the preset sound threshold value as the sound source position of the sound signal.
Further, the unmixed matrix is obtained in advance by calculating the following steps:
playing the test sound signals at each sound source position to obtain a plurality of mixed test sound signals collected by the microphone;
Converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform;
acquiring a calculation function of the unmixed matrix;
Adding direction constraint in the calculation function of the unmixed matrix and selecting a priori probability density function to obtain a cost function;
And performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a unmixed matrix with the minimum value of the cost function as the preset unmixed matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and the sound source position of the sound signal is a wake-up position.
Further, the sound source position determining apparatus further includes:
and the function execution module is used for executing the function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
The apparatus of fig. 6 may perform the method of the embodiment of fig. 2-5, and reference is made to the relevant description of the embodiment of fig. 2-5 for parts of this embodiment not described in detail. The implementation process and the technical effect of this technical solution are described in the embodiments shown in fig. 2 to 5, and are not described herein.
Referring now to fig. 7, a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processor, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
In general, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 700 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 709, or installed from storage 708, or installed from ROM 702. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 701.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the above sound source position determination method is performed.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims (10)

1. A sound source position determining method, characterized by comprising:
Acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
Multiplying the mixed sound signals with a preset unmixed matrix to obtain sound signals, wherein each sound signal is the sum of products of the mixed sound signals and unmixed coefficients in the unmixed matrix, and each sound signal corresponds to one sound source;
performing preset sound detection on the plurality of sound signals;
In response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
Screening sound signals with confidence coefficient of preset sound larger than a preset sound threshold value from the sound signals with energy values higher than the energy threshold value, wherein the confidence coefficient of the preset sound refers to the probability that the sound signals comprise the preset sound;
Taking the sound source position corresponding to the sound signal with the confidence coefficient of the preset sound larger than the preset sound threshold value as the sound source position of the sound signal;
Wherein the unmixed matrix is obtained in advance by calculation by:
Playing test sound signals at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone, wherein the mixed test sound signals are time domain signals; converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform; acquiring a calculation function of the unmixed matrix; adding direction constraint in the calculation function of the unmixed matrix and selecting a priori probability density function to obtain a cost function; and performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a unmixed matrix with the minimum value of the cost function as the preset unmixed matrix.
2. The sound source position determining method according to claim 1, wherein the acquiring a plurality of mixed sound signals acquired by a plurality of microphones includes:
acquiring N original sound signals acquired by N microphones, wherein each of the N original sound signals is a mixed original signal of N sound sources;
And carrying out noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to the M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers which are larger than 1, and M is smaller than or equal to N.
3. The sound source position determining method as claimed in claim 2, wherein multiplying the plurality of mixed sound signals by the preset unmixed matrix to obtain the plurality of sound signals comprises:
multiplying the M mixed sound signals by the preset unmixed matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of the products of the M mixed sound signals and the unmixed coefficients in the unmixed matrix.
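The multiplication in claim 3 amounts to applying one M×M unmixed matrix per frequency bin, each separated signal being a sum of products of the mixed signals and the unmixed coefficients. A minimal sketch, with illustrative array shapes and names not taken from the patent:

```python
import numpy as np

def demix(mixed, W):
    """mixed: complex array of shape (F, M) - M mixed signals at F frequency bins.
    W: complex array of shape (F, M, M) - one preset unmixed matrix per bin.
    Each separated signal is y[f, i] = sum_j W[f, i, j] * mixed[f, j]."""
    separated = np.empty_like(mixed)
    for f in range(mixed.shape[0]):
        separated[f] = W[f] @ mixed[f]  # matrix-vector product per frequency bin
    return separated
```

With the identity as the unmixed matrix the mixtures pass through unchanged; a permutation matrix simply reorders the sources.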
4. The sound source position determining method as claimed in claim 1, wherein calculating the energy value of the sound signal corresponding to the preset sound in response to detecting the preset sound in at least one of the sound signals comprises:
calculating an energy value of each sound signal at each time point in the time domain and storing the energy values in a memory;
acquiring a start time point and an end time point of the preset sound in response to detecting the preset sound in at least one of the sound signals; and
retrieving the energy values between the start time point and the end time point from the memory.
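The per-time-point energy bookkeeping in claim 4 can be sketched as follows. The frame length, function names, and the array used as the "memory" are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def frame_energies(signal, frame_len):
    """Energy value of the signal at each time point (frame) in the time
    domain: the sum of squared samples per frame, stored for later retrieval."""
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len]).reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

def energy_between(energies, start_frame, end_frame):
    """Retrieve the total energy between the detected start and end time
    points of the preset sound from the stored per-frame energies."""
    return float(np.sum(energies[start_frame:end_frame + 1]))
```

Storing the energies up front means the detector only needs the start and end time points of the preset sound to recover its energy, as the claim describes.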
5. The sound source position determining method according to claim 1, wherein the number of microphones is equal to the number of sound sources.
6. The sound source position determining method according to any one of claims 1 to 5, wherein the sound signal is a voice signal, the preset sound is a wake-up word, and the sound source position of the sound signal is a wake-up position.
7. The sound source position determining method according to claim 1, characterized in that the method further comprises:
and executing the function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
8. A sound source position determining apparatus, characterized by comprising:
a mixed sound signal acquisition module, configured to acquire a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
a mixed sound signal separation module, configured to multiply the plurality of mixed sound signals by a preset unmixed matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of the products of the mixed sound signals and the unmixed coefficients in the unmixed matrix, and each sound signal corresponds to one sound source;
a preset sound detection module, configured to perform preset sound detection on the plurality of sound signals; and
a sound source position determining module, configured to: in response to detecting a preset sound in at least one of the sound signals, calculate an energy value of the sound signal corresponding to the preset sound; screen, from the sound signals whose energy values are higher than the energy threshold, the sound signals whose confidence of the preset sound is greater than a preset sound threshold, wherein the confidence of the preset sound refers to the probability that a sound signal comprises the preset sound; and take the sound source position corresponding to the sound signal whose confidence of the preset sound is greater than the preset sound threshold as the sound source position of the sound signal;
wherein the unmixed matrix is obtained in advance by: playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by the microphones, the mixed test sound signals being time-domain signals; converting the plurality of mixed test sound signals into a plurality of frequency-domain mixed test sound signals by short-time Fourier transform; acquiring a calculation function of the unmixed matrix; adding a direction constraint to the calculation function of the unmixed matrix and selecting a prior probability density function to obtain a cost function; and performing iterative calculation on the plurality of frequency-domain mixed test sound signals for a preset number of iterations to obtain the unmixed matrix that minimizes the cost function as the preset unmixed matrix.
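The offline computation of the unmixed matrix described above (time-domain recordings, short-time Fourier transform, iterative minimization of a cost function) can be sketched end to end. This is an illustrative stand-in, not the patent's algorithm: the STFT uses a Hann window with hypothetical frame and hop sizes, and a plain natural-gradient ICA update under a Laplacian prior replaces the patent's direction-constrained cost function.

```python
import numpy as np

def naive_stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform of a time-domain signal (Hann window)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop:t * hop + frame_len] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape (T frames, F bins)

def precompute_demixing(mixed_tests, n_iter=20, mu=0.1, frame_len=256, hop=128):
    """Estimate one unmixed matrix per frequency bin from frequency-domain
    mixed test signals by iterating for a preset number of iterations."""
    X = np.stack([naive_stft(x, frame_len, hop) for x in mixed_tests])  # (M, T, F)
    M, T, F = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))  # (F, M, M), identity init
    for _ in range(n_iter):  # preset number of iterations
        for f in range(F):
            Y = W[f] @ X[:, :, f]             # separated signals at bin f
            phi = Y / (np.abs(Y) + 1e-9)      # score function for a Laplacian prior
            grad = np.eye(M) - (phi @ Y.conj().T) / T
            W[f] = W[f] + mu * grad @ W[f]    # natural-gradient step
    return W
```

In practice, auxiliary-function updates (as in AuxIVA-style methods) are commonly preferred over plain gradient steps for this kind of frequency-domain source separation.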
9. An electronic device, comprising: a memory for storing computer-readable instructions; and
a processor for executing the computer-readable instructions, wherein the processor, when executing the instructions, implements the method according to any one of claims 1-7.
10. A non-transitory computer readable storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202011405877.0A 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment Active CN112509584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011405877.0A CN112509584B (en) 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112509584A CN112509584A (en) 2021-03-16
CN112509584B true CN112509584B (en) 2024-08-20

Family

ID=74969946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011405877.0A Active CN112509584B (en) 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112509584B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380267B (en) * 2021-04-30 2024-04-19 深圳地平线机器人科技有限公司 Method and device for positioning voice zone, storage medium and electronic equipment
CN113223548B (en) * 2021-05-07 2022-11-22 北京小米移动软件有限公司 Sound source positioning method and device
CN113362864B (en) * 2021-06-16 2022-08-02 北京字节跳动网络技术有限公司 Audio signal processing method, device, storage medium and electronic equipment
DE102021120246A1 (en) 2021-08-04 2023-02-09 Bayerische Motoren Werke Aktiengesellschaft voice recognition system
CN114290988A (en) * 2022-01-06 2022-04-08 北京地平线机器人技术研发有限公司 Method and device for adjusting light in vehicle, electronic equipment and storage medium
CN114678021B (en) * 2022-03-23 2023-03-10 小米汽车科技有限公司 Audio signal processing method and device, storage medium and vehicle
CN115440208A (en) * 2022-04-15 2022-12-06 北京罗克维尔斯科技有限公司 Vehicle control method, device, equipment and computer readable storage medium
CN115346527A (en) * 2022-08-08 2022-11-15 科大讯飞股份有限公司 Voice control method, device, system, vehicle and storage medium
CN115547327A (en) * 2022-09-23 2022-12-30 中国第一汽车股份有限公司 Data transmission method and device, storage medium and target vehicle

Citations (1)

Publication number Priority date Publication date Assignee Title
CN110673096A (en) * 2019-09-30 2020-01-10 北京地平线机器人技术研发有限公司 Voice positioning method and device, computer readable storage medium and electronic equipment

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US9099096B2 (en) * 2012-05-04 2015-08-04 Sony Computer Entertainment Inc. Source separation by independent component analysis with moving constraint
CN109410978B (en) * 2018-11-06 2021-11-09 北京如布科技有限公司 Voice signal separation method and device, electronic equipment and storage medium
CN109308909B (en) * 2018-11-06 2022-07-15 北京如布科技有限公司 Signal separation method and device, electronic equipment and storage medium
US20200184994A1 (en) * 2018-12-07 2020-06-11 Nuance Communications, Inc. System and method for acoustic localization of multiple sources using spatial pre-filtering
CN110164468B (en) * 2019-04-25 2022-01-28 上海大学 Speech enhancement method and device based on double microphones
CN110992977B (en) * 2019-12-03 2021-06-22 北京声智科技有限公司 Method and device for extracting target sound source
CN111883135A (en) * 2020-07-28 2020-11-03 北京声智科技有限公司 Voice transcription method and device and electronic equipment


Also Published As

Publication number Publication date
CN112509584A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112509584B (en) Sound source position determining method and device and electronic equipment
US9953634B1 (en) Passive training for automatic speech recognition
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
US12046237B2 (en) Speech interaction method and apparatus, computer readable storage medium and electronic device
CN113986187B (en) Audio region amplitude acquisition method and device, electronic equipment and storage medium
US9311930B2 (en) Audio based system and method for in-vehicle context classification
CN109920410B (en) Apparatus and method for determining reliability of recommendation based on environment of vehicle
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN111883135A (en) Voice transcription method and device and electronic equipment
CN110673096A (en) Voice positioning method and device, computer readable storage medium and electronic equipment
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
CN107274892A (en) Method for distinguishing speek person and device
US10757248B1 (en) Identifying location of mobile phones in a vehicle
US10991363B2 (en) Priors adaptation for conservative training of acoustic model
CN112382266B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113763976B (en) Noise reduction method and device for audio signal, readable medium and electronic equipment
CN116580713A (en) Vehicle-mounted voice recognition method, device, equipment and storage medium
US20190214037A1 (en) Recommendation device, recommendation method, and non-transitory computer-readable storage medium storing recommendation program
CN113113038B (en) Echo cancellation method and device and electronic equipment
CN110941455B (en) Active wake-up method and device and electronic equipment
CN111653271B (en) Sample data acquisition and model training method and device and computer equipment
CN114882879A (en) Audio noise reduction method, method and device for determining mapping information and electronic equipment
CN114299975A (en) Voice noise reduction method and device, computer equipment and storage medium
CN114220177A (en) Lip syllable recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant