CN111383629B - Voice processing method and device, electronic equipment and storage medium


Info

Publication number
CN111383629B
Authority
CN
China
Prior art keywords
sound source
voice data
processed
frame
arrival angle
Prior art date
Legal status
Active
Application number
CN202010199908.5A
Other languages
Chinese (zh)
Other versions
CN111383629A (en)
Inventor
张铖
Current Assignee
Shenzhen Weiai Intelligent Co ltd
Original Assignee
Shenzhen Weiai Intelligent Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Weiai Intelligent Co ltd
Priority to CN202010199908.5A
Publication of CN111383629A
Application granted
Publication of CN111383629B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 - Microphone arrays; Beamforming
    • G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An embodiment of the disclosure provides a voice processing method and device, an electronic device, and a storage medium. One embodiment of the method comprises: calculating an arrival angle for each frame of voice data in the to-be-processed voice data collected by a microphone array; estimating and updating a Gaussian mixture model and a target sound source identifier set based on those arrival angles, and determining the sound source identifier corresponding to each arrival angle; determining, among the target sound source identifiers in the set and the ambient noise sound source identifier, the identifier corresponding to the largest number of arrival angles as the primary sound source identifier; and, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, performing automatic gain control on the to-be-processed voice data before outputting it. The embodiment dynamically adjusts the gain according to the target sound source, so that ambient noise sources do not affect the effect of the automatic gain control.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a voice processing method and device, electronic equipment and a storage medium.
Background
Automatic gain control (AGC) is a commonly used technique that keeps the amplitude of a signal within a target range through dynamic gain adjustment. AGC tracks the mean signal amplitude, attenuating the signal when the mean exceeds the upper limit of the target range and amplifying it when the mean falls below the lower limit.
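A non-limiting sketch of this mean-tracking behavior in Python; the function name, thresholds, smoothing factor, and gain steps are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def simple_agc(frame, state, upper=0.35, lower=0.15, alpha=0.95):
    """Minimal time-domain AGC sketch: track the running mean amplitude
    and step the gain down when the gained signal exceeds the upper
    limit, up when it falls below the lower limit."""
    level = np.mean(np.abs(frame))
    state["mean"] = alpha * state["mean"] + (1 - alpha) * level
    if state["mean"] * state["gain"] > upper:
        state["gain"] *= 0.98   # attenuate
    elif state["mean"] * state["gain"] < lower:
        state["gain"] *= 1.02   # amplify
    return frame * state["gain"]

state = {"mean": 0.0, "gain": 1.0}
out = simple_agc(np.random.randn(160) * 0.05, state)  # one 10 ms frame at 16 kHz
```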
Traditional AGC is based on time-domain energy and was designed for handheld calls, but it has the following drawbacks in hands-free calling or speech recognition scenarios. When a user speaks in the far field, the sound reaching the microphone is weak and the signal-to-noise ratio is low, so the AGC typically operates in amplification mode and applies a large gain. However, small near-field noises (such as mouse clicks or page turning) are then also amplified considerably, and the AGC output contains objectionable loud noise. Conversely, if there is loud background noise in the near field of the microphone, the AGC tends to compress the signal amplitude, so the voice of a speaker in the far field is compressed as well and may be unintelligible in the output. Similar problems arise when multiple users speak at different distances from the microphone. The root cause is that traditional AGC cannot distinguish between different target sound sources, or between a target sound source and ambient noise. In a call scenario this produces noise interference at the receiving end or makes a far-field speaker unintelligible; in a speech recognition scenario it degrades the recognition rate.
Disclosure of Invention
The embodiment of the disclosure provides a voice processing method and device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a speech processing method, the method comprising: calculating, for each frame of voice data in the to-be-processed voice data collected by a microphone array, an arrival angle corresponding to that frame; estimating and updating model parameters of a Gaussian mixture model based on the arrival angles corresponding to the frames of the to-be-processed voice data, correspondingly updating a target sound source identifier set, and determining the sound source identifier corresponding to each arrival angle, wherein the Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the target sound source identifier set, each Gaussian model characterizes the distribution of arrival angles, relative to the microphone array, of the target sound source indicated by its corresponding target sound source identifier, and the sound source identifier corresponding to an arrival angle is either a target sound source identifier in the set or an ambient noise sound source identifier indicating a non-target sound source; determining, among the target sound source identifiers in the set and the ambient noise sound source identifier, the identifier corresponding to the largest number of arrival angles as the primary sound source identifier corresponding to the to-be-processed voice data, where the number of arrival angles corresponding to an identifier is the number of arrival angles, among those corresponding to the frames of the to-be-processed voice data, whose sound source identifier is that identifier; and, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, performing automatic gain control on the to-be-processed voice data and outputting it.
In some embodiments, the method further comprises: in response to determining that the primary sound source identifier is the ambient noise sound source identifier, outputting the to-be-processed voice data directly; or, in response to determining that the primary sound source identifier is the ambient noise sound source identifier, outputting the to-be-processed voice data after applying a preset constant gain.
In some embodiments, the calculating, for each frame of voice data in the to-be-processed voice data collected by the microphone array, an arrival angle corresponding to the frame comprises: calculating, for each frame, an arrival angle and a corresponding arrival angle confidence; and the estimating and updating of the Gaussian mixture model parameters, the corresponding updating of the target sound source identifier set, and the determining of the sound source identifier corresponding to each arrival angle comprise: in response to determining that a preset ambient noise condition is met, determining the sound source identifier corresponding to each arrival angle as the ambient noise sound source identifier, where the preset ambient noise condition includes at least one of the following: the mean arrival angle confidence over the frames of the to-be-processed voice data is smaller than a preset confidence threshold; the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold; and the variance of the arrival angles corresponding to the frames of the to-be-processed voice data is larger than a preset arrival angle variance threshold, where the number of noise arrival angles is the number of arrival angles whose confidence is smaller than the preset confidence threshold, and the total number of arrival angles is the total count of arrival angles over all frames of the to-be-processed voice data; and, in response to determining that the preset ambient noise condition is not met, estimating and updating the model parameters of the Gaussian mixture model based on the arrival angles, correspondingly updating the target sound source identifier set, and determining the sound source identifier corresponding to each arrival angle.
In some embodiments, the performing automatic gain control on the to-be-processed voice data and outputting it comprises: performing automatic gain control according to the average voice amplitude of the target sound source indicated by the primary sound source identifier, and then outputting the to-be-processed voice data.
In some embodiments, the estimating and updating of the model parameters of the Gaussian mixture model based on the arrival angles corresponding to the frames of the to-be-processed voice data comprises: estimating and updating the model parameters of the Gaussian mixture model using the expectation-maximization (EM) algorithm.
In a second aspect, an embodiment of the present disclosure provides a speech processing apparatus, comprising: an arrival angle calculation unit configured to calculate, for each frame of voice data in the to-be-processed voice data collected by a microphone array, an arrival angle corresponding to that frame; a model updating unit configured to estimate and update model parameters of a Gaussian mixture model based on the arrival angles corresponding to the frames of the to-be-processed voice data, correspondingly update a target sound source identifier set, and determine the sound source identifier corresponding to each arrival angle, wherein the Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the set, each Gaussian model characterizes the distribution of arrival angles, relative to the microphone array, of the target sound source indicated by its corresponding identifier, and the sound source identifier corresponding to an arrival angle is either a target sound source identifier in the set or an ambient noise sound source identifier indicating a non-target sound source; a primary sound source determining unit configured to determine, among the target sound source identifiers in the set and the ambient noise sound source identifier, the identifier corresponding to the largest number of arrival angles as the primary sound source identifier corresponding to the to-be-processed voice data, where the number of arrival angles corresponding to an identifier is the number of arrival angles, among those corresponding to the frames of the to-be-processed voice data, whose sound source identifier is that identifier; and a first output unit configured to, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, perform automatic gain control on the to-be-processed voice data and output it.
In some embodiments, the apparatus further comprises: a second output unit configured to output the to-be-processed voice data in response to determining that the primary sound source identifier is the ambient noise sound source identifier; or a third output unit configured to output the to-be-processed voice data after applying a preset constant gain, in response to determining that the primary sound source identifier is the ambient noise sound source identifier.
In some embodiments, the arrival angle calculation unit is further configured to calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, an arrival angle and a corresponding arrival angle confidence; and the model updating unit is further configured to: in response to determining that a preset ambient noise condition is met, determine the sound source identifier corresponding to each arrival angle as the ambient noise sound source identifier, where the preset ambient noise condition includes at least one of the following: the mean arrival angle confidence over the frames of the to-be-processed voice data is smaller than a preset confidence threshold; the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold; and the variance of the arrival angles corresponding to the frames of the to-be-processed voice data is larger than a preset arrival angle variance threshold, where the number of noise arrival angles is the number of arrival angles whose confidence is smaller than the preset confidence threshold, and the total number of arrival angles is the total count of arrival angles over all frames; and, in response to determining that the preset ambient noise condition is not met, estimate and update the model parameters of the Gaussian mixture model based on the arrival angles, correspondingly update the target sound source identifier set, and determine the sound source identifier corresponding to each arrival angle.
In some embodiments, the first output unit is further configured to perform automatic gain control according to the average voice amplitude of the target sound source indicated by the primary sound source identifier before outputting the to-be-processed voice data.
In some embodiments, the model updating unit is further configured to estimate and update the model parameters of the Gaussian mixture model using the expectation-maximization algorithm, based on the arrival angles corresponding to the frames of the to-be-processed voice data.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method as described in any implementation manner of the first aspect.
According to the voice processing method and device, electronic device, and storage medium provided by the embodiments of the disclosure, the arrival angles of each target sound source are assumed to follow a Gaussian distribution, and the Gaussian mixture model corresponding to the target sound sources in the current call is estimated from the arrival angles of the frames of voice data collected during the call. Automatic gain control is applied to voice data only when it is determined to have been uttered by a target sound source, and not when it is determined to be ambient noise. The technical effects may include, but are not limited to: first, in a call scenario, loud noise interference does not occur and the speech of a user in a far-field area remains intelligible; second, in a speech recognition scenario, the recognition rate may be improved.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method of speech processing according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of a speech processing method according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a speech processing method according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of a speech processing apparatus according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a speech processing device of an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the speech processing method or speech processing apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include a voice capture terminal device 101, a network 102, and a voice processing device 103. The network 102 is used to provide a medium of a communication link between the voice collecting terminal apparatus 101 and the voice processing apparatus 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The voice collecting terminal device 101 can collect various sounds of the surrounding environment to generate voice data, and send the collected voice data to the voice processing device 103 through the network 102 for processing.
The voice collecting terminal device 101 may be any of various electronic devices provided with a microphone array for collecting sounds of the surrounding environment to generate voice data. The microphone array may take various forms, which the present disclosure does not specifically limit. For example, an additive microphone array, a differential microphone array, a planar microphone array, a three-dimensional (3-D) microphone array, or the like may be used.
In some optional implementations, the voice collecting terminal device 101 may send the collected voice data through the network 102 to the voice processing device 103, which processes it. Accordingly, the voice processing method provided by the embodiments of the present disclosure may be executed by the voice processing device 103, and the voice processing apparatus may likewise be disposed in the voice processing device 103.
In some optional implementations, the voice capture terminal device 101 may also directly process the captured voice data, and in this case, the system architecture 100 may not include the voice processing device 103. For example, the voice collecting terminal apparatus 101 may be a conference telephone terminal apparatus. Accordingly, the voice processing method provided by the embodiment of the present disclosure may be executed by the voice collecting terminal device 101, and the voice processing apparatus may also be disposed in the voice collecting terminal device 101.
The speech processing device 103 may be hardware or software. When the speech processing device 103 is hardware, it may be implemented as a distributed cluster formed by a plurality of speech processing devices, or may be implemented as a single speech processing device. When the speech processing device 103 is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed speech processing services) or as a single software or software module. And is not particularly limited herein.
It should be understood that the number of voice collecting terminal devices, networks, and voice processing devices in fig. 1 is merely illustrative. Any number of voice acquisition terminal devices, networks and voice processing devices may be present, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a speech processing method according to the present disclosure is shown. The voice processing method comprises the following steps:
Step 201, for each frame of voice data in the to-be-processed voice data collected by the microphone array, calculating the arrival angle corresponding to that frame.
In this embodiment, the execution body of the speech processing method (e.g., the speech processing apparatus shown in fig. 1) may calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, the arrival angle corresponding to that frame.
In some alternative implementations, the microphone array that collects the to-be-processed voice data may be located in the same electronic device as the execution body. In this case, the execution body can locally acquire the to-be-processed voice data collected by the microphone array and then calculate the arrival angle corresponding to each frame.
In some alternative implementations, the microphone array that collects the to-be-processed voice data may also be located in a different electronic device from the execution body. In this case, the execution body can receive the to-be-processed voice data over the network from the electronic device where the microphone array is located and then calculate the arrival angle corresponding to each frame.
In some alternative implementations, the to-be-processed voice data may be segments of voice data collected by the microphone array in real time, for example segments of a specified duration or a specified number of data frames. For instance, if each frame of voice data acquired by the microphone array corresponds to 10 milliseconds of speech, then 200 consecutive frames (i.e., 2 seconds) of voice data collected in real time can be taken as the to-be-processed voice data. The to-be-processed voice data may also be collected using a sliding window: each acquisition covers a preset window length of frames, and the first frame of each new window is shifted backwards by a preset sliding step (in frames) relative to the first frame of the previous window, where both the preset window length and the preset sliding step can be customized.
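A minimal sketch of this sliding-window segmentation; the window length and step below are illustrative, not prescribed by the disclosure:

```python
def sliding_windows(frames, window_len=200, step=100):
    """Yield to-be-processed segments from a stream of per-frame data:
    each segment spans window_len frames, and each new segment starts
    step frames after the previous one."""
    for start in range(0, len(frames) - window_len + 1, step):
        yield frames[start:start + window_len]

# With 10 ms frames, a 200-frame window covers 2 s of audio and each
# new window begins 1 s after the previous one.
for segment in sliding_windows(list(range(1000))):
    pass  # treat each segment as one batch of voice data to be processed
```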
In some optional implementations, the to-be-processed voice data may also be a segment of the recording collected by the microphone array for a given call.
Here, the execution body may employ various known and future-developed microphone-array-based sound source localization methods when calculating the arrival angle corresponding to each frame of voice data. For example, microphone-array-based sound source localization methods may include, but are not limited to: time difference of arrival (TDOA), generalized cross-correlation (GCC), high-resolution spectrum estimation (HRSE), and the like.
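As a non-limiting illustration of one such method, the sketch below estimates a per-frame arrival angle with PHAT-weighted generalized cross-correlation (GCC-PHAT) for a single microphone pair; the sample rate, microphone spacing, and the crude peak-ratio confidence are assumptions made for illustration only:

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs=16000, mic_dist=0.05, c=343.0):
    """Estimate (arrival angle in degrees, confidence) for one frame
    captured by two microphones spaced mic_dist metres apart."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting
    cc = np.fft.irfft(cross, n)
    max_shift = max(1, int(fs * mic_dist / c))      # physically possible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs  # time difference of arrival
    conf = np.max(np.abs(cc)) / (np.sum(np.abs(cc)) + 1e-12)  # crude confidence
    angle = np.degrees(np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0)))
    return angle, conf
```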
It should be noted that, in practice, when a microphone-array-based sound source localization method is applied to a frame of the to-be-processed voice data, it generally yields at least one candidate arrival angle together with related parameters such as a confidence, an energy intensity, or a speech density for each candidate. To improve the accuracy of the subsequent Gaussian mixture model estimation, the candidates whose confidence, energy intensity, or speech density exceeds a corresponding preset threshold may be determined as the arrival angles for that frame. Alternatively, the single candidate with the highest confidence, energy intensity, or speech density may be taken. Alternatively again, a preset number (a positive integer) of candidates with the highest confidence, energy intensity, or speech density may be taken.
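A minimal sketch of this candidate selection, assuming the localization step returns (angle, confidence) pairs per frame; the threshold, top_k, and the fallback to the single best candidate are illustrative choices rather than requirements of the disclosure:

```python
def select_angles(candidates, conf_threshold=0.5, top_k=None):
    """candidates: list of (angle, confidence) pairs for one frame.
    Keep the top_k most confident angles if top_k is given; otherwise
    keep angles above the threshold, falling back to the single most
    confident candidate when none pass."""
    if top_k is not None:
        ranked = sorted(candidates, key=lambda ac: ac[1], reverse=True)
        return [angle for angle, _ in ranked[:top_k]]
    kept = [angle for angle, conf in candidates if conf > conf_threshold]
    if kept:
        return kept
    return [max(candidates, key=lambda ac: ac[1])[0]]
```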
Step 202, based on the arrival angles corresponding to the frames of the to-be-processed voice data, estimating and updating the model parameters of the Gaussian mixture model, correspondingly updating the target sound source identifier set, and determining the sound source identifier corresponding to each arrival angle.
In order to distinguish and identify, during the call in which the microphone array collects the to-be-processed voice data, each target sound source in the surrounding environment (i.e., a sound source that is relatively fixed in position and emits meaningful sound, such as a speaking person) from ambient noise sources (i.e., sources of meaningless sound, such as background noise from page turning, mouse clicking, or keyboard typing), the inventors observed through analysis that the distribution of each target sound source's arrival angles relative to the microphone array essentially follows a Gaussian distribution. A Gaussian model can therefore be established for each target sound source, whose mean is the mean arrival angle of that source relative to the microphone array. Further, a Gaussian mixture model comprising the Gaussian models of all target sound sources can be established for a given call.
Therefore, after the arrival angle corresponding to each frame of the to-be-processed voice data has been calculated in step 201, in step 202 the execution body may estimate and update the model parameters of the Gaussian mixture model, correspondingly update the target sound source identifier set, and determine the sound source identifier corresponding to each arrival angle. The Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the set; each Gaussian model characterizes the distribution of arrival angles, relative to the microphone array, of the target sound source indicated by its corresponding identifier; and the sound source identifier corresponding to an arrival angle is either a target sound source identifier in the set or an ambient noise sound source identifier indicating a non-target sound source.
It should be noted that the Gaussian mixture model here may be the Gaussian mixture model corresponding to the call in which the to-be-processed voice data was collected, and likewise the target sound source identifier set may be the set corresponding to that call. In practice, before a new call starts, no Gaussian mixture model for it exists; the model is gradually estimated and updated during the call from the voice data successively collected by the microphone array. Similarly, the target sound source identifier set for the call does not exist beforehand and is updated along with the estimation and updating of the Gaussian mixture model. Specifically, there are the following cases:
First, if the to-be-processed voice data is collected at the initial stage of the call, so that the call has not yet produced any arrival angle data, or has produced only a small amount, or the arrival angles produced so far are not concentrated, then the Gaussian mixture model corresponding to the call has not yet formed any Gaussian model.
In this case, in step 202, the model parameters of the Gaussian mixture model are estimated and updated as follows: a Gaussian mixture model is estimated based on the arrival angles corresponding to the frames of the to-be-processed voice data together with the arrival angles corresponding to the frames collected earlier in the same call. The estimation result falls into one of two cases, described below.
first, at least one new gaussian model is estimated. In this case, the gaussian mixture model may be a linear combination of at least one new gaussian model resulting from the above estimation. The executing main body may update the target sound source identifier by adding a target sound source identifier corresponding to each newly obtained gaussian model to the target sound source identifier set, and record a one-to-one correspondence relationship between each gaussian model in the gaussian mixture model and the corresponding target sound source identifier in the target sound source identifier set. In the estimation process, each arrival angle corresponding to each frame of voice data in the voice data to be processed may be classified into a certain gaussian model of the estimated gaussian mixture model or not classified into any gaussian model of the estimated gaussian mixture model. If a certain arrival angle is classified into a certain gaussian model of the estimated gaussian mixture model, the executing entity may determine the sound source identifier corresponding to the arrival angle as the target sound source identifier corresponding to the gaussian model into which the arrival angle is classified, that is, the arrival angle conforms to the arrival angle distribution of the target sound source indicated by the sound source identifier corresponding to the arrival angle. If the arrival angle is not classified into any gaussian model of the estimated gaussian mixture models, the executing entity may determine the sound source identifier corresponding to the arrival angle as the ambient noise sound source identifier, that is, the arrival angle does not conform to the target sound source indicated by any target sound source identifier in the target sound source identifier set, and the arrival angle is the arrival angle corresponding to the ambient noise sound source.
In the second case, no new Gaussian model is estimated. No update of the target sound source identifier set is then needed, and the sound source identifier corresponding to every arrival angle of the to-be-processed voice data can be determined as the ambient noise sound source identifier.
Second, the Gaussian mixture model corresponding to the call already includes at least one Gaussian model. In this case, in step 202, the model parameters of the Gaussian mixture model are estimated and updated as follows: the Gaussian mixture model is estimated based on the arrival angles corresponding to the frames of the to-be-processed voice data, and the estimation result again falls into one of two cases, described below.
first, at least one new gaussian model is estimated. While the model parameters of the gaussian models originally present in the gaussian mixture model may remain unchanged or may be modified in the course of the estimation. In this case, the gaussian mixture model may be a linear combination of the at least one new gaussian model estimated as described above and the originally existing gaussian model of the modified or unmodified parameters. The executing main body may update the target sound source identifier by adding a target sound source identifier corresponding to each newly obtained gaussian model to the target sound source identifier set, and record a one-to-one correspondence relationship between each gaussian model in the estimated gaussian mixture model and a corresponding target sound source identifier in the target sound source identifier set. In the estimation process, each arrival angle corresponding to each frame of speech data in the speech data to be processed may be classified into a certain gaussian model of the estimated gaussian mixture model or not classified into any gaussian model of the estimated gaussian mixture model. If a certain arrival angle is classified into a certain gaussian model of the estimated gaussian mixture model, the executing entity may determine the sound source identifier corresponding to the arrival angle as the target sound source identifier corresponding to the gaussian model into which the arrival angle is classified, that is, the arrival angle conforms to the arrival angle distribution of the target sound source indicated by the sound source identifier corresponding to the arrival angle. If the arrival angle is not classified into any gaussian model of the estimated gaussian mixture models, the executing entity may determine the sound source identifier corresponding to the arrival angle as the ambient noise sound source identifier, that is, the arrival angle does not conform to the target sound source indicated by any target sound source identifier in the target sound source identifier set, and the arrival angle is the arrival angle corresponding to the ambient noise sound source.
In the second case, no new Gaussian model is estimated. No update of the target sound source identifier set is then needed, and the sound source identifier corresponding to every arrival angle of the to-be-processed voice data can be determined as the ambient noise sound source identifier.
It should be noted that estimating a Gaussian mixture model from data is a widely studied and applied existing technique, which the present disclosure does not limit. For example, the model parameters of the Gaussian mixture model may be estimated and updated from the arrival angles of the frames of the to-be-processed voice data using the expectation-maximization (EM) algorithm (also known as the Dempster-Laird-Rubin algorithm).
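As a non-limiting sketch of this step, the code below fits a one-dimensional Gaussian mixture to a batch of arrival angles with EM (here via scikit-learn's GaussianMixture) and labels poorly explained angles as ambient noise. The component count, the log-likelihood cutoff, and fitting each batch from scratch (instead of updating one persistent per-call model, as the embodiment describes) are simplifying assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def assign_sources(angles, max_sources=4, noise_loglik=-8.0):
    """Return (labels, fitted mixture): label k means the angle was
    assigned to the k-th Gaussian (one target sound source); -1 marks
    angles the mixture explains too poorly, treated as ambient noise."""
    X = np.asarray(angles, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=min(max_sources, len(X))).fit(X)
    labels = gmm.predict(X)             # most likely Gaussian per angle
    loglik = gmm.score_samples(X)       # per-angle log-likelihood
    labels[loglik < noise_loglik] = -1  # unlikely under every Gaussian
    return labels, gmm

labels, gmm = assign_sources([30.1, 29.5, 31.0, 120.4, 119.8, 75.0])
```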
Step 203, determining, among the target sound source identifiers in the target sound source identifier set and the ambient noise sound source identifier, the identifier with the largest number of corresponding arrival angles as the primary sound source identifier of the to-be-processed voice data.
In this embodiment, the execution body may first count, for each target sound source identifier in the target sound source identifier set, how many of the arrival angles corresponding to the frames of the to-be-processed voice data have that identifier as their sound source identifier, and likewise count how many have the ambient noise sound source identifier. Then, among the target sound source identifiers in the set and the ambient noise sound source identifier, the execution body may determine the identifier with the largest count of arrival angles as the primary sound source identifier corresponding to the to-be-processed voice data.
In practice, during a typical call, one person speaks at a time while the others are silent, possibly with weak ambient noise present; or no one speaks and there is only ambient noise. A segment of to-be-processed voice data collected by the microphone array is short, so within it either one person is speaking or there is only ambient noise. It can therefore be assumed that the to-be-processed voice data has exactly one primary sound source, either a target sound source or the noise source, and the arrival angles of that primary source should account for the highest proportion of the arrival angles across the frames of the segment. Hence the identifier with the largest number of arrival angles, among the target sound source identifiers in the set and the ambient noise sound source identifier, can be determined as the primary sound source identifier; that is, the sound source it indicates is the primary sound source of the to-be-processed voice data.
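A minimal sketch of this majority vote, assuming each arrival angle of the segment has already been assigned a sound source identifier (here -1 is an illustrative ambient noise identifier):

```python
from collections import Counter

NOISE_ID = -1  # illustrative identifier for the ambient noise sound source

def primary_source(source_ids):
    """source_ids: one sound source identifier per arrival angle of the
    to-be-processed segment; return the identifier that accounts for
    the most arrival angles."""
    return Counter(source_ids).most_common(1)[0][0]

# e.g. 7 angles from source 0, 2 from source 1, 1 ambient noise angle
assert primary_source([0] * 7 + [1] * 2 + [NOISE_ID]) == 0
```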
Step 204, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, outputting the to-be-processed voice data after automatic gain control.
In this embodiment, after the primary sound source identifier corresponding to the to-be-processed voice data has been determined in step 203, the execution body may check whether it is the ambient noise sound source identifier. If it is not, then the primary sound source identifier is a target sound source identifier in the target sound source identifier set; that is, the sound source of the to-be-processed voice data is a target sound source rather than the ambient noise source, so the to-be-processed voice data can be output after automatic gain control.
It should be noted that how to perform automatic gain control on voice data is a widely studied and applied existing technique, which the present disclosure does not specifically limit.
In some alternative implementations, step 204 may proceed as follows: the to-be-processed voice data is output after automatic gain control performed according to the average voice amplitude of the target sound source indicated by the primary sound source identifier. In a scene with more than one speaker, the active speaker changes often; that is, the system switches back and forth between target sound sources frequently. During a switch between two target sound sources, a conventional AGC needs a relatively long attack time before its output falls within the desired amplitude range, so the signal output during the attack time is often outside that range. The listener then hears the output alternately too loud and too quiet, or the speech recognition rate drops. To solve this, a separate automatic gain control instance may be established for each target sound source and switched along with the target sound source. Specifically, the voice amplitude of each target sound source may be tracked during the call, and its average voice amplitude continuously updated. When automatic gain control is performed on to-be-processed voice data belonging to a given target sound source, it is performed according to that source's average voice amplitude, so that the output signal amplitude stays within the desired range even when the target sound source switches.
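A minimal sketch of such per-source gain control, keeping one tracked average amplitude (and hence one gain) per target sound source so that switching speakers does not restart an attack phase; the target level and smoothing factor are illustrative assumptions:

```python
import numpy as np

class PerSourceAGC:
    """One AGC state per target sound source identifier."""

    def __init__(self, target=0.25, alpha=0.9):
        self.target, self.alpha = target, alpha
        self.avg = {}  # source identifier -> tracked average amplitude

    def process(self, source_id, segment):
        level = np.mean(np.abs(segment))
        prev = self.avg.get(source_id, level)
        self.avg[source_id] = self.alpha * prev + (1 - self.alpha) * level
        gain = self.target / (self.avg[source_id] + 1e-12)
        return segment * gain

agc = PerSourceAGC()
out = agc.process(0, np.random.randn(3200) * 0.02)  # segment from source 0
```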
In addition, the to-be-processed voice data, after automatic gain control, may be output to various software or hardware circuits, devices, or electronic equipment. For example, it may be output directly to a speaker for playback, passed directly to a speech recognition application, or sent to other electronic devices over the network for playback or speech recognition.
In some optional implementations, the execution body may further perform the following operations: in response to determining that the primary sound source identifier is the ambient noise sound source identifier, outputting the to-be-processed voice data directly, or outputting it after applying a preset constant gain. That is, if the to-be-processed voice data was not emitted by a target sound source but is ambient noise, it may be output directly, or with a preset constant gain, rather than after automatic gain control tuned to a target sound source; this avoids greatly amplifying small near-field noises (such as mouse clicks or page turning) into objectionable loud noise.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the speech processing method according to the present embodiment. In the application scenario of fig. 3, users 301 and 302 may speak at any time, and the microphone array 303 collects sounds in the current environment in real time, which may include the voice of user 301, the voice of user 302, and ambient noise, generating the to-be-processed voice data 304 accordingly; for example, 2 seconds of to-be-processed voice data are acquired at a time. The speech processing device 305 then acquires the to-be-processed voice data 304 and calculates the arrival angle 3051 corresponding to each frame. Based on the arrival angles 3051, the speech processing device 305 estimates and updates the model parameters of the Gaussian mixture model 3052, correspondingly updates the target sound source identifier set 3053, and determines the sound source identifier 3054 corresponding to each arrival angle. Next, the speech processing device 305 determines the identifier with the largest number of arrival angles, among the target sound source identifiers in the set 3053 and the ambient noise sound source identifier, as the primary sound source identifier 3055 of the to-be-processed voice data 304. If the primary sound source identifier 3055 is a target sound source identifier in the set 3053, the speech processing device 305 performs automatic gain control on the to-be-processed voice data 304 and outputs the result. If the primary sound source identifier 3055 is the ambient noise sound source identifier 3056, the to-be-processed voice data 304 is output directly or after a preset constant gain is applied.
It should be noted that the microphone array 303 shown in fig. 3 may be independent from the speech processing device 305, and the microphone array 303 may also be a part of the speech processing device 305.
The method provided by the above embodiment of the disclosure distinguishes target sound sources from the ambient noise source and then performs automatic gain control only on voice data from a target sound source, not on voice data from the ambient noise source, so that ambient noise does not affect the effect of the automatic gain control.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a speech processing method is shown. The flow 400 of the speech processing method includes the following steps:
Step 401, for each frame of voice data in the to-be-processed voice data collected by the microphone array, calculating the arrival angle corresponding to that frame and a corresponding arrival angle confidence.
In this embodiment, the execution body of the speech processing method (e.g., the speech processing device shown in fig. 1) may calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, the arrival angle corresponding to that frame and a corresponding arrival angle confidence.
In some alternative implementations, the microphone array that collects the to-be-processed voice data may be located in the same electronic device as the execution body. In this case, the execution body can locally acquire the to-be-processed voice data collected by the microphone array and then calculate, for each frame, the corresponding arrival angle and arrival angle confidence.
In some alternative implementations, the microphone array that collects the to-be-processed voice data may also be located in a different electronic device from the execution body. In this case, the execution body can receive the to-be-processed voice data over the network from the electronic device where the microphone array is located and then calculate, for each frame, the corresponding arrival angle and arrival angle confidence.
In some alternative implementations, the to-be-processed voice data may be segments of voice data collected by the microphone array in real time, for example segments of a specified duration or a specified number of data frames. For instance, if each frame of voice data acquired by the microphone array corresponds to 10 milliseconds of speech, then 200 consecutive frames (i.e., 2 seconds) of voice data collected in real time can be taken as the to-be-processed voice data. The to-be-processed voice data may also be collected using a sliding window: each acquisition covers a preset window length of frames, and the first frame of each new window is shifted backwards by a preset sliding step (in frames) relative to the first frame of the previous window, where both the preset window length and the preset sliding step can be customized.
In some optional implementations, the to-be-processed voice data may also be a segment of the recording collected by the microphone array for a given call.
Here, the execution body may employ various known and future-developed microphone-array-based sound source localization methods when calculating the arrival angle and corresponding arrival angle confidence for each frame of voice data. For example, such methods may include, but are not limited to: time difference of arrival, generalized cross-correlation, high-resolution spectrum estimation, and the like.
Here, for each frame of the voice data to be processed, the arrival angle corresponding to that frame may be chosen from the at least one arrival angle calculated for it in any of three ways: keep every calculated angle whose confidence exceeds a preset confidence threshold; keep the single angle with the highest confidence; or keep a preset number (a positive integer) of angles with the highest confidences. These three selection rules are sketched below.
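A minimal sketch of the three selection rules, assuming candidates arrive as (angle, confidence) pairs; the mode names and default values are illustrative.

```python
def select_angles(candidates, mode="best", threshold=0.5, top_k=3):
    """Pick the arrival angle(s) for one frame from (angle, confidence)
    candidates, using one of the three rules described above."""
    if mode == "threshold":                    # all angles above threshold
        return [a for a, c in candidates if c > threshold]
    ranked = sorted(candidates, key=lambda ac: ac[1], reverse=True)
    if mode == "best":                         # single most confident angle
        return [ranked[0][0]] if ranked else []
    return [a for a, _ in ranked[:top_k]]      # top_k most confident angles
```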
Step 402: determine whether a preset ambient noise condition is satisfied.
In this embodiment, the execution subject may determine whether a preset ambient noise condition is satisfied, execute step 403 if it is, and execute step 404 if it is not.
Here, the preset ambient noise condition may include at least one of the following three conditions:
Condition 1: the average arrival angle confidence over the frames of the voice data to be processed is smaller than a preset confidence threshold. In practice, if the voice data to be processed is emitted by a target sound source, the strong directivity of a target source yields high calculated confidences, whereas the weak directivity of ambient noise yields low ones. Therefore, if condition 1 is satisfied, the average confidence is low, and the sound source corresponding to the voice data to be processed is likely an ambient noise source.
Condition 2: the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold. Here, the number of noise arrival angles is the number of arrival angles, among those corresponding to the frames of the voice data to be processed, whose confidence is smaller than the preset confidence threshold, and the total number of arrival angles is the count of all arrival angles corresponding to those frames. Following the reasoning for condition 1, if condition 2 is satisfied, many of the arrival angles carry low confidence, and the sound source is again likely an ambient noise source.
Condition 3: the variance of the arrival angles corresponding to the frames of the voice data to be processed is greater than a preset arrival angle variance threshold. In practice, the strong directivity of a target source concentrates the calculated arrival angles, whereas the weak directivity of ambient noise scatters them unevenly. Therefore, if condition 3 is satisfied, the arrival angles are widely dispersed, and the sound source is likely an ambient noise source.
Since the preset ambient noise condition includes at least one of the three conditions above, its satisfaction indicates a high probability that the voice data to be processed is ambient noise, whose arrival angles are unsuitable for estimating and updating the Gaussian mixture model; the process then goes to step 403. If the condition is not satisfied, the probability of ambient noise is not high, and the process goes to step 404 for further operations. A sketch of the combined check follows.
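A minimal sketch of the combined ambient-noise check; all three threshold values are illustrative placeholders for quantities the disclosure leaves to be preset.

```python
import numpy as np

def ambient_noise_condition(angles, confidences, conf_threshold=0.5,
                            ratio_threshold=0.6, var_threshold=900.0):
    """Return True when any of the three preset conditions holds for
    the per-frame arrival angles of one segment."""
    angles = np.asarray(angles, dtype=float)
    confs = np.asarray(confidences, dtype=float)
    mean_conf_low = confs.mean() < conf_threshold                         # condition 1
    noise_ratio_high = np.mean(confs < conf_threshold) > ratio_threshold  # condition 2
    angle_var_high = angles.var() > var_threshold                         # condition 3
    return bool(mean_conf_low or noise_ratio_high or angle_var_high)
```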
Step 403: determine the sound source identifier corresponding to the arrival angle of each frame of the voice data to be processed as the ambient noise sound source identifier.
Here, when step 402 determines that the preset ambient noise condition is satisfied, the execution subject may set the sound source identifier corresponding to each frame's arrival angle to the ambient noise sound source identifier. That is, because the arrival angle confidences are low or the angles are unevenly distributed, the angles are unsuitable for updating the Gaussian mixture model; the model and the corresponding target sound source identifier set are therefore left unchanged, and the identifiers are assigned directly. Skipping the estimation and update of the Gaussian mixture model reduces the amount of computation and speeds up processing.
After step 403 is executed, the process goes to step 405.
Step 404: estimate and update the model parameters of the Gaussian mixture model based on the arrival angle corresponding to each frame of the voice data to be processed, update the target sound source identifier set accordingly, and determine the sound source identifier corresponding to each frame's arrival angle.
Here, when step 402 determines that the preset ambient noise condition is not satisfied, the execution subject may estimate and update the model parameters of the Gaussian mixture model based on the per-frame arrival angles, update the target sound source identifier set accordingly, and determine the sound source identifier corresponding to each frame's arrival angle. That is, when the confidences are not uniformly low and the angles are reasonably concentrated, the arrival angles are suitable for updating the Gaussian mixture model; the target sound source identifier set is then updated correspondingly, and the sound source identifiers are determined from the updated result.
In this embodiment, the specific operation of step 404 is substantially the same as the operation of step 202 in the embodiment shown in FIG. 2, and is not repeated here.
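Although the disclosure's own update procedure is the one described with FIG. 2, the sketch below shows one plausible shape of such a step: re-fitting a one-dimensional Gaussian mixture over the segment's arrival angles with the EM algorithm via scikit-learn. The batch re-fit (rather than an incremental update), the component cap, the weight floor, and the identifier naming are all simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def update_source_model(angles, max_sources=4, weight_floor=0.05):
    """Fit a 1-D Gaussian mixture over the arrival angles with EM and
    keep one target sound source identifier per sufficiently weighted
    component; angles of dropped components count as ambient noise."""
    X = np.asarray(angles, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=min(max_sources, len(X)),
                          covariance_type="full").fit(X)  # EM estimation
    kept = {k for k, w in enumerate(gmm.weights_) if w >= weight_floor}
    labels = gmm.predict(X)        # component index per arrival angle
    source_ids = [f"src_{k}" if k in kept else "ambient_noise"
                  for k in labels]
    return gmm, {f"src_{k}" for k in kept}, source_ids
```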
After step 404 is executed, the process goes to step 405.
Step 405: determine, among the target sound source identifiers in the target sound source identifier set and the ambient noise sound source identifier, the identifier with the largest number of corresponding arrival angles as the primary sound source identifier for the voice data to be processed.
In this embodiment, the specific operation of step 405 is substantially the same as the operation of step 203 in the embodiment shown in FIG. 2, and is not described herein again.
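The majority count can be as simple as the following sketch, assuming the per-angle identifiers produced above; the "ambient_noise" fallback label is an assumption of these sketches.

```python
from collections import Counter

def primary_source(source_ids):
    """Return the identifier accounting for the most arrival angles,
    among all target identifiers and the ambient-noise identifier."""
    counts = Counter(source_ids)   # one entry per frame arrival angle
    return counts.most_common(1)[0][0] if counts else "ambient_noise"
```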
Step 406: in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, perform automatic gain control on the voice data to be processed and output it.
In this embodiment, the specific operation of step 406 is substantially the same as the operation of step 204 in the embodiment shown in FIG. 2, and is not repeated herein.
In some optional implementations, the execution subject may further, in response to determining that the primary sound source identifier is the ambient noise sound source identifier, either output the voice data to be processed as-is, or output it after applying a preset constant gain. That is, if the voice data to be processed is not emitted by a target sound source but is ambient noise, it is output directly or under a fixed preset gain rather than under automatic gain control tuned to a target source's speech, which avoids greatly amplifying near-field faint noises (such as mouse clicks or page turns) into objectionably loud output.
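A minimal sketch of this output stage, assuming normalized floating-point samples; the amplitude-ratio gain rule is one plausible reading of gain control "according to the average voice amplitude of the target source," and all numeric defaults are illustrative.

```python
import numpy as np

def output_segment(samples, primary_id, avg_target_amplitude=None,
                   desired_amplitude=0.1, constant_gain=1.0):
    """AGC toward the target source's average amplitude when the primary
    source is a target; otherwise apply at most a fixed preset gain so
    near-field faint noise is never boosted."""
    samples = np.asarray(samples, dtype=float)
    if primary_id != "ambient_noise" and avg_target_amplitude:
        gain = desired_amplitude / (avg_target_amplitude + 1e-12)
    else:
        gain = constant_gain       # or 1.0 to output unchanged
    return np.clip(samples * gain, -1.0, 1.0)
```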
As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the speech processing method in this embodiment decides in advance, whenever the calculated arrival angle confidences are low or the angles are unevenly distributed (i.e., the preset ambient noise condition is satisfied), that the voice data to be processed comes from an ambient noise source. In that case the per-frame arrival angles are no longer used to estimate and update the model parameters of the Gaussian mixture model, the target sound source identifier set need not be updated, and the sound source identifier corresponding to each frame's arrival angle is set directly to the ambient noise sound source identifier. Only when the preset ambient noise condition is not satisfied are the per-frame arrival angles used to estimate and update the Gaussian mixture model, the target sound source identifier set updated accordingly, and the sound source identifiers determined from the updated result. The scheme described in this embodiment therefore reduces the amount of computation in the voice processing flow and increases its speed.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a speech processing apparatus, which corresponds to the method embodiment shown in FIG. 2 and is particularly applicable to various electronic devices.
As shown in FIG. 5, the speech processing apparatus 500 of the present embodiment includes: an arrival angle calculation unit 501, a model updating unit 502, a primary sound source determination unit 503, and a first output unit 504. The arrival angle calculation unit 501 is configured to calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, the arrival angle corresponding to that frame. The model updating unit 502 is configured to estimate and update the model parameters of the Gaussian mixture model based on the arrival angles corresponding to the frames of the voice data to be processed, to update the target sound source identifier set accordingly, and to determine the sound source identifier corresponding to each frame's arrival angle, where the Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the target sound source identifier set, each Gaussian model represents the distribution, relative to the microphone array, of arrival angles of the target sound source indicated by its corresponding identifier, and the sound source identifier corresponding to an arrival angle is either a target sound source identifier in the set or an ambient noise sound source identifier indicating a non-target source. The primary sound source determination unit 503 is configured to determine, among the target sound source identifiers in the set and the ambient noise sound source identifier, the identifier with the largest number of corresponding arrival angles as the primary sound source identifier for the voice data to be processed, where the number of arrival angles corresponding to an identifier is the count, over all frames of the voice data to be processed, of arrival angles assigned to that identifier. The first output unit 504 is configured to, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, perform automatic gain control on the voice data to be processed and output it.
In this embodiment, the specific processes of the arrival angle calculation unit 501, the model updating unit 502, the primary sound source determination unit 503, and the first output unit 504 of the speech processing apparatus 500, and the technical effects they bring, can refer to the related descriptions of steps 201 through 204 in the corresponding embodiment of FIG. 2, and are not repeated herein.
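To show how the four units compose, here is a hypothetical end-to-end sketch that reuses the helper functions from the earlier sketches (gcc_phat_doa, ambient_noise_condition, update_source_model, primary_source, output_segment) and assumes each element of `frames` holds the two channel arrays of one frame. It illustrates the data flow only, not the apparatus's actual implementation.

```python
def process_segment(frames, samples, fs, mic_distance):
    """One pass over a to-be-processed segment: per-frame arrival angles
    (unit 501), model update or ambient-noise shortcut (unit 502),
    primary-source vote (unit 503), and gain-controlled output (504)."""
    pairs = [gcc_phat_doa(f[0], f[1], mic_distance, fs) for f in frames]
    angles = [a for a, _ in pairs]
    confs = [c for _, c in pairs]
    if ambient_noise_condition(angles, confs):
        source_ids = ["ambient_noise"] * len(angles)  # skip the EM update
    else:
        _, _, source_ids = update_source_model(angles)
    return output_segment(samples, primary_source(source_ids))
```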
In some optional implementations, the apparatus 500 may further include: a second output unit (not shown in FIG. 5) configured to output the voice data to be processed in response to determining that the primary sound source identifier is the ambient noise sound source identifier; or a third output unit (not shown in FIG. 5) configured to output the voice data to be processed after applying a preset constant gain, in response to the same determination.
In some optional implementations, the arrival angle calculation unit 501 may be further configured to calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, the arrival angle corresponding to that frame and the corresponding arrival angle confidence. The model updating unit 502 may be further configured to: in response to determining that a preset ambient noise condition is satisfied, determine the sound source identifier corresponding to each frame's arrival angle as the ambient noise sound source identifier, where the preset ambient noise condition includes at least one of: the average arrival angle confidence over the frames of the voice data to be processed is smaller than a preset confidence threshold; the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold; and the variance of the arrival angles corresponding to the frames is larger than a preset arrival angle variance threshold, where the number of noise arrival angles is the number of arrival angles whose confidence is smaller than the preset confidence threshold and the total number of arrival angles is the count of all arrival angles corresponding to the frames; and, in response to determining that the preset ambient noise condition is not satisfied, estimate and update the model parameters of the Gaussian mixture model based on the per-frame arrival angles, update the target sound source identifier set accordingly, and determine the sound source identifier corresponding to each frame's arrival angle.
In some optional implementations, the first output unit 504 may be further configured to perform automatic gain control on the voice data to be processed according to the average voice amplitude of the target sound source indicated by the primary sound source identifier, and then output it.
In some optional implementations, the model updating unit 502 may be further configured to estimate and update the model parameters of the Gaussian mixture model using the expectation-maximization (EM) algorithm, based on the arrival angle corresponding to each frame of voice data in the voice data to be processed.
It should be noted that details of implementation and technical effects of each unit in the speech processing apparatus provided in the embodiment of the present disclosure may refer to descriptions of other embodiments in the present disclosure, and are not described herein again.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for implementing a speech processing device according to an embodiment of the present disclosure. The speech processing device shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer system 600 includes one or more Central Processing Units (CPUs) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including, for example, a microphone array, a keyboard, a mouse, and a touch screen; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a touch panel, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 609 performs communication processing via a network such as the Internet.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication section 609. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or Python, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an arrival angle calculation unit, a model update unit, a primary sound source determination unit, and a first output unit. The names of these units do not in some cases constitute a limitation on the units themselves, and for example, the first output unit may also be described as a "unit that outputs speech data to be processed after performing automatic gain control".
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: calculate, for each frame of voice data in the to-be-processed voice data collected by the microphone array, the arrival angle corresponding to that frame; estimate and update the model parameters of the Gaussian mixture model based on the per-frame arrival angles, update the target sound source identifier set accordingly, and determine the sound source identifier corresponding to each frame's arrival angle, where the Gaussian models correspond one-to-one to the target sound source identifiers, each Gaussian model represents the distribution, relative to the microphone array, of arrival angles of the target sound source indicated by its corresponding identifier, and the sound source identifier corresponding to an arrival angle is either a target sound source identifier in the set or an ambient noise sound source identifier indicating a non-target source; determine, among the target sound source identifiers in the set and the ambient noise sound source identifier, the identifier with the largest number of corresponding arrival angles as the primary sound source identifier for the voice data to be processed, where the number of arrival angles corresponding to an identifier is the count, over all frames, of arrival angles assigned to that identifier; and, in response to determining that the primary sound source identifier is not the ambient noise sound source identifier, perform automatic gain control on the voice data to be processed and output it.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which the above-mentioned features or their equivalents are combined in any manner without departing from the inventive concept defined above; for example, technical solutions formed by interchanging the above features with (but not limited to) features disclosed herein having similar functions.

Claims (10)

1. A method of speech processing comprising:
calculating an arrival angle corresponding to each frame of voice data in the voice data to be processed, which is acquired by the microphone array;
estimating and updating model parameters of a Gaussian mixture model based on the arrival angle corresponding to each frame of voice data in the voice data to be processed, correspondingly updating a target sound source identifier set, and determining a sound source identifier corresponding to the arrival angle corresponding to each frame of voice data in the voice data to be processed, wherein the Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the target sound source identifier set, each Gaussian model is used for representing the distribution, relative to the microphone array, of arrival angles of the target sound source indicated by the target sound source identifier corresponding to that Gaussian model, and the sound source identifier corresponding to an arrival angle is a target sound source identifier in the target sound source identifier set or an environmental noise sound source identifier used for indicating a non-target sound source;
determining, among the target sound source identifiers in the target sound source identifier set and the environmental noise sound source identifier, the sound source identifier with the largest number of corresponding arrival angles as a main sound source identifier corresponding to the voice data to be processed, wherein the number of arrival angles corresponding to a target sound source identifier or to the environmental noise sound source identifier is the number of arrival angles, among the arrival angles corresponding to the frames of the voice data to be processed, that correspond to that identifier;
and in response to determining that the main sound source identifier is not the environmental noise sound source identifier, performing automatic gain control on the voice data to be processed and then outputting it.
2. The method of claim 1, wherein the method further comprises:
in response to determining that the main sound source identifier is the environmental noise sound source identifier, outputting the voice data to be processed; or
in response to determining that the main sound source identifier is the environmental noise sound source identifier, outputting the voice data to be processed after applying a preset constant gain.
3. The method of claim 2, wherein the calculating, for each frame of speech data in the speech data to be processed collected by the microphone array, an arrival angle corresponding to the frame of speech data comprises:
for each frame of voice data in the voice data to be processed collected by the microphone array, calculating an arrival angle corresponding to the frame of voice data and a corresponding arrival angle confidence coefficient; and
the estimating and updating model parameters of the Gaussian mixture model based on the arrival angle corresponding to each frame of voice data in the voice data to be processed, correspondingly updating a target sound source identifier set, and determining the sound source identifier corresponding to the arrival angle corresponding to each frame of voice data in the voice data to be processed comprises:
in response to determining that a preset environmental noise condition is met, determining the sound source identifier corresponding to the arrival angle corresponding to each frame of voice data in the voice data to be processed as the environmental noise sound source identifier, wherein the preset environmental noise condition includes at least one of: the average of the arrival angle confidences corresponding to the frames of the voice data to be processed is smaller than a preset confidence threshold; the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold; and the variance of the arrival angles corresponding to the frames of the voice data to be processed is larger than a preset arrival angle variance threshold, wherein the number of noise arrival angles is the number of arrival angles, among those corresponding to the frames of the voice data to be processed, whose confidence is smaller than the preset confidence threshold, and the total number of arrival angles is the count of all arrival angles corresponding to those frames;
and in response to determining that the preset environmental noise condition is not met, estimating and updating the model parameters of the Gaussian mixture model based on the arrival angle corresponding to each frame of voice data in the voice data to be processed, correspondingly updating the target sound source identifier set, and determining the sound source identifier corresponding to each frame's arrival angle.
4. The method according to any one of claims 1-3, wherein the performing automatic gain control on the voice data to be processed for output comprises:
performing automatic gain control on the voice data to be processed according to the average voice amplitude of the target sound source indicated by the main sound source identifier, and then outputting it.
5. The method according to claim 4, wherein the estimating and updating the model parameters of the Gaussian mixture model based on the arrival angle corresponding to each frame of speech data in the speech data to be processed comprises:
estimating and updating the model parameters of the Gaussian mixture model using an expectation-maximization algorithm, based on the arrival angle corresponding to each frame of voice data in the voice data to be processed.
6. A speech processing apparatus comprising:
the arrival angle calculation unit is configured to calculate an arrival angle corresponding to each frame of voice data in the voice data to be processed collected by the microphone array;
a model updating unit configured to estimate and update model parameters of a Gaussian mixture model based on the arrival angle corresponding to each frame of voice data in the voice data to be processed, correspondingly update a target sound source identifier set, and determine a sound source identifier corresponding to the arrival angle corresponding to each frame of voice data in the voice data to be processed, wherein the Gaussian models in the Gaussian mixture model correspond one-to-one to the target sound source identifiers in the target sound source identifier set, each Gaussian model is used for representing the distribution, relative to the microphone array, of arrival angles of the target sound source indicated by the target sound source identifier corresponding to that Gaussian model, and the sound source identifier corresponding to an arrival angle is a target sound source identifier in the target sound source identifier set or an environmental noise sound source identifier used for indicating a non-target sound source;
a primary sound source determining unit configured to determine, among the target sound source identifiers in the target sound source identifier set and the environmental noise sound source identifier, the sound source identifier with the largest number of corresponding arrival angles as a primary sound source identifier corresponding to the voice data to be processed, wherein the number of arrival angles corresponding to a target sound source identifier or to the environmental noise sound source identifier is the number of arrival angles, among the arrival angles corresponding to the frames of the voice data to be processed, that correspond to that identifier;
a first output unit configured to output the to-be-processed voice data after performing automatic gain control in response to determining that the primary sound source identifier is not the ambient noise sound source identifier.
7. The apparatus of claim 6, wherein the apparatus further comprises:
a second output unit configured to output the to-be-processed voice data in response to determining that the primary sound source identification is the ambient noise sound source identification; or
A third output unit configured to output the voice data to be processed after applying a preset constant gain, in response to determining that the primary sound source identifier is the ambient noise sound source identifier.
8. The apparatus of claim 7, wherein the arrival angle calculation unit is further configured to:
for each frame of voice data in the voice data to be processed collected by the microphone array, calculating an arrival angle corresponding to the frame of voice data and a corresponding arrival angle confidence coefficient; and
the model update unit is further configured to:
in response to determining that a preset environmental noise condition is met, determining the sound source identifier corresponding to the arrival angle corresponding to each frame of voice data in the voice data to be processed as the environmental noise sound source identifier, wherein the preset environmental noise condition includes at least one of: the average of the arrival angle confidences corresponding to the frames of the voice data to be processed is smaller than a preset confidence threshold; the ratio of the number of noise arrival angles to the total number of arrival angles is larger than a preset ratio threshold; and the variance of the arrival angles corresponding to the frames of the voice data to be processed is larger than a preset arrival angle variance threshold, wherein the number of noise arrival angles is the number of arrival angles, among those corresponding to the frames of the voice data to be processed, whose confidence is smaller than the preset confidence threshold, and the total number of arrival angles is the count of all arrival angles corresponding to those frames;
and in response to determining that the preset environmental noise condition is not met, estimating and updating the model parameters of the Gaussian mixture model based on the arrival angle corresponding to each frame of voice data in the voice data to be processed, correspondingly updating the target sound source identifier set, and determining the sound source identifier corresponding to each frame's arrival angle.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-5.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-5.
CN202010199908.5A 2020-03-20 2020-03-20 Voice processing method and device, electronic equipment and storage medium Active CN111383629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010199908.5A CN111383629B (en) 2020-03-20 2020-03-20 Voice processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010199908.5A CN111383629B (en) 2020-03-20 2020-03-20 Voice processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111383629A CN111383629A (en) 2020-07-07
CN111383629B true CN111383629B (en) 2022-03-29

Family

ID=71218785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010199908.5A Active CN111383629B (en) 2020-03-20 2020-03-20 Voice processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111383629B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562730A (en) * 2020-11-24 2021-03-26 北京华捷艾米科技有限公司 Sound source analysis method and system
CN112652320B (en) * 2020-12-04 2024-04-12 深圳地平线机器人科技有限公司 Sound source positioning method and device, computer readable storage medium and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5724125B2 (en) * 2011-03-30 2015-05-27 株式会社国際電気通信基礎技術研究所 Sound source localization device
CN108231085A (en) * 2016-12-14 2018-06-29 杭州海康威视数字技术股份有限公司 A kind of sound localization method and device
CN107172018A (en) * 2017-04-27 2017-09-15 华南理工大学 The vocal print cryptosecurity control method and system of activation type under common background noise
KR102236471B1 (en) * 2018-01-26 2021-04-05 서강대학교 산학협력단 A source localizer using a steering vector estimator based on an online complex Gaussian mixture model using recursive least squares
CN108831495B (en) * 2018-06-04 2022-11-29 桂林电子科技大学 Speech enhancement method applied to speech recognition in noise environment
CN110610718B (en) * 2018-06-15 2021-10-08 炬芯科技股份有限公司 Method and device for extracting expected sound source voice signal
CN110515034B (en) * 2019-08-26 2022-12-27 西安电子科技大学 Acoustic signal azimuth angle measurement system and method

Also Published As

Publication number Publication date
CN111383629A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
JP6889698B2 (en) Methods and devices for amplifying audio
US9626970B2 (en) Speaker identification using spatial information
CN111418012B (en) Method for processing an audio signal and audio processing device
CN110675887B (en) Multi-microphone switching method and system for conference system
WO2016176329A1 (en) Impulsive noise suppression
CN112306448A (en) Method, apparatus, device and medium for adjusting output audio according to environmental noise
CN111383629B (en) Voice processing method and device, electronic equipment and storage medium
CN109905808B (en) Method and apparatus for adjusting intelligent voice device
CN110611861B (en) Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
EP3320311B1 (en) Estimation of reverberant energy component from active audio source
CN111415653B (en) Method and device for recognizing speech
CN111341333B (en) Noise detection method, noise detection device, medium, and electronic apparatus
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN108962226B (en) Method and apparatus for detecting end point of voice
EP2745293B1 (en) Signal noise attenuation
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
Jeon et al. Acoustic surveillance of hazardous situations using nonnegative matrix factorization and hidden Markov model
CN111048096B (en) Voice signal processing method and device and terminal
CN111147655B (en) Model generation method and device
CN112309418A (en) Method and device for inhibiting wind noise
US20190027164A1 (en) System and method for voice activity detection and generation of characteristics respective thereof
CN111145769A (en) Audio processing method and device
CN111624554A (en) Sound source positioning method and device
CN111145776B (en) Audio processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 1211 Dongfang Science and Technology Building, No. 16 Keyuan Road, Science and Technology Park Community, Yuehai Street, Nanshan District, Shenzhen, Guangdong Province, 518000
Patentee after: Shenzhen weiai intelligent Co.,Ltd.
Address before: 518000 room 306, building 21, Xili Industrial Zone, No. 111, Xinguang Road, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
Patentee before: Shenzhen weiai intelligent Co.,Ltd.