CN110211599B - Application awakening method and device, storage medium and electronic equipment - Google Patents

Application awakening method and device, storage medium and electronic equipment

Info

Publication number
CN110211599B
CN110211599B (application CN201910478400.6A)
Authority
CN
China
Prior art keywords
audio data
preset
processor
filter coefficient
adaptive filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910478400.6A
Other languages
Chinese (zh)
Other versions
CN110211599A (en)
Inventor
陈喆
刘耀勇
陈岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910478400.6A
Publication of CN110211599A
Application granted
Publication of CN110211599B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of the present application disclose an application wake-up method and apparatus, a storage medium, and an electronic device. The electronic device includes two microphones, through which two channels of audio data can be acquired, together with the background audio data being played during acquisition. Echo cancellation processing is then performed on the two channels of audio data according to the background audio data to remove self-noise; the two echo-cancelled channels are then beamformed to remove external noise and obtain enhanced audio data. A two-stage verification is then performed on the text feature and the voiceprint feature of the enhanced audio data, and the voice interaction application is woken up when both stages pass, enabling voice interaction between the electronic device and the user. In this way, the present application removes the interference of both self-noise and external noise, uses the two-stage verification to ensure verification accuracy, and thereby improves the wake-up rate of the voice interaction application.

Description

Application awakening method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to an application wake-up method, apparatus, storage medium, and electronic device.
Background
Currently, with the development of voice recognition technology, an electronic device (such as a mobile phone or a tablet computer) can perform voice interaction with a user through a running voice interaction application. For example, the user may say "I want to listen to a song", and the voice interaction application recognizes the user's speech, identifies the intention to listen to a song, and plays one. It can be understood that the premise of voice interaction between the user and the electronic device is waking up the voice interaction application; however, in actual use environments various noises often exist, so the wake-up rate of the voice interaction application is low.
Disclosure of Invention
Embodiments of the present application provide an application wake-up method and apparatus, a storage medium, and an electronic device, which can improve the wake-up rate of a voice interaction application.
In a first aspect, an embodiment of the present application provides an application wake-up method, which is applied to an electronic device, where the electronic device includes two microphones, and the application wake-up method includes:
acquiring two paths of audio data through the two microphones, and acquiring background audio data played in an audio acquisition period;
performing echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
performing beam forming processing on the two paths of audio data after the echo cancellation to obtain enhanced audio data;
performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data, and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed;
and if the secondary verification passes, awakening the voice interactive application.
In a second aspect, an embodiment of the present application provides an application waking device, which is applied to an electronic device, where the electronic device includes two microphones, and the application waking device includes:
the audio acquisition module is used for acquiring two paths of audio data through the two microphones and acquiring background audio data played in an audio acquisition period;
the echo cancellation module is used for carrying out echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
the beam forming module is used for carrying out beam forming processing on the two paths of audio data after the echo cancellation to obtain enhanced audio data;
the audio verification module is used for performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed;
and the application awakening module is used for awakening the voice interaction application when the secondary verification is passed.
In a third aspect, the present application provides a storage medium, on which a computer program is stored, and when the computer program is run on an electronic device including two microphones, the electronic device is caused to execute the application wakeup method provided in the present application.
In a fourth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor, a memory, and two microphones, the memory stores a computer program, and the processor is configured to execute the application wake-up method provided in the embodiments of the present application by calling the computer program stored in the memory.
In the embodiment of the application, the electronic equipment comprises two microphones, and the two microphones can acquire two paths of audio data and acquire background audio data played in an audio acquisition period; then, echo cancellation processing is carried out on the two paths of audio data according to the background audio data so as to eliminate self-noise; then, performing beam forming processing on the two paths of audio data after echo cancellation to eliminate external noise and obtain enhanced audio data; secondly, performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data, and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed; and finally, if the secondary verification passes, awakening the voice interaction application so as to realize the voice interaction between the electronic equipment and the user. Therefore, the method and the device can eliminate the interference of self noise and external noise, ensure the verification accuracy by utilizing two-stage verification and achieve the purpose of improving the awakening rate of the voice interaction application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating an application wake-up method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of the arrangement positions of two microphones in the embodiment of the present application.
Fig. 3 is a schematic flow chart of training a voiceprint feature extraction model in the embodiment of the present application.
Fig. 4 is a schematic diagram of a spectrogram extracted in the example of the present application.
Fig. 5 is another flowchart of an application wake-up method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an application wake-up apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Fig. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.
The embodiment of the present application first provides an application wake-up method, where an execution main body of the application wake-up method may be an electronic device provided in the embodiment of the present application, the electronic device includes two microphones, and the electronic device may be a device with processing capability and configured with a processor, such as a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
Referring to fig. 1, fig. 1 is a flowchart illustrating an application wake-up method according to an embodiment of the present disclosure. The application wake-up method is applied to the electronic device provided by the present application, where the electronic device includes two microphones, as shown in fig. 1, a flow of the application wake-up method provided by the embodiment of the present application may be as follows:
in 101, two paths of audio data are acquired by two microphones, and background audio data played during audio acquisition is acquired.
For example, two microphones included in the electronic device are arranged back to back and separated by a preset distance, where the arrangement of the two microphones back to back means that sound pickup holes of the two microphones face opposite directions. For example, referring to fig. 2, the electronic device includes two microphones, which are a microphone 1 disposed on a lower side of the electronic device and a microphone 2 disposed on an upper side of the electronic device, respectively, wherein a sound-collecting hole of the microphone 1 faces downward, a sound-collecting hole of the microphone 2 faces upward, and a connection line between the microphone 2 and the microphone 1 is parallel to left/right sides of the electronic device. Furthermore, the two microphones included in the electronic device may be non-directional microphones (or, omni-directional microphones).
In the embodiment of the application, the electronic device can collect sound through the two back-to-back microphones while playing audio or video, thereby collecting two channels of audio data of the same duration. In addition, the electronic device also obtains the audio data played during audio acquisition, which may be independent audio data, such as a played audio file or song, or audio data attached to video data. It should be noted that, to distinguish the audio data obtained by sound acquisition from the audio data played during audio acquisition, the audio data played during audio acquisition is referred to as background audio data in the present application.
At 102, echo cancellation processing is performed on the two paths of audio data according to the background audio data, so as to obtain two paths of audio data after echo cancellation.
It should be noted that, during playing audio and video, the electronic device performs sound collection through two microphones, and will collect and obtain the sound of the playing background audio data, that is, echo (or self-noise). In the application, in order to eliminate echoes in the two collected audio data, echo cancellation processing is further performed on the two audio data by using an echo cancellation algorithm according to background audio data so as to eliminate echoes in the two audio data, and the two audio data after echo cancellation are obtained. It should be noted that, in the embodiment of the present application, there is no particular limitation on what echo cancellation algorithm is used, and a person skilled in the art may select the echo cancellation algorithm according to actual needs.
For example, the electronic device may perform anti-phase processing on the background audio data to obtain anti-phase background audio data, and then superimpose the anti-phase background audio data with the two paths of audio data respectively to eliminate echoes in the two paths of audio data, so as to obtain two paths of audio data after echo cancellation.
Put plainly, the echo cancellation processing above removes the self-noise carried in the audio data.
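The inverse-phase superposition example above can be sketched in a few lines. This is a minimal illustration, assuming the background audio and the microphone channels are already time-aligned, sampled at the same rate, and of equal length, and that the echo path has unit gain (real devices do not satisfy this, which is why the adaptive filter described below is used); all names are illustrative:

```python
import numpy as np

def cancel_echo_by_inversion(mic_channels, background):
    """Superimpose anti-phase background audio onto each microphone channel.

    mic_channels: list of 1-D numpy arrays, one per microphone.
    background:   1-D numpy array of the background audio being played.
    """
    inverted = -background                 # anti-phase background audio data
    return [channel + inverted for channel in mic_channels]
```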
In 103, the two paths of audio data after echo cancellation are processed by beamforming to obtain enhanced audio data.
After completing echo cancellation processing on the two paths of audio data and obtaining the two paths of audio data after echo cancellation, the electronic device further performs beam forming processing on the two paths of audio data after echo cancellation to obtain a path of audio data with a higher signal-to-noise ratio, and the audio data is recorded as enhanced audio data.
In colloquial terms, the beamforming process performed above eliminates external noise carried in the audio data. Therefore, the electronic device obtains the enhanced audio data with self-noise and external noise removed through echo cancellation processing and beam forming processing of the two paths of acquired audio data.
At 104, a primary check is performed on the text features and the voiceprint features of the enhanced audio data, and a secondary check is performed on the text features and the voiceprint features of the enhanced audio data after the primary check is passed.
As described above, compared with the two channels of originally collected audio data, the enhanced audio data has the self-noise and external noise removed and thus a higher signal-to-noise ratio. At this point, the electronic device further performs the two-stage verification on the text feature and the voiceprint feature of the enhanced audio data: it performs the primary verification based on the first wake-up algorithm, and if the primary verification passes, it performs the secondary verification based on the second wake-up algorithm.
It should be noted that, in the embodiment of the present application, both the primary verification and the secondary verification of the text feature and the voiceprint feature of the enhanced audio data verify whether the enhanced audio data includes the preset wake-up word spoken by a preset user (for example, the owner of the electronic device, or another user whom the owner has authorized to use the electronic device). If the enhanced audio data includes the preset wake-up word spoken by the preset user, the verification of the text feature and the voiceprint feature passes; otherwise it fails. For example, if the enhanced audio data includes the preset wake-up word and the word was spoken by the preset user, the verification passes. Conversely, if the enhanced audio data includes the preset wake-up word spoken by a user other than the preset user, or does not include the preset wake-up word at all, the verification fails.
In addition, it should be further noted that, in the embodiment of the present application, the first wake-up algorithm and the second wake-up algorithm adopted by the electronic device are different. For example, the first voice wake-up algorithm is a voice wake-up algorithm based on a gaussian mixture model, and the second voice wake-up algorithm is a voice wake-up algorithm based on a neural network.
In 105, if the secondary check passes, the voice interaction application is awakened.
The voice interaction application is what is commonly called a voice assistant, such as OPPO's voice assistant "Xiao Ou".
Based on the above description, it can be understood by those skilled in the art that when the secondary verification of the enhanced audio data passes, it indicates that a preset user currently speaks a preset wake-up word, and at this time, the voice interaction application is woken up, so as to implement voice interaction between the electronic device and the user.
As can be seen from the above, in the embodiment of the application, the electronic device may acquire two paths of audio data through two microphones, and acquire background audio data played during audio acquisition; then, echo cancellation processing is carried out on the two paths of audio data according to the background audio data so as to eliminate self-noise; then, performing beam forming processing on the two paths of audio data after echo cancellation to eliminate external noise and obtain enhanced audio data; secondly, performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data, and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed; and finally, if the secondary verification passes, awakening the voice interaction application so as to realize the voice interaction between the electronic equipment and the user. Therefore, the method and the device can eliminate the interference of self noise and external noise, ensure the verification accuracy by utilizing two-stage verification and achieve the purpose of improving the awakening rate of the voice interaction application.
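The overall flow of steps 101 to 105 can be summarized in a structural sketch. The processing steps are passed in as callables because each one stands for a block described in the embodiments below; none of this is the patent's literal implementation:

```python
def wake_up_pipeline(mic1, mic2, background,
                     cancel_echo, beamform, primary_verify, secondary_verify,
                     wake):
    """Structural sketch of fig. 1: echo cancellation (102), beamforming
    (103), two-stage verification (104) and application wake-up (105)."""
    ch1 = cancel_echo(mic1, background)          # 102: remove self-noise
    ch2 = cancel_echo(mic2, background)
    enhanced = beamform(ch1, ch2)                # 103: remove external noise
    if primary_verify(enhanced) and secondary_verify(enhanced):   # 104
        wake()                                   # 105: wake the application
```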
In one embodiment, "echo cancellation processing two audio data according to background audio data" includes:
(1) obtaining an initial adaptive filter coefficient, and iteratively updating the initial adaptive filter coefficient according to background audio data and audio data to obtain a target adaptive filter coefficient;
(2) and performing echo cancellation processing on the audio data according to the target self-adaptive filter coefficient.
In the embodiment of the present application, when the electronic device performs echo cancellation processing on two paths of audio data according to background audio data, the following description will take echo cancellation processing on one path of audio data as an example.
The electronic equipment firstly acquires an initial adaptive filter coefficient, and then iteratively updates the initial adaptive filter coefficient according to background audio data and one path of audio data to obtain a target adaptive filter coefficient. Then, the electronic device estimates echo audio data carried in the audio data according to the target adaptive filter coefficient obtained by iterative update, so as to eliminate the echo audio data carried in the audio data, and complete echo cancellation processing on the audio data, as shown in the following formula:
X' = X - W^T · X;
where X' denotes the audio data after echo cancellation, X denotes the audio data before echo cancellation, W denotes the target adaptive filter coefficient, and T denotes transposition.
In one embodiment, "iteratively updating the initial adaptive filter coefficients according to the background audio data and the audio data to obtain the target adaptive filter coefficients" includes:
(1) obtaining the self-adaptive filter coefficient at the current moment according to the initial self-adaptive filter coefficient;
(2) estimating echo audio data carried in the audio data and corresponding to the current moment according to the coefficient of the adaptive filter at the current moment;
(3) acquiring error audio data at the current moment according to the background audio data and the echo audio data obtained by estimation;
(4) identifying the active part of the adaptive filter coefficient at the current time, updating the active part according to the error audio data at the current time, and adjusting the order of the adaptive filter coefficient at the current time to obtain the adaptive filter coefficient at the next time.
How to iteratively update the initial adaptive filter coefficients is described in one update process below.
The current time is not specific to a certain time, but refers to a time when the initial adaptive filter coefficient is updated once.
Taking the first update of the initial adaptive filter coefficient as an example, the electronic device takes the initial adaptive filter coefficient as the adaptive filter coefficient at the current time k. For example, the adaptive filter coefficient obtained at the current time k is W(k) = [w_0, w_1, w_2, ..., w_{L-1}]^T, with length L.
Then, according to the adaptive filter coefficient at the current time k, the electronic device estimates the echo audio data carried in the audio data at the current time, as shown in the following formula:

ŷ(k) = W(k)^T · x(k);

where ŷ(k) represents the estimated echo audio data corresponding to the current time k, and x(k) represents the portion of the audio data corresponding to the current time k.

Then, the electronic device obtains the error audio data at the current time k according to the portion of the background audio data corresponding to the current time k and the estimated echo audio data, as shown in the following formula:

e(k) = r(k) - ŷ(k);

where e(k) represents the error audio data at the current time k, and r(k) represents the portion of the background audio data corresponding to the current time k.
It should be noted that a larger filter order increases computational complexity, while a smaller filter order cannot model the full echo path, so the echo does not fully converge. In this application, most of the adaptive filter coefficients are 0 and only a small part plays a role in the iterative update, so only the active part of the adaptive filter needs to be iteratively updated, and the order of the adaptive filter can be adjusted in real time.
Correspondingly, in this embodiment of the present application, after acquiring the error audio data at the current time, the electronic device further identifies an active part of the adaptive filter coefficient at the current time k, so as to update the active part of the adaptive filter coefficient at the current time according to the error audio data at the current time, as shown in the following formula:
W(k+1) = W(k) + u·x(k)·e(k);

where u represents a preset convergence step size, which can be set by a person skilled in the art according to actual needs and is not specifically limited in the embodiments of the present application. It should be emphasized that when the adaptive filter coefficient W(k) at the current time k is updated, only its active part is updated. For example, if W(k) = [w_0, w_1, w_2, ..., w_{L-1}]^T and [w_0, w_1, w_2, ..., w_{L-3}] is determined to be the active part, the electronic device updates only [w_0, w_1, w_2, ..., w_{L-3}] according to the formula above.
In addition, the electronic device adjusts the order of the adaptive filter coefficient at the current time according to the identified active portion, thereby obtaining an adaptive filter coefficient W (k +1) at the next time.
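One iteration of the update described above can be sketched as follows, keeping the patent's notation (x(k) is the audio-data portion at time k, r(k) the background-audio portion, u the preset convergence step size); the function name and default values are illustrative:

```python
import numpy as np

def lms_iteration(W, x_k, r_k, u=0.01, active_len=None):
    """One update of the adaptive filter coefficient.

    W          : adaptive filter coefficient at time k, shape (L,)
    x_k        : the last L samples of the audio data at time k, shape (L,)
    r_k        : the background-audio sample corresponding to time k
    active_len : number of taps in the active part; only these are updated
    """
    y_hat = W @ x_k                     # estimated echo: y^(k) = W(k)^T x(k)
    e = r_k - y_hat                     # error audio data: e(k) = r(k) - y^(k)
    if active_len is None:
        active_len = len(W)
    W_next = W.copy()
    W_next[:active_len] += u * x_k[:active_len] * e   # update active part only
    return W_next, e
```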
In one embodiment, "identifying an active portion of adaptive filter coefficients for a current time instant" includes:
(1) dividing the adaptive filter coefficient at the current moment into a plurality of sub-filter coefficients with equal length;
(2) acquiring the average value and the variance of each sub-filter coefficient from back to front, and determining, as the active part, the first sub-filter coefficient whose average value is greater than a preset average value and whose variance is greater than a preset variance, together with all sub-filter coefficients before it;
adjusting the order of the adaptive filter coefficient at the current time comprises:
(3) judging whether the first sub-filter coefficient is the last sub-filter coefficient; if so, increasing the order of the adaptive filter coefficient at the current time, and otherwise decreasing the order of the adaptive filter coefficient at the current time.
In the embodiment of the present application, when identifying the active part of the adaptive filter coefficient at the current time, the electronic device first divides the adaptive filter coefficient at the current time into a plurality of sub-filter coefficients of equal length (the length being greater than 1). For example, the electronic device divides the adaptive filter coefficient W = [w_0, w_1, w_2, ..., w_{L-1}]^T at the current time into M sub-filter coefficients of equal length L/M, so that the m-th sub-filter coefficient is W_m = [w_{mL/M}, w_{mL/M+1}, w_{mL/M+2}, ..., w_{(m+1)L/M-1}]^T, where m takes values in [0, M-1].
Then, the electronic device obtains the average value and the variance of each sub-filter coefficient from back to front: first the average value and variance of the M-th sub-filter coefficient, then those of the (M-1)-th sub-filter coefficient, and so on, until the first sub-filter coefficient whose average value is greater than the preset average value and whose variance is greater than the preset variance is found. That sub-filter coefficient and the sub-filter coefficients before it are determined as the active part of the adaptive filter coefficient at the current moment.
The preset average value and the preset variance may be obtained by a person skilled in the art through experimental adjustment, which is not specifically limited in the embodiment of the present application, for example, in the embodiment of the present application, the preset average value may be 0.000065, and the preset variance may be 0.003.
In addition, when the order of the adaptive filter coefficient at the current time is adjusted, the electronic device may determine whether the first sub-filter coefficient is the last sub-filter coefficient, if so, it indicates that the order of the adaptive filter coefficient at the current time is insufficient, and increase the order of the adaptive filter coefficient at the current time, otherwise, it indicates that the order of the adaptive filter coefficient at the current time is sufficient, and may decrease the order of the adaptive filter coefficient at the current time.
In this embodiment, for the variation of increasing or decreasing the order, a person skilled in the art can take an empirical value according to actual needs, and the embodiment of the present application does not specifically limit this.
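The active-part identification and order adjustment just described might look like the following sketch. The thresholds reuse the illustrative figures from the text (preset average 0.000065, preset variance 0.003); the grow/shrink step of 64 taps is an assumption, since the text leaves the order-change amount to an empirical value:

```python
import numpy as np

def identify_active_and_adjust(W, M, mean_thr=0.000065, var_thr=0.003, step=64):
    """Split W into M equal-length sub-filter coefficients, scan from back
    to front for the first one whose mean and variance exceed the presets,
    then grow or shrink the filter order accordingly."""
    subs = np.array_split(W, M)                 # assumes len(W) divisible by M
    first_active = 0
    for m in range(M - 1, -1, -1):              # from back to front
        if subs[m].mean() > mean_thr and subs[m].var() > var_thr:
            first_active = m
            break
    active_len = (first_active + 1) * (len(W) // M)
    if first_active == M - 1:                   # last sub-filter is active:
        W = np.concatenate([W, np.zeros(step)]) # order insufficient, increase
    else:                                       # order sufficient, decrease
        W = W[:max(active_len, len(W) - step)]
    return W, active_len
```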
In an embodiment, the "performing beamforming processing on the two paths of audio data after echo cancellation to obtain enhanced audio data" includes:
using a preset beamforming algorithm, performing beamforming processing on the two echo-cancelled channels of audio data at each of a plurality of preset angles, to obtain a plurality of enhanced audio data.
In the embodiment of the present application, a plurality of preset angles are configured for the microphones of the electronic device. For example, during previous voice interactions with users, the electronic device counts the incoming-wave angles of user speech and obtains the incoming-wave angles whose usage probability reaches a preset probability, taking these as the plurality of preset angles.
The electronic device can then use the preset beamforming algorithm to perform beamforming processing on the two echo-cancelled channels of audio data at each of the preset angles, obtaining a plurality of enhanced audio data.
For example, assume that 3 preset angles are provided, namely θ1, θ2 and θ3, and that the GSC (generalized sidelobe canceller) algorithm is used for the beamforming processing. Since the GSC algorithm needs the beamforming angle to be estimated in advance, the electronic device takes θ1, θ2 and θ3 as the beamforming angles estimated for the GSC algorithm, and performs beamforming processing at θ1, θ2 and θ3 respectively, obtaining 3 channels of enhanced audio data.
As described above, in the embodiment of the present application, preset angles are used in place of estimated beamforming angles, so no time-consuming angle estimation is required and the overall efficiency of beamforming is improved.
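A full GSC implementation is beyond a short example, but the idea of steering toward a preset angle can be illustrated with a simplified two-microphone delay-and-sum sketch. The geometry (far-field source, microphones a known distance apart), the 16 kHz sample rate, and the example angles are all assumptions of this sketch, not the patent's parameters:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(ch1, ch2, mic_distance_m, angle_deg, sample_rate=16000):
    """Compensate the inter-microphone arrival delay for a preset angle,
    then average the two channels. np.roll wraps samples around, which is
    acceptable only for this illustration."""
    delay_s = mic_distance_m * np.cos(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    shift = int(round(delay_s * sample_rate))
    return 0.5 * (ch1 + np.roll(ch2, -shift))

# one enhanced channel per preset angle, e.g. three illustrative angles:
# enhanced = [delay_and_sum(ch1, ch2, 0.15, a) for a in (30.0, 90.0, 150.0)]
```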
In one embodiment, "performing a primary check on text features and voiceprint features of the enhanced audio data" includes:
(1) extracting Mel frequency cepstrum coefficients of the enhanced audio data corresponding to each preset angle;
(2) calling a target voiceprint characteristic model related to a preset text to match the extracted mel frequency cepstrum coefficients;
(3) if matched Mel-frequency cepstrum coefficients exist, judging that the primary verification passes;
the target voiceprint feature model is obtained by a Gaussian mixture general background model related to a preset text in a self-adaptive mode according to the Mel frequency cepstrum coefficient of the preset audio data, and the preset audio data are audio data of a preset text spoken by a preset user.
The first-order wake-up algorithm is explained below.
It should be noted that, in the embodiment of the present application, a gaussian mixture general background model related to a preset text is trained in advance. The preset text is the above mentioned preset wake-up word. For example, audio data of a plurality of people (e.g., 200 people) who speak a preset wake-up word may be collected in advance, mel-frequency cepstrum coefficients of the audio data are extracted respectively, and a gaussian mixture general background model related to a preset text (i.e., the preset wake-up word) is obtained through training according to the mel-frequency cepstrum coefficients of the audio data.
Then, the Gaussian mixture general background model is further trained: adaptive processing (for example, maximum a posteriori (MAP) or maximum likelihood linear regression (MLLR) adaptation) is performed on the Gaussian mixture general background model according to the Mel-frequency cepstrum coefficients of preset audio data, where the preset audio data is audio data of the preset user speaking the preset text (i.e., the preset wake-up word). In this way, each Gaussian component of the Gaussian mixture general background model is shifted toward the Mel-frequency cepstrum coefficients of the preset user, so the model carries the voiceprint features of the preset user. The Gaussian mixture general background model carrying the voiceprint features of the preset user is recorded as the target voiceprint feature model.
Therefore, when performing the primary verification on the text feature and the voiceprint feature of the enhanced audio data, the electronic device extracts the Mel-frequency cepstrum coefficients of the enhanced audio data corresponding to each preset angle, and then calls the target voiceprint feature model related to the preset text to match each set of extracted coefficients: the extracted Mel-frequency cepstrum coefficients are input into the target voiceprint feature model, which recognizes them and outputs a score. When the output score reaches a preset threshold, the input Mel-frequency cepstrum coefficients are judged to match the target voiceprint feature model; otherwise they do not match. For example, in the embodiment of the present application, the output score of the target voiceprint feature model lies in the interval [0,1] and the preset threshold is configured as 0.28; that is, when the score corresponding to the input Mel-frequency cepstrum coefficients reaches 0.28, the electronic device determines that they match the target voiceprint feature model.
After the electronic equipment calls a target voiceprint feature model related to the preset text to match the extracted mel frequency cepstrum coefficients, if the matched mel frequency cepstrum coefficients exist, the electronic equipment judges that the primary check is passed.
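The primary verification loop might be sketched as below. The MFCC extraction uses librosa; the target voiceprint feature model is assumed to expose a score(features) method whose output is already normalized to the [0,1] interval mentioned above (how a GMM log-likelihood is mapped to that interval is model-specific and not shown here):

```python
import librosa

def primary_verification(enhanced_signals, target_model, sr=16000,
                         threshold=0.28):
    """Pass if the MFCCs of any preset-angle enhanced signal match the
    text-related target voiceprint feature model."""
    for signal in enhanced_signals:
        # one row per analysis frame, one column per cepstrum coefficient
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T
        if target_model.score(mfcc) >= threshold:
            return True
    return False
```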
In one embodiment, "performing a secondary check on text features and voiceprint features of enhanced audio data" includes:
(1) dividing the enhanced audio data corresponding to the preset angle into a plurality of sub audio data;
(2) extracting a voiceprint characteristic vector of each sub audio data according to a voiceprint characteristic extraction model related to a preset text;
(3) acquiring similarity between each voiceprint feature vector and a target voiceprint feature vector, wherein the target voiceprint feature vector is a voiceprint feature vector of preset audio data;
(4) according to the similarity corresponding to each sub audio data, verifying the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to the preset angle;
(5) if there is enhanced audio data corresponding to a preset angle that passes the verification, judging that the secondary verification passes.
The secondary wake-up algorithm is explained below.
In the embodiment of the present application, it is considered that the enhanced audio data may contain more than just the preset wake-up word: for example, if the preset wake-up word is "Xiao Ou Xiao Ou", the enhanced audio data may be "Hello, Xiao Ou Xiao Ou". Therefore, in the embodiment of the present application, the voice part is divided into a plurality of sub audio data according to the length of the preset wake-up word, where the length of each sub audio data is greater than or equal to the length of the preset wake-up word and two adjacent sub audio data have an overlapping part. The length of the overlapping part may be set by a person skilled in the art according to actual needs; for example, in the embodiment of the present application it is set to 25% of the length of the sub audio data.
It should be noted that in the embodiment of the present application, a voiceprint feature extraction model related to the preset text (i.e., the preset wake-up word) is also trained in advance. For example, in the embodiment of the present application a voiceprint feature extraction model based on a convolutional neural network is trained, as shown in fig. 3: audio data of multiple persons (for example, 200 persons) speaking the preset wake-up word is collected in advance; endpoint detection is performed on the audio data and the preset wake-up word part is segmented out; the segmented part is preprocessed (for example, high-pass filtered) and windowed, then Fourier transformed (for example, by short-time Fourier transform); the energy density is then calculated to generate a gray-scale spectrogram (as shown in fig. 4, where the horizontal axis represents time, the vertical axis represents frequency, and the gray level represents the energy value); finally, a convolutional neural network is trained on the generated spectrograms to produce the voiceprint feature extraction model related to the preset text. In addition, in the embodiment of the present application, the spectrogram of the audio data of the preset user speaking the preset wake-up word (i.e., the preset text) is extracted and input into the trained voiceprint feature extraction model; after passing through its convolution layers, pooling layers and fully connected layers, a corresponding group of feature vectors is output and recorded as the target voiceprint feature vector.
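The spectrogram pipeline described above (high-pass preprocessing, windowing, short-time Fourier transform, energy density mapped to gray levels) might be sketched as follows; the cutoff frequency, window length, and overlap are illustrative choices, not the patent's parameters:

```python
import numpy as np
from scipy.signal import butter, lfilter, stft

def gray_spectrogram(audio, sr=16000):
    """Gray-scale spectrogram: rows are frequency, columns are time,
    values in [0, 1] represent energy (cf. fig. 4)."""
    b, a = butter(4, 100 / (sr / 2), btype="highpass")   # preprocessing
    filtered = lfilter(b, a, audio)
    _, _, Z = stft(filtered, fs=sr, window="hann",
                   nperseg=400, noverlap=240)             # windowed STFT
    energy = np.abs(Z) ** 2                               # energy density
    db = 10.0 * np.log10(energy + 1e-10)
    return (db - db.min()) / (db.max() - db.min())        # map to gray levels
```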
Correspondingly, after the electronic device divides the enhanced audio data corresponding to the preset angle into a plurality of sub audio data, the spectrogram of each sub audio data is respectively extracted. For how to extract the spectrogram, details are not repeated here, and specific reference may be made to the above related description. After extracting the spectrogram of the sub-audio data, the electronic device inputs the spectrogram of the sub-audio data into a previously trained voiceprint feature extraction model, so as to extract a voiceprint feature vector of each sub-audio data.
After extracting the voiceprint feature vectors of the sub audio data, the electronic device obtains the similarity between the voiceprint feature vector of each sub audio data and the target voiceprint feature vector, and then verifies the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle according to the similarities. For example, the electronic device may determine whether there is sub audio data whose voiceprint feature vector reaches a preset similarity with the target voiceprint feature vector (an empirical value may be chosen by a person skilled in the art according to actual needs, for example 75%); if there is, the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle are determined to pass the verification.
After the electronic device completes the verification of the text features and voiceprint features of the enhanced audio data corresponding to each preset angle, if there is enhanced audio data corresponding to any preset angle that passes the verification, the electronic device judges that the secondary verification passes.
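The segmentation and per-segment similarity check can be sketched as below, using the 25% overlap and 75% preset similarity mentioned in the text; embed() stands for the CNN voiceprint feature extraction model and is an assumption, as is the use of cosine similarity (the text also mentions DTW and Euclidean distance):

```python
import numpy as np

def split_with_overlap(audio, seg_len, overlap=0.25):
    """Split audio into sub audio data of seg_len samples, with adjacent
    segments overlapping by the given fraction."""
    step = int(seg_len * (1 - overlap))
    return [audio[i:i + seg_len]
            for i in range(0, len(audio) - seg_len + 1, step)]

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def secondary_verification(enhanced, seg_len, embed, target_vec,
                           preset_similarity=0.75):
    """Pass if any sub audio data's voiceprint feature vector reaches the
    preset similarity with the target voiceprint feature vector."""
    return any(cosine_similarity(embed(seg), target_vec) >= preset_similarity
               for seg in split_with_overlap(enhanced, seg_len))
```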
In an embodiment, the verifying the text feature and the voiceprint feature of the enhanced audio data corresponding to the predetermined angle according to the similarity corresponding to each sub-audio data includes:
according to the similarity corresponding to each sub audio data and a preset identification function, verifying the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to the preset angle;
wherein the preset recognition function is γn = γn-1 + f(ln), where γn represents the recognition function state value corresponding to the n-th sub audio data and γn-1 represents the recognition function state value corresponding to the (n-1)-th sub audio data, with

f(ln) = a, if ln ≥ b; f(ln) = -a, otherwise;

where a is a correction value of the recognition function, b is the preset similarity, and ln is the similarity between the voiceprint feature vector of the n-th sub audio data and the target voiceprint feature vector. If there is a γn greater than the preset recognition function state value, it is judged that the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle pass verification.
It should be noted that the value of a in the recognition function can be an empirical value according to actual needs by those skilled in the art, for example, a can be set to 1.
In addition, the value of b in the recognition function is positively correlated with the recognition rate of the voiceprint feature extraction model, and the value of b is determined according to the recognition rate of the voiceprint feature extraction model obtained through actual training.
In addition, the preset recognition function state value can also be obtained by a person skilled in the art according to actual needs, and the higher the value is, the higher the accuracy of verification on the voice part is.
Therefore, through the recognition function, the enhanced audio data can be verified accurately even when it contains information other than the preset wake-up word.
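A direct transcription of the recognition function might look like this; a = 1 follows the example in the text, while b and the preset state value are illustrative:

```python
def recognition_function_passes(similarities, a=1.0, b=0.75,
                                preset_state_value=2.0):
    """Accumulate gamma_n = gamma_{n-1} + f(l_n) over the sub audio data
    similarities, with f(l_n) = a when l_n >= b and -a otherwise; pass as
    soon as some gamma_n exceeds the preset recognition function state value."""
    gamma = 0.0
    for l_n in similarities:
        gamma += a if l_n >= b else -a
        if gamma > preset_state_value:
            return True
    return False
```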
Optionally, when obtaining the similarity between the voiceprint feature vector of each sub audio data and the target voiceprint feature vector, the similarity may be calculated according to a dynamic time warping algorithm.
Or, a feature distance between the voiceprint feature vector of each sub-audio data and the target voiceprint feature vector may be calculated as a similarity, and as to what feature distance is used to measure the similarity between the two vectors, no specific limitation is imposed in this embodiment of the application, for example, an euclidean distance may be used to measure the similarity between the voiceprint feature vector of the sub-audio data and the target voiceprint feature vector.
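For the dynamic time warping option, a plain O(n·m) DTW distance over two feature sequences can serve as the dissimilarity; treating the voiceprint features as frame-level sequences is an assumption of this sketch:

```python
import numpy as np

def dtw_distance(seq1, seq2):
    """Dynamic-time-warping distance between two feature sequences
    (one row per time step), with Euclidean distance as the step cost."""
    n, m = len(seq1), len(seq2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(seq1[i - 1]) -
                                  np.asarray(seq2[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```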
Fig. 5 is another flowchart of an application wake-up method according to an embodiment of the present application. The application wake-up method is applied to the electronic device provided by the present application, where the electronic device includes two microphones, as shown in fig. 5, a flow of the application wake-up method provided by the embodiment of the present application may be as follows:
in 201, the electronic device determines whether the electronic device is in an audio/video playing state based on a processor, if so, the electronic device proceeds to 202, and if not, the electronic device proceeds to 206.
In the embodiment of the application, the electronic device firstly judges whether the electronic device is in the audio and video playing state based on the processor, for example, taking an android system as an example, the electronic device receives an android internal message based on the processor, and judges whether the electronic device is in the audio and video playing state according to the android internal message.
In 202, the electronic device acquires two paths of audio data through two microphones, and acquires background audio data played during audio acquisition.
For example, two microphones included in the electronic device are arranged back to back and separated by a preset distance, where the arrangement of the two microphones back to back means that sound pickup holes of the two microphones face opposite directions. For example, referring to fig. 2, the electronic device includes two microphones, which are a microphone 1 disposed on a lower side of the electronic device and a microphone 2 disposed on an upper side of the electronic device, respectively, wherein a sound-collecting hole of the microphone 1 faces downward, a sound-collecting hole of the microphone 2 faces upward, and a connection line between the microphone 2 and the microphone 1 is parallel to left/right sides of the electronic device. Furthermore, the two microphones included in the electronic device may be non-directional microphones (or, omni-directional microphones).
In the embodiment of the application, the electronic device can collect sound through the two back-to-back microphones while playing audio or video, thereby collecting two channels of audio data of the same duration. In addition, the electronic device also obtains the audio data played during audio acquisition, which may be independent audio data, such as a played audio file or song, or audio data attached to video data. It should be noted that, to distinguish the audio data obtained by sound acquisition from the audio data played during audio acquisition, the audio data played during audio acquisition is referred to as background audio data in the present application.
At 203, the electronic device performs echo cancellation processing on the two paths of audio data based on the processor according to the background audio data to obtain two paths of audio data after echo cancellation.
It should be noted that, during playing audio and video, the electronic device performs sound collection through two microphones, and will collect and obtain the sound of the playing background audio data, that is, echo (or self-noise). In the application, in order to eliminate echoes in the two collected audio data, an echo cancellation algorithm is called based on the processor to perform echo cancellation processing on the two audio data further according to background audio data so as to eliminate echoes in the two audio data and obtain two audio data after echo cancellation. It should be noted that, in the embodiment of the present application, there is no particular limitation on what echo cancellation algorithm is used, and a person skilled in the art may select the echo cancellation algorithm according to actual needs.
For example, the electronic device may perform anti-phase processing on the background audio data based on the processor to obtain anti-phase background audio data, and then superimpose the anti-phase background audio data with the two paths of audio data respectively to eliminate echoes in the two paths of audio data, so as to obtain two paths of audio data after echo cancellation.
Put plainly, the echo cancellation processing above removes the self-noise carried in the audio data.
At 204, the electronic device performs beamforming processing on the two paths of audio data after echo cancellation based on the processor, so as to obtain enhanced audio data.
After the electronic device completes echo cancellation processing on the two paths of audio data to obtain two paths of audio data after echo cancellation, the electronic device further performs beam forming processing on the two paths of audio data after echo cancellation based on the processor to obtain one path of audio data with a higher signal-to-noise ratio, and the audio data is recorded as enhanced audio data.
In colloquial terms, the beamforming process performed above eliminates external noise carried in the audio data. Therefore, the electronic device obtains the enhanced audio data with self-noise and external noise removed through echo cancellation processing and beam forming processing of the two paths of acquired audio data.
In 205, the electronic device performs a primary check on the text feature and the voiceprint feature of the enhanced audio data based on the processor, performs a secondary check on the text feature and the voiceprint feature of the enhanced audio data based on the processor after the primary check is passed, and wakes up the voice interaction application based on the processor if the secondary check is passed.
As described above, compared with the two channels of originally collected audio data, the enhanced audio data has the self-noise and external noise removed and thus a higher signal-to-noise ratio. At this point, the electronic device further performs the two-stage verification on the text feature and the voiceprint feature of the enhanced audio data based on the processor: the processor calls the first wake-up algorithm for the primary verification, and if the primary verification passes, calls the second wake-up algorithm for the secondary verification.
It should be noted that, in the embodiment of the present application, both the primary verification and the secondary verification of the text feature and the voiceprint feature of the enhanced audio data verify whether the enhanced audio data includes the preset wake-up word spoken by a preset user (for example, the owner of the electronic device, or another user whom the owner has authorized to use the electronic device). If the enhanced audio data includes the preset wake-up word spoken by the preset user, the verification of the text feature and the voiceprint feature passes; otherwise it fails. For example, if the enhanced audio data includes the preset wake-up word and the word was spoken by the preset user, the verification passes. Conversely, if the enhanced audio data includes the preset wake-up word spoken by a user other than the preset user, or does not include the preset wake-up word at all, the verification fails.
In addition, it should be further noted that, in the embodiment of the present application, the first wake-up algorithm and the second wake-up algorithm adopted by the electronic device are different. For example, the first voice wake-up algorithm is a voice wake-up algorithm based on a gaussian mixture model, and the second voice wake-up algorithm is a voice wake-up algorithm based on a neural network.
At 206, the electronic device acquires a channel of audio data through any of the microphones.
When the electronic equipment does not play audio and video, sound collection is carried out through any microphone, and one path of audio data is obtained.
In 207, the electronic device performs a primary verification on the one path of audio data based on the dedicated voice recognition chip, and performs a secondary verification on the one path of audio data based on the processor after the primary verification passes.
The dedicated voice recognition chip is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice, an application specific integrated circuit chip designed for voice, and the like, and has lower power consumption than a general-purpose processor.
After the electronic device acquires the one path of audio data, it calls a third wake-up algorithm based on the dedicated voice recognition chip to verify the audio data; the third wake-up algorithm may verify both the text feature and the voiceprint feature of the audio data, or may verify only the text feature.
For example, the electronic device may extract the mel-frequency cepstrum coefficient of the aforementioned audio data based on a dedicated speech recognition chip; then, calling a Gaussian mixture general background model related to a preset text based on a special voice recognition chip to match the extracted Mel frequency cepstrum coefficient; if the matching is successful, the text characteristic check of the path of audio data is judged to be passed.
After the primary verification of the one path of audio data passes, the electronic device further performs a secondary verification on it based on the processor, wherein the processor calls the first wake-up algorithm or the second wake-up algorithm to verify the text feature and the voiceprint feature of the one path of audio data.
At 208, if the secondary check passes, the electronic device wakes up the voice interaction application based on the processor.
When the secondary verification of the audio data passes, the electronic equipment can wake up the voice interaction application based on the processor, and the voice interaction between the electronic equipment and the user is realized.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an application wake-up apparatus according to an embodiment of the present application. The application wake-up apparatus can be applied to an electronic device that comprises two microphones. The application wake-up apparatus may include an audio acquisition module 401, an echo cancellation module 402, a beamforming module 403, an audio verification module 404, and an application wake-up module 405, wherein,
the audio acquisition module 401 is configured to acquire two paths of audio data through two microphones and acquire background audio data played during audio acquisition;
the echo cancellation module 402 is configured to perform echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
a beam forming module 403, configured to perform beam forming processing on the two paths of audio data after echo cancellation to obtain enhanced audio data;
the audio verification module 404 is configured to perform primary verification on the text features and the voiceprint features of the enhanced audio data, and perform secondary verification on the text features and the voiceprint features of the enhanced audio data after the primary verification is passed;
and an application wake-up module 405, configured to wake up the voice interaction application when the secondary verification passes.
In an embodiment, when performing echo cancellation processing on the two paths of audio data according to the background audio data, the echo cancellation module 402 may be configured to:
obtaining an initial adaptive filter coefficient, and iteratively updating the initial adaptive filter coefficient according to background audio data and audio data to obtain a target adaptive filter coefficient;
and performing echo cancellation processing on the audio data according to the target adaptive filter coefficient.
In one embodiment, when iteratively updating the initial adaptive filter coefficients according to the background audio data and the audio data to obtain the target adaptive filter coefficients, the echo cancellation module 402 may be configured to:
obtaining the adaptive filter coefficient at the current moment according to the initial adaptive filter coefficient;

estimating the echo audio data carried in the audio data at the current moment according to the adaptive filter coefficient at the current moment;

acquiring the error audio data at the current moment according to the background audio data and the estimated echo audio data;

and identifying the active part of the adaptive filter coefficient at the current moment, updating that active part according to the error audio data at the current moment, and adjusting the order of the adaptive filter coefficient at the current moment to obtain the adaptive filter coefficient at the next moment.
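For concreteness, here is a minimal normalized-LMS (NLMS) sketch of this update loop; the application does not name a specific adaptation rule, and the filter order, step size mu, and regularizer eps below are illustrative assumptions. The order-adjustment step of the next embodiment is omitted here.

import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     order: int = 256, mu: float = 0.5,
                     eps: float = 1e-8) -> np.ndarray:
    w = np.zeros(order)                   # initial adaptive filter coefficients
    out = np.zeros(len(mic))
    for n in range(order, len(mic)):
        x = ref[n - order:n][::-1]        # most recent reference (background) samples
        echo_est = w @ x                  # estimated echo at the current moment
        e = mic[n] - echo_est             # error audio data at the current moment
        w += mu * e * x / (x @ x + eps)   # coefficient update for the next moment
        out[n] = e                        # echo-cancelled output sample
    return out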
In an embodiment, in identifying the active portion of the adaptive filter coefficients at the current time, the echo cancellation module 402 may be configured to:
dividing the adaptive filter coefficient at the current moment into a plurality of sub-filter coefficients with equal length;
acquiring the mean value and the variance of each sub-filter coefficient from back to front, and determining, as the active part, the first sub-filter coefficient whose mean value is greater than a preset mean value and whose variance is greater than a preset variance, together with all sub-filter coefficients before it;
while adjusting the order of the adaptive filter coefficients at the current time, the echo cancellation module 402 may be configured to:
and judging whether that first qualifying sub-filter coefficient is the last sub-filter coefficient; if so, increasing the order of the adaptive filter coefficient at the current moment, and otherwise, reducing the order of the adaptive filter coefficient at the current moment.
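A minimal sketch of this active-part test, assuming a sub-filter block length and mean/variance thresholds that are purely illustrative:

import numpy as np

def split_active_part(w: np.ndarray, block: int = 32,
                      mean_th: float = 1e-4, var_th: float = 1e-6):
    n_blocks = len(w) // block
    first_hit = -1
    for i in range(n_blocks - 1, -1, -1):          # scan sub-filters back to front
        seg = np.abs(w[i * block:(i + 1) * block])
        if seg.mean() > mean_th and seg.var() > var_th:
            first_hit = i                          # first qualifying sub-filter
            break
    active = w[:(first_hit + 1) * block]           # that sub-filter and all before it
    grow_order = (first_hit == n_blocks - 1)       # qualifying sub-filter is the last one
    return active, grow_order                      # grow_order True -> increase the order

The grow_order flag corresponds to the order adjustment above: when the qualifying sub-filter is the last one, the echo path presumably extends beyond the current filter, so the order is increased; otherwise it can be reduced.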
In an embodiment, when performing beamforming on the two paths of audio data after echo cancellation to obtain enhanced audio data, the beamforming module 403 may be configured to:
and adopting a preset beamforming algorithm to respectively perform beamforming processing on the two paths of echo-cancelled audio data at a plurality of preset angles, to obtain a plurality of pieces of enhanced audio data.
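As one possible instance of the "preset beamforming algorithm", the sketch below applies plain delay-and-sum beamforming to a two-microphone array at several preset angles; the microphone spacing, sound speed, angle set, and the delay-and-sum choice itself are assumptions of the sketch.

import numpy as np

def delay_and_sum(ch0: np.ndarray, ch1: np.ndarray, sr: int,
                  angle_deg: float, spacing: float = 0.02,
                  c: float = 343.0) -> np.ndarray:
    # Inter-microphone delay for a far-field source at angle_deg.
    tau = spacing * np.cos(np.deg2rad(angle_deg)) / c
    shift = int(round(tau * sr))    # delay rounded to whole samples
    aligned = np.roll(ch1, -shift)  # align channel 1 to channel 0 (edge wrap ignored)
    return 0.5 * (ch0 + aligned)    # coherent sum -> enhanced audio data

def enhance_at_preset_angles(ch0, ch1, sr, angles=(0, 45, 90, 135, 180)):
    # One piece of enhanced audio data per preset angle.
    return {a: delay_and_sum(ch0, ch1, sr, a) for a in angles}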
In one embodiment, in performing the primary verification on the text feature and the voiceprint feature of the enhanced audio data, the audio verification module 404 may be configured to:
extracting Mel frequency cepstrum coefficients of the enhanced audio data corresponding to each preset angle;
calling a target voiceprint feature model related to the preset text to match the extracted Mel frequency cepstrum coefficients;

and if a matched Mel frequency cepstrum coefficient exists, judging that the primary check passes;
The target voiceprint feature model is obtained by adaptively updating a Gaussian mixture universal background model related to the preset text according to the Mel frequency cepstrum coefficients of preset audio data, the preset audio data being audio data of the preset text spoken by the preset user.
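The adaptation step could look like the following mean-only MAP adaptation sketch, which is the textbook way of deriving a speaker model from a GMM-UBM; the application does not spell out its adaptation rule, so the recipe, the scikit-learn usage, and the relevance factor r are assumptions.

import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, enrol_mfcc: np.ndarray,
                    r: float = 16.0) -> GaussianMixture:
    post = ubm.predict_proba(enrol_mfcc)   # responsibilities, (n_frames, n_components)
    n_k = post.sum(axis=0)                 # soft frame counts per component
    f_k = post.T @ enrol_mfcc              # first-order statistics per component
    alpha = (n_k / (n_k + r))[:, None]     # per-component adaptation weight
    target = copy.deepcopy(ubm)
    # Blend enrolment means with UBM means -> adapted target voiceprint model.
    target.means_ = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) \
        + (1.0 - alpha) * ubm.means_
    return target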
In one embodiment, in performing the secondary verification on the text feature and the voiceprint feature of the enhanced audio data, the audio verification module 404 may be configured to:
dividing the enhanced audio data corresponding to the preset angle into a plurality of sub audio data;
extracting a voiceprint characteristic vector of each sub audio data according to a voiceprint characteristic extraction model related to a preset text;
acquiring similarity between each voiceprint feature vector and a target voiceprint feature vector, wherein the target voiceprint feature vector is a voiceprint feature vector of preset audio data;
according to the similarity corresponding to each sub audio data, verifying the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to the preset angle;
and if there is enhanced audio data corresponding to a preset angle that passes the verification, judging that the secondary verification passes.
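A sketch of this segmentation-and-similarity step, where extract_embedding is a hypothetical hook for the text-related voiceprint feature extraction model, and the segment length and cosine similarity are illustrative choices:

import numpy as np

def secondary_check_similarities(audio: np.ndarray, sr: int,
                                 target_vec: np.ndarray,
                                 extract_embedding,       # hypothetical extractor hook
                                 seg_sec: float = 0.5) -> list:
    seg = int(seg_sec * sr)
    sims = []
    for start in range(0, len(audio) - seg + 1, seg):
        vec = extract_embedding(audio[start:start + seg])  # voiceprint feature vector
        cos = float(vec @ target_vec /
                    (np.linalg.norm(vec) * np.linalg.norm(target_vec)))
        sims.append(cos)   # per-segment similarity to the target voiceprint vector
    return sims            # fed to the recognition function of the next embodiment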
In an embodiment, when verifying the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle according to the similarity corresponding to each piece of sub audio data, the audio verification module 404 may be configured to:
according to the similarity corresponding to each piece of sub audio data and a preset recognition function, verifying the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle;

wherein the preset recognition function is γ_n = γ_{n-1} + f(l_n), in which γ_n represents the recognition function state value corresponding to the nth piece of sub audio data, γ_{n-1} represents the recognition function state value corresponding to the (n-1)th piece of sub audio data, a is a correction value of the recognition function, b is a preset similarity, and l_n is the similarity between the voiceprint feature vector of the nth piece of sub audio data and the target voiceprint feature vector; f(l_n) is a piecewise correction term defined in terms of a and b (its definition is given as an image, Figure BDA0002083017220000181, in the original publication). If there is a γ_n greater than the preset recognition function state value, it is judged that the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle pass the verification.
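The accumulation can be sketched as below. The exact piecewise form of f(l_n) is only available as an image in the original publication; the sketch assumes the natural reading that f adds a when the segment similarity reaches the preset similarity b and subtracts a otherwise, with all parameter values illustrative.

def recognition_function_passes(sims, a=1.0, b=0.7, state_threshold=3.0):
    gamma = 0.0                         # initial recognition function state value
    for l_n in sims:
        gamma += a if l_n >= b else -a  # assumed piecewise f(l_n)
        if gamma > state_threshold:     # some gamma_n exceeds the preset state value
            return True
    return False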
In an embodiment, when obtaining the similarity between the voiceprint feature vector of each piece of sub audio data and the target voiceprint feature vector, the audio verification module 404 may be configured to:
calculating the similarity between the voiceprint feature vector of each piece of sub audio data and the target voiceprint feature vector according to a dynamic time warping algorithm;
or, calculating a feature distance between the voiceprint feature vector of each sub-audio data and the target voiceprint feature vector as a similarity.
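For the dynamic time warping option, a plain O(T1·T2) dynamic-programming DTW over sequences of frame vectors might look as follows; negating the accumulated distance so that larger values mean more similar is an assumption made to match how the text uses "similarity".

import numpy as np

def dtw_similarity(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    t1, t2 = len(seq_a), len(seq_b)
    D = np.full((t1 + 1, t2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            D[i, j] = cost + min(D[i - 1, j],                   # insertion
                                 D[i, j - 1],                   # deletion
                                 D[i - 1, j - 1])               # match
    return -float(D[t1, t2])   # negate so that a larger value means more similar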
The embodiment of the present application provides a storage medium, on which an instruction execution program is stored, and when the stored instruction execution program is executed on an electronic device provided in the embodiment of the present application, the electronic device is caused to execute the steps in the application wake-up method provided in the embodiment of the present application. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
Referring to fig. 7, the electronic device includes a processor 501, a memory 502, and a microphone 503.
The processor 501 in the present embodiment is a general purpose processor, such as an ARM architecture processor.
The memory 502 stores an instruction execution program. The memory may be a high-speed random access memory, or a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502, so as to implement the following functions:
acquiring two paths of audio data through the two microphones, and acquiring background audio data played during the audio acquisition period;
performing echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
performing beam forming processing on the two paths of audio data after echo cancellation to obtain enhanced audio data;
performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data, and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed;
and if the secondary verification passes, awakening the voice interactive application.
Referring to fig. 8, fig. 8 is another schematic structural diagram of the electronic device according to an embodiment of the present application; the difference from the electronic device shown in fig. 7 is that the electronic device further includes components such as an input unit 504 and an output unit 505.
The input unit 504 may be used to receive input digits, character information, or user characteristic information (such as fingerprints), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The output unit 505, such as a display screen, may be used to display information input by the user or information provided to the user.
In this embodiment of the present application, the processor 501 in the electronic device loads instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored in the memory 502 to implement various functions, as follows:
acquiring two paths of audio data through the two microphones, and acquiring background audio data played during the audio acquisition period;
performing echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
performing beam forming processing on the two paths of audio data after echo cancellation to obtain enhanced audio data;
performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data, and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed;
and if the secondary verification passes, awakening the voice interactive application.
In an embodiment, when performing echo cancellation processing on the two paths of audio data according to the background audio data, the processor 501 may perform:
obtaining an initial adaptive filter coefficient, and iteratively updating the initial adaptive filter coefficient according to background audio data and audio data to obtain a target adaptive filter coefficient;
and performing echo cancellation processing on the audio data according to the target adaptive filter coefficient.
In one embodiment, when iteratively updating the initial adaptive filter coefficients according to the background audio data and the audio data to obtain the target adaptive filter coefficients, the processor 501 may perform:
obtaining the adaptive filter coefficient at the current moment according to the initial adaptive filter coefficient;

estimating the echo audio data carried in the audio data at the current moment according to the adaptive filter coefficient at the current moment;

acquiring the error audio data at the current moment according to the background audio data and the estimated echo audio data;

and identifying the active part of the adaptive filter coefficient at the current moment, updating that active part according to the error audio data at the current moment, and adjusting the order of the adaptive filter coefficient at the current moment to obtain the adaptive filter coefficient at the next moment.
In an embodiment, in identifying the active portion of the adaptive filter coefficients at the current time, processor 501 may perform:
dividing the adaptive filter coefficient at the current moment into a plurality of sub-filter coefficients with equal length;
acquiring the mean value and the variance of each sub-filter coefficient from back to front, and determining, as the active part, the first sub-filter coefficient whose mean value is greater than a preset mean value and whose variance is greater than a preset variance, together with all sub-filter coefficients before it;
while adjusting the order of the adaptive filter coefficients at the current time, processor 501 may perform:
and judging whether that first qualifying sub-filter coefficient is the last sub-filter coefficient; if so, increasing the order of the adaptive filter coefficient at the current moment, and otherwise, reducing the order of the adaptive filter coefficient at the current moment.
In an embodiment, when performing beamforming processing on the two paths of audio data after echo cancellation to obtain enhanced audio data, the processor 501 may perform:
and adopting a preset beamforming algorithm to respectively perform beamforming processing on the two paths of echo-cancelled audio data at a plurality of preset angles, to obtain a plurality of pieces of enhanced audio data.
In one embodiment, in performing a primary check on the text feature and the voiceprint feature of the enhanced audio data, the processor 501 may perform:
extracting Mel frequency cepstrum coefficients of the enhanced audio data corresponding to each preset angle;
calling a target voiceprint feature model related to the preset text to match the extracted Mel frequency cepstrum coefficients;

and if a matched Mel frequency cepstrum coefficient exists, judging that the primary check passes;
The target voiceprint feature model is obtained by adaptively updating a Gaussian mixture universal background model related to the preset text according to the Mel frequency cepstrum coefficients of preset audio data, the preset audio data being audio data of the preset text spoken by the preset user.
In one embodiment, in performing the secondary verification on the text feature and the voiceprint feature of the enhanced audio data, the processor 501 may perform:
dividing the enhanced audio data corresponding to the preset angle into a plurality of sub audio data;
extracting a voiceprint characteristic vector of each sub audio data according to a voiceprint characteristic extraction model related to a preset text;
acquiring similarity between each voiceprint feature vector and a target voiceprint feature vector, wherein the target voiceprint feature vector is a voiceprint feature vector of preset audio data;
according to the similarity corresponding to each sub audio data, verifying the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to the preset angle;
and if there is enhanced audio data corresponding to a preset angle that passes the verification, judging that the secondary verification passes.
In an embodiment, when the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle are checked according to the similarity corresponding to each sub audio data, the processor 501 may perform:
according to the similarity corresponding to each piece of sub audio data and a preset recognition function, verifying the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle;

wherein the preset recognition function is γ_n = γ_{n-1} + f(l_n), in which γ_n represents the recognition function state value corresponding to the nth piece of sub audio data, γ_{n-1} represents the recognition function state value corresponding to the (n-1)th piece of sub audio data, a is a correction value of the recognition function, b is a preset similarity, and l_n is the similarity between the voiceprint feature vector of the nth piece of sub audio data and the target voiceprint feature vector; f(l_n) is a piecewise correction term defined in terms of a and b (its definition is given as an image, Figure BDA0002083017220000221, in the original publication). If there is a γ_n greater than the preset recognition function state value, it is judged that the text feature and the voiceprint feature of the enhanced audio data corresponding to the preset angle pass the verification.
In an embodiment, when obtaining the similarity between the voiceprint feature vector of each piece of sub audio data and the target voiceprint feature vector, the processor 501 may perform:
calculating the similarity between the voiceprint feature vector of each piece of sub audio data and the target voiceprint feature vector according to a dynamic time warping algorithm;
or, calculating a feature distance between the voiceprint feature vector of each sub-audio data and the target voiceprint feature vector as a similarity.
It should be noted that the electronic device provided in the embodiments of the present application and the application wake-up method in the foregoing embodiments belong to the same concept; any method provided in the application wake-up method embodiments may run on the electronic device, and its specific implementation process is described in detail in the application wake-up method embodiments, which is not repeated here.
It should be noted that, for the application wake-up method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the application wake-up method can be completed by controlling relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, such as a memory of the electronic device, and be executed by the processor and the dedicated voice recognition chip in the electronic device; the execution process can include, for example, the processes of the embodiments of the application wake-up method. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The application wake-up method, the storage medium, and the electronic device provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. An application wake-up method applied to an electronic device, wherein the electronic device comprises two microphones, the application wake-up method comprising:
when the electronic equipment is in an audio and video playing state, the processor acquires two paths of audio data through the two microphones and acquires background audio data played during audio acquisition;
the processor performs echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
the processor performs beamforming processing on the two paths of audio data after echo cancellation at a plurality of preset angles respectively by adopting a preset beamforming algorithm to obtain enhanced audio data corresponding to each preset angle, wherein the preset angles are obtained according to statistical incoming-wave angles at which the use probability of a preset user reaches a preset probability;
the processor performs primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to each preset angle, and performs secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification passes;
and if the secondary verification passes, the processor wakes up the voice interaction application.
2. The application wake-up method according to claim 1, wherein the processor performs echo cancellation processing on the two paths of audio data according to the background audio data, and the echo cancellation processing includes:
the processor obtains an initial adaptive filter coefficient, and iteratively updates the initial adaptive filter coefficient according to the background audio data and the audio data to obtain a target adaptive filter coefficient;
and the processor performs echo cancellation processing on the audio data according to the target adaptive filter coefficient.
3. The application wake-up method according to claim 2, wherein the processor iteratively updates the initial adaptive filter coefficient according to the background audio data and the audio data to obtain the target adaptive filter coefficient, comprising:
the processor acquires the adaptive filter coefficient at the current moment according to the initial adaptive filter coefficient;
the processor estimates echo audio data carried in the audio data and corresponding to the current moment according to the adaptive filter coefficient of the current moment;
the processor acquires error audio data at the current moment according to the background audio data and the echo audio data;
and the processor identifies the active part of the adaptive filter coefficient at the current moment, updates the active part according to the error audio data, and adjusts the order of the adaptive filter coefficient at the current moment to obtain the adaptive filter coefficient at the next moment.
4. The application wakeup method according to claim 3, wherein the identifying, by the processor, the active portion of the adaptive filter coefficient at the current time comprises:
the processor divides the adaptive filter coefficient of the current moment into a plurality of sub-filter coefficients with equal length;
the processor obtains the mean value and the variance of each sub-filter coefficient from back to front, and determines, as the active part, the first sub-filter coefficient whose mean value is greater than a preset mean value and whose variance is greater than a preset variance, together with the sub-filter coefficients before it;
the adjusting the order of the adaptive filter coefficient at the current time includes:
and the processor judges whether the first qualifying sub-filter coefficient is the last sub-filter coefficient; if so, the order of the adaptive filter coefficient at the current moment is increased, and otherwise, the order of the adaptive filter coefficient at the current moment is reduced.
5. The application wake-up method according to any one of claims 1 to 4, wherein the processor performs primary verification on the text feature and the voiceprint feature of the enhanced audio data corresponding to each preset angle, including:
the processor extracts a Mel frequency cepstrum coefficient of the enhanced audio data corresponding to each preset angle;
the processor calls a target voiceprint feature model related to a preset text to match the extracted Mel frequency cepstrum coefficients;
if the matched mel frequency cepstrum coefficient exists, the processor judges that the primary check is passed;
wherein the target voiceprint feature model is obtained by adaptively updating a Gaussian mixture universal background model related to the preset text according to a Mel frequency cepstrum coefficient of preset audio data, and the preset audio data are audio data of the preset text spoken by a preset user.
6. The application wake-up method according to claim 5, wherein the secondary verification of the text feature and the voiceprint feature of the enhanced audio data after the primary verification comprises:
the processor divides the enhanced audio data passing the primary check into a plurality of sub audio data;
the processor extracts the voiceprint feature vectors of the sub audio data according to the voiceprint feature extraction model related to the preset text;
the processor obtains the similarity between each voiceprint feature vector and a target voiceprint feature vector, wherein the target voiceprint feature vector is the voiceprint feature vector of the preset audio data;
the processor checks the text characteristic and the voiceprint characteristic of the enhanced audio data which passes the primary check according to the corresponding similarity of each sub audio data;
and if the enhanced audio data passing the primary verification passes the verification again, the processor judges that the secondary verification passes.
7. An application wakeup apparatus applied to a processor of an electronic device, wherein the electronic device includes two microphones, the application wakeup apparatus comprising:
the audio acquisition module is used for acquiring two paths of audio data through the two microphones when the electronic equipment is in an audio and video playing state and acquiring background audio data played in an audio acquisition period;
the echo cancellation module is used for carrying out echo cancellation processing on the two paths of audio data according to the background audio data to obtain two paths of audio data after echo cancellation;
the beam forming module is used for performing beam forming processing on the two paths of audio data after echo cancellation at a plurality of preset angles respectively by adopting a preset beam forming algorithm to obtain enhanced audio data corresponding to each preset angle, wherein the preset angle is obtained according to a statistical incoming wave angle at which the use probability of a preset user reaches a preset probability;
the audio verification module is used for performing primary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data corresponding to each preset angle and performing secondary verification on the text characteristic and the voiceprint characteristic of the enhanced audio data after the primary verification is passed;
and the application awakening module is used for awakening the voice interaction application when the secondary verification is passed.
8. An electronic device, comprising a processor, a memory, and two microphones, the memory storing a computer program, wherein the processor is adapted to execute the application wake-up method according to any one of claims 1 to 6 by invoking the computer program.
9. A storage medium, characterized in that, when a computer program stored in the storage medium is run on an electronic device comprising two microphones, the electronic device is caused to perform an application wake-up method according to any of claims 1 to 6.
CN201910478400.6A 2019-06-03 2019-06-03 Application awakening method and device, storage medium and electronic equipment Active CN110211599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910478400.6A CN110211599B (en) 2019-06-03 2019-06-03 Application awakening method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910478400.6A CN110211599B (en) 2019-06-03 2019-06-03 Application awakening method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110211599A CN110211599A (en) 2019-09-06
CN110211599B true CN110211599B (en) 2021-07-16

Family

ID=67790514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910478400.6A Active CN110211599B (en) 2019-06-03 2019-06-03 Application awakening method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110211599B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048071B (en) * 2019-11-11 2023-05-30 京东科技信息技术有限公司 Voice data processing method, device, computer equipment and storage medium
CN111179931B (en) * 2020-01-03 2023-07-21 青岛海尔科技有限公司 Method and device for voice interaction and household appliance
CN112307161B (en) * 2020-02-26 2022-11-22 北京字节跳动网络技术有限公司 Method and apparatus for playing audio
CN111369992A (en) * 2020-02-27 2020-07-03 Oppo(重庆)智能科技有限公司 Instruction execution method and device, storage medium and electronic equipment
CN111755002B (en) * 2020-06-19 2021-08-10 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN112581972A (en) * 2020-10-22 2021-03-30 广东美的白色家电技术创新中心有限公司 Voice interaction method, related device and corresponding relation establishing method
CN115148197A (en) * 2021-03-31 2022-10-04 华为技术有限公司 Voice wake-up method, device, storage medium and system
CN115171703B (en) * 2022-05-30 2024-05-24 青岛海尔科技有限公司 Distributed voice awakening method and device, storage medium and electronic device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10194259B1 (en) * 2018-02-28 2019-01-29 Bose Corporation Directional audio selection

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002374588A (en) * 2001-06-15 2002-12-26 Sony Corp Device and method for reducing acoustic noise
CN101763858A (en) * 2009-10-19 2010-06-30 瑞声声学科技(深圳)有限公司 Method for processing double-microphone signal
CN101917527A (en) * 2010-09-02 2010-12-15 杭州华三通信技术有限公司 Method and device of echo elimination
CN104520925A (en) * 2012-08-01 2015-04-15 杜比实验室特许公司 Percentile filtering of noise reduction gains
CN103680515A (en) * 2013-11-21 2014-03-26 苏州大学 Proportional adaptive filter coefficient vector updating method using coefficient reusing
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
US9842606B2 (en) * 2015-09-15 2017-12-12 Samsung Electronics Co., Ltd. Electronic device, method of cancelling acoustic echo thereof, and non-transitory computer readable medium
CN105654959A (en) * 2016-01-22 2016-06-08 韶关学院 Self-adaptive filtering coefficient updating method and device
CN107123430A (en) * 2017-04-12 2017-09-01 广州视源电子科技股份有限公司 Echo cancel method, device, meeting flat board and computer-readable storage medium
US10013995B1 (en) * 2017-05-10 2018-07-03 Cirrus Logic, Inc. Combined reference signal for acoustic echo cancellation
US20190074025A1 (en) * 2017-09-01 2019-03-07 Cirrus Logic International Semiconductor Ltd. Acoustic echo cancellation (aec) rate adaptation
CN107464565A (en) * 2017-09-20 2017-12-12 百度在线网络技术(北京)有限公司 A kind of far field voice awakening method and equipment
CN109218882A (en) * 2018-08-16 2019-01-15 歌尔科技有限公司 The ambient sound monitor method and earphone of earphone

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Zhengteng et al., "Research on Echo Cancellation Methods Based on Prediction Residual and Adaptive Order", China Master's Theses Full-text Database (Electronic Journal), 2017-02-15, Sections 4.3-4.4 *
Wen Haoxiang et al., "Early-Iteration Statistical Model and Improved Algorithm for Adaptive Echo Cancellation", Journal of Data Acquisition and Processing, 2012-01-31, full text *

Also Published As

Publication number Publication date
CN110211599A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110211599B (en) Application awakening method and device, storage medium and electronic equipment
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
US11042616B2 (en) Detection of replay attack
CN110400571B (en) Audio processing method and device, storage medium and electronic equipment
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
US9633652B2 (en) Methods, systems, and circuits for speaker dependent voice recognition with a single lexicon
US20200227071A1 (en) Analysing speech signals
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
CN110600048B (en) Audio verification method and device, storage medium and electronic equipment
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
EP0822539A2 (en) Two-staged cohort selection for speaker verification system
TW201419270A (en) Method and apparatus for utterance verification
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
US9953633B2 (en) Speaker dependent voiced sound pattern template mapping
CN110689887B (en) Audio verification method and device, storage medium and electronic equipment
US11081115B2 (en) Speaker recognition
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN110992977B (en) Method and device for extracting target sound source
CN111369992A (en) Instruction execution method and device, storage medium and electronic equipment
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
CN111192569B (en) Double-microphone voice feature extraction method and device, computer equipment and storage medium
CN114464188A (en) Voiceprint awakening algorithm based on distributed edge calculation
CN112509556B (en) Voice awakening method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant