CN115223551A - Voice awakening method and system based on voice similarity matching - Google Patents

Voice awakening method and system based on voice similarity matching

Info

Publication number
CN115223551A
CN115223551A (application CN202110341328.XA)
Authority
CN
China
Prior art keywords
audio
voice
data
awakening
preset
Prior art date
Legal status
Pending
Application number
CN202110341328.XA
Other languages
Chinese (zh)
Inventor
熊浩
龚科
Current Assignee
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN202110341328.XA priority Critical patent/CN115223551A/en
Publication of CN115223551A publication Critical patent/CN115223551A/en
Pending legal-status Critical Current

Classifications

    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/87: Detection of discrete points within a voice signal
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voice wake-up method and system based on voice similarity matching. An audio feature extraction model is obtained by training a neural network model; inference is performed on a preset number of wake-up word samples with large mutual differences using the audio feature extraction model to obtain the corresponding content feature vectors, from which a wake-up word sample library is built. The audio stream is monitored in real time, and the non-silent audio data in the stream are input into the audio feature extraction model to extract content feature vectors. The extracted content feature vectors are compared with the wake-up word sample library to obtain a similarity value, which is compared with a preset threshold; when the similarity value is smaller than the preset threshold the matching succeeds and a wake-up word detection signal is sent. The method and system simplify the way a user activates interaction during voice interaction with an intelligent device, allow a wake-up word suited to the actual application scenario to be selected, and improve user experience through wake-up by a specific word.

Description

Voice awakening method and system based on voice similarity matching
Technical Field
The invention relates to the technical field of voice awakening, in particular to a voice awakening method and system based on voice similarity matching.
Background
With the advent of deep learning and the growth of big data and computing power, voice technology has developed rapidly and is widely applied in intelligent terminals, mobile internet applications, finance, telecommunications, automotive, home, education and other industries; many intelligent devices can interact with users through voice. Voice wake-up is a form of speech recognition technology used in online voice wake-up systems: a device can be woken up by voice without directly touching the hardware. Existing voice-assistant wake-up words are often not recognized well by the machine, wake-up efficiency is poor, and use is inconvenient; as a result the voice assistant is used infrequently, its real value is not realized, and user experience suffers. In addition, because of the nature of deep learning, voice wake-up methods based on deep learning need to collect a large amount of data, and when the wake-up word is changed a large amount of data must be collected again and the model retrained, which makes changing the wake-up word inconvenient.
Disclosure of Invention
The invention therefore aims to overcome the defects of existing wake-up voice assistants: poor wake-up efficiency, inconvenient replacement of the wake-up word, and poor user experience.
In order to achieve the purpose, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a voice wake-up method based on voice similarity matching, including:
training a preset deep neural network model to obtain an audio characteristic extraction model;
acquiring a preset number of wake-up word sample data with large mutual differences according to a first preset length, inputting the sample data into the audio feature extraction model, performing inference on the sample data to obtain content feature vectors corresponding to the sample data, and screening the set of content feature vectors of the samples under preset conditions to obtain a wake-up word sample library of the current wake-up word;
monitoring the audio stream in real time by using a sliding window with a first preset length, and filtering the mute audio data to obtain non-mute audio data;
inputting non-silent audio data into an audio feature extraction model to extract content feature vectors of audio in a current sliding window;
and comparing the extracted content feature vector with the wake-up word sample library for similarity, calculating a feature distance value, and comparing it with a preset threshold value; when the feature distance value is smaller than the preset threshold value the matching is successful and a wake-up word detection signal is sent.
Preferably, training the preset deep neural network to obtain an audio feature extraction model, includes:
constructing a training data set comprising: acquiring a large amount of audio data and original data of content texts corresponding to the audio data, aligning the sound and the characters of the original data, and cutting out a plurality of sections of audio segments according to the alignment relation between an original audio frame and a text sequence and the length of a preset text sequence;
carrying out data preprocessing and data enhancement on the divided multiple sections of audio segments to obtain an amplitude spectrum for subsequent deep neural network reasoning;
migrating the initialization parameters obtained by training on the large data set to a preset deep neural network model, and initializing the parameters of the model;
and inputting the amplitude spectrum into a deep neural network model, adjusting the initialized deep neural network model according to the data labels, and circularly executing training operation by using all training data in the training data set until the model converges to finish training to obtain an audio characteristic extraction model.
Preferably, the process of monitoring the audio stream by using the wakeup word includes:
intercepting audio with a specified length from the audio stream through a sliding window with a first preset length to obtain an audio clip;
further sampling the audio clip sampled by the sliding window to obtain a plurality of audio frames, judging whether the sampled audio frames are silent, and judging whether the current audio clip is a silent clip according to the silence judgment condition of the audio frames;
and filtering the mute audio to obtain the non-mute audio.
Preferably, when the ratio of the number of silent audio frames to the total number of audio frames in an audio segment exceeds a preset ratio threshold, the current segment is considered as a silent segment.
Preferably, the steps of comparing the similarity of the extracted content feature vector with the awakening word sample library, calculating to obtain a feature distance value, comparing the feature distance value with a preset threshold value, and successfully matching when the feature distance value is smaller than the preset threshold value include:
selecting the k smallest vector distances between the extracted content feature vector and the sample vectors in the wake-up word template library, computing their mean and comparing the mean with the threshold; if the average distance is smaller than the preset threshold, the matching is considered successful.
Preferably, the data preprocessing includes: short-time Fourier transform, extraction of the spectrum magnitude, and amplitude normalization.
Preferably, the data enhancement comprises: noise addition, speed change, pitch shifting and spectrum masking.
In a second aspect, an embodiment of the present invention provides a voice wake-up system based on voice similarity matching, including:
the deep neural network model training module is used for training a preset deep neural network model to obtain an audio characteristic extraction model;
the awakening word sample library generating module is used for acquiring awakening word sample data with larger difference and preset quantity according to a first preset length, inputting the awakening word sample data into the audio characteristic extraction model, reasoning the sample data to obtain content characteristic vectors corresponding to the sample data, and screening a content characteristic vector set of the sample through preset conditions to obtain the awakening word sample library of the current awakening word;
the real-time audio monitoring module is used for monitoring the audio stream in real time and filtering out silent audio data to obtain non-silent audio data;
the audio characteristic extraction module is used for inputting the non-silent audio data into the audio characteristic extraction model to extract the content characteristic vector of the audio in the current sliding window;
and the feature matching module is used for comparing the similarity of the extracted content feature vector with the awakening word sample library, calculating to obtain a feature distance value, comparing the feature distance value with a preset threshold value, successfully matching when the feature distance value is smaller than the preset threshold value, and sending a signal for monitoring the awakening word.
In a third aspect, an embodiment of the present invention provides a computer device, including: the apparatus may include at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the at least one processor to perform the method of the first aspect of the embodiments of the invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are used to cause a computer to execute the method according to the first aspect.
The technical scheme of the invention has the following advantages:
according to the voice awakening method and system based on voice similarity matching, provided by the embodiment of the invention, an audio characteristic extraction model is obtained by training a neural network model; pushing a large amount of wake-up word sample data with large differences by using an audio feature extraction model to obtain corresponding content feature vectors, and further obtaining a wake-up word sample library; the method comprises the steps that non-silent audio data in an audio stream are input into an audio feature extraction model to extract content feature vectors of the audio stream by monitoring the audio stream in real time; and comparing the extracted content characteristic vector with the awakening word sample base to obtain a similarity numerical value, comparing the similarity numerical value with a preset threshold, successfully matching when the similarity numerical value is smaller than the preset threshold, and sending a signal for monitoring the awakening word. The voice awakening method and the voice awakening system simplify the process of activating interaction by a user during voice interaction of the intelligent equipment, select the adaptive word awakening words according to the actual application scene, and improve the user experience in a specific word awakening mode.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an example of a voice wake-up method based on voice similarity matching provided in an embodiment of the present invention;
fig. 2 is a block diagram of an exemplary voice wake-up system based on voice similarity matching according to an embodiment of the present invention;
fig. 3 is a block diagram of a specific example of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
The embodiment of the invention provides a voice awakening method based on voice similarity matching, which is applied to various intelligent equipment voice interaction scenes to complete a real-time voice awakening function. As shown in fig. 1, the method comprises the steps of:
s1, training a preset deep neural network model to obtain an audio characteristic extraction model.
For a voice wake-up task, a deep network model is required to extract features from the content expressed in a section of audio to obtain a content feature vector, which can be regarded as a mapping of the content expressed by the audio signal into a high-dimensional space. It is therefore necessary to design a deep network model training scheme for extracting audio content. The deep neural network model can be selected according to the performance requirements of the deployment platform; a convolutional neural network, a recurrent neural network, or other network models capable of modeling the audio signal can be chosen. To facilitate deployment optimization, the embodiment of the present invention uses a convolutional neural network.
In a specific embodiment, the process of training the convolutional neural network to obtain the audio feature extraction model includes:
step S11: constructing a training data set comprising: acquiring a large amount of audio data and original data of content texts corresponding to the audio data, aligning the sound and the characters of the original data, and cutting out a plurality of sections of audio segments according to the alignment relation between an original audio frame and a text sequence and the length of a preset text sequence.
For example, if the original text sequence is "today the weather is really good", segments such as "today" and "weather" can be cut out with a text-sequence length of 2, and segments such as "today the weather" with a length of 4. Various word segmentation algorithms can be used in this process; segments that do not form definite semantics are preferably avoided, but if the amount of data is small such segments can also be used, because the goal of the subsequent deep network training is to extract audio content information rather than semantic information. After the audio segments are obtained they are classified by text content, and audio with the same text content is grouped into one class for subsequent classification training of the model.
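By way of illustration only, a minimal sketch of this cutting step is given below (Python); the alignment format (character, start sample, end sample) and the helper name cut_segments are assumptions made for this sketch and are not prescribed by the embodiment.

# Sketch of cutting audio segments from a character-level alignment.
# The alignment format (char, start_sample, end_sample) is an assumption
# made for illustration; real forced-alignment tools differ.

def cut_segments(audio, alignment, text_len):
    """Cut audio into pieces whose text covers text_len consecutive characters.

    audio      -- 1-D list/array of samples
    alignment  -- list of (char, start_sample, end_sample), in order
    text_len   -- preset text-sequence length (e.g. 2 or 4 characters)
    """
    segments = []
    for i in range(0, len(alignment) - text_len + 1, text_len):
        chars = alignment[i:i + text_len]
        text = "".join(c for c, _, _ in chars)
        start = chars[0][1]
        end = chars[-1][2]
        segments.append((text, audio[start:end]))
    return segments

# Segments sharing the same text are then grouped into one class
# for the subsequent classification-style training.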
Step S12: and carrying out data preprocessing and data enhancement on the segmented multi-segment audio segments to obtain an amplitude spectrum for subsequent deep neural network reasoning.
The data preprocessing produces a two-dimensional magnitude spectrum for subsequent deep neural network inference, and specifically includes short-time Fourier transform, extraction of the spectrum magnitude, amplitude normalization, and similar operations. To increase the diversity of the training data, data enhancement can be applied to the audio, including noise addition, speed change, pitch shifting and spectrum masking. Noise addition adds random white noise, or noise collected from a specific scene, to the original audio at a chosen signal-to-noise ratio. Speed change and pitch shifting can be realized by waveform-similarity overlap-add, resampling and similar methods. Spectrum masking is a data enhancement method that, after the spectrum is obtained, masks actual values in the spectrum with a mean value. The content and order of the preprocessing and enhancement given here are only examples; in practice they are set reasonably according to specific requirements.
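The following sketch (NumPy/SciPy) illustrates the preprocessing and two of the enhancement operations described above; the window length, hop, signal-to-noise ratio and mask sizes are illustrative assumptions rather than values fixed by the embodiment.

import numpy as np
from scipy.signal import stft

def magnitude_spectrum(wave, sr=16000, nperseg=400, noverlap=240):
    """STFT -> magnitude -> per-utterance normalization (parameter values are assumptions)."""
    _, _, spec = stft(wave, fs=sr, nperseg=nperseg, noverlap=noverlap)
    mag = np.abs(spec)
    return mag / (mag.max() + 1e-8)          # amplitude normalization

def add_noise(wave, snr_db=20.0):
    """Add white noise at a given signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(wave)) * np.sqrt(noise_power)
    return wave + noise

def spectrum_mask(mag, max_freq_bins=8, max_time_frames=12):
    """Mask random frequency and time bands with the spectrum mean."""
    mag = mag.copy()
    mean = mag.mean()
    f0 = np.random.randint(0, max(1, mag.shape[0] - max_freq_bins))
    t0 = np.random.randint(0, max(1, mag.shape[1] - max_time_frames))
    mag[f0:f0 + max_freq_bins, :] = mean     # frequency mask
    mag[:, t0:t0 + max_time_frames] = mean   # time mask
    return mag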
Step S13: and migrating the initialization parameters obtained by training on the large data set to a preset deep neural network model, and initializing the parameters of the model.
Step S14: inputting the amplitude spectrum into a deep neural network model, adjusting the initialized deep neural network model according to the data labels, and circularly executing training operation by using all training data in the training data set until the model converges to finish training to obtain an audio feature extraction model.
And S2, acquiring a preset number of wake-up word sample data with large mutual differences according to a first preset length, inputting the sample data into the audio feature extraction model, performing inference on the sample data to obtain content feature vectors corresponding to the sample data, and screening the set of content feature vectors under preset conditions to obtain the wake-up word sample library of the current wake-up word. The choice of the wake-up word is set reasonably according to the specific application scenario and is not limited here.
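As one possible reading of this step, the sketch below builds the sample library by running inference on the collected wake-up word clips and keeping the vectors closest to their centroid; the centroid-based screening rule and the function names are assumptions, since the embodiment does not fix the preset screening condition.

import numpy as np

def build_sample_library(wake_word_clips, extract_vector, keep_ratio=0.9):
    """Infer content feature vectors for the recorded wake-up word clips and
    keep the keep_ratio fraction that lies closest to the centroid.

    wake_word_clips -- iterable of preprocessed magnitude spectra
    extract_vector  -- callable mapping a spectrum to a 1-D feature vector
    """
    vectors = np.stack([extract_vector(clip) for clip in wake_word_clips])
    centroid = vectors.mean(axis=0)
    dists = np.linalg.norm(vectors - centroid, axis=1)
    keep = np.argsort(dists)[: int(len(vectors) * keep_ratio)]
    return vectors[keep]          # the wake-up word sample library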
And S3, using a sliding window with a first preset length to monitor the audio stream in real time, and filtering the mute audio data to obtain non-mute audio data.
In the embodiment of the present invention, listening for the wake-up word in the audio stream involves two steps for obtaining a non-silent audio clip of a certain length from the audio stream data: audio sampling and silence detection. Audio sampling intercepts audio of a specified length from the audio stream with a sliding window of that length and performs format conversion, decoding and similar operations on the sampled audio; silence detection judges whether the audio of the specified length obtained by sampling is a silent segment, and filters it out if it is. The method specifically comprises the following steps:
and S31, intercepting the audio with the appointed length from the audio stream through a sliding window with a first preset length to obtain an audio fragment. The window length and the sliding distance of the sliding window can be set as required, the sliding distance cannot exceed the window length, and otherwise, missing audio segments exist.
And S32, further sampling the audio clip obtained by the sliding window to obtain a plurality of audio frames, judging whether each sampled audio frame is silent, and judging whether the current audio clip is a silent clip according to the per-frame silence judgments.
For example, for all frames in a clip, if (number of silent audio frames / total number of audio frames) > α, the current clip is considered a silent clip; the sensitivity of real-time audio monitoring can be controlled through the threshold α.
And S33, filtering out the silent audio to obtain non-silent audio. After silent audio is filtered out, unnecessary work in the subsequent functional modules is avoided, reducing the consumption of computing resources and device power.
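Steps S31 to S33 can be sketched as follows; the energy-based per-frame silence test and the concrete window, hop and threshold values are assumptions chosen for illustration.

import numpy as np

def sliding_windows(stream, win_len, hop_len):
    """Yield fixed-length audio clips; hop_len must not exceed win_len."""
    for start in range(0, len(stream) - win_len + 1, hop_len):
        yield stream[start:start + win_len]

def is_silent_clip(clip, frame_len=400, energy_thresh=1e-4, alpha=0.9):
    """A clip is silent when the fraction of low-energy frames exceeds alpha."""
    frames = [clip[i:i + frame_len] for i in range(0, len(clip) - frame_len + 1, frame_len)]
    silent = sum(np.mean(f.astype(np.float64) ** 2) < energy_thresh for f in frames)
    return len(frames) == 0 or silent / len(frames) > alpha

def non_silent_clips(stream, win_len=16000, hop_len=8000):
    """Filter out silent clips so later stages only see speech-bearing audio."""
    for clip in sliding_windows(stream, win_len, hop_len):
        if not is_silent_clip(clip):
            yield clip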
And S4, inputting the non-silent audio data into an audio feature extraction model to extract the content feature vector of the audio in the current sliding window.
In this step the audio feature extraction model extracts features from the non-silent audio. The non-silent audio first undergoes the same data preprocessing as in training (short-time Fourier transform, magnitude computation, amplitude normalization, etc.) and is converted into the two-dimensional magnitude spectrum expected by the deep model; the content feature vector of the current audio is then obtained by forward inference of the model.
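A minimal sketch of this extraction step is given below; it reuses the preprocessing helper and the two-output model from the earlier sketches, both of which are assumptions of this illustration.

import torch

def extract_window_vector(model, clip):
    """Preprocess a non-silent clip and run forward inference to get its content feature vector."""
    mag = magnitude_spectrum(clip)                       # same preprocessing as in training
    x = torch.from_numpy(mag).float()[None, None]        # shape (1, 1, freq, time)
    model.eval()
    with torch.no_grad():
        _, embedding = model(x)                          # model returns (logits, embedding)
    return embedding.squeeze(0).numpy()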
And S5, comparing the extracted content feature vector with the wake-up word sample library for similarity, calculating a feature distance value, and comparing it with a preset threshold; when the value is smaller than the preset threshold the matching is successful and a wake-up word detection signal is sent.
The embodiment of the invention compares the extracted content feature vector with the wake-up word template library, calculates a quantized similarity value, and compares it with the set threshold to decide whether the current audio is successfully matched; if the matching is successful, a wake-up word detection signal is sent. Vector distance measures such as the Euclidean distance or the cosine distance can be used to compare feature vectors. Denote the distance measure by D and let the feature vectors be n-dimensional, so that the distance between two feature vectors x and y is written D(x, y). If the Euclidean distance is used, then
D(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )
If cosine distances are used, then
D(x, y) = 1 - ( Σ_{i=1}^{n} x_i · y_i ) / ( sqrt( Σ_{i=1}^{n} x_i^2 ) · sqrt( Σ_{i=1}^{n} y_i^2 ) )
In practical application, because sampling may be biased when the sample library is not large enough, the embodiment of the present invention selects, among the distances between the extracted content feature vector and the sample vectors, the k smallest values, computes their mean and compares it with the threshold; if the average distance is smaller than the threshold, the matching is considered successful. Let K_1, …, K_k be the k smallest distances and β the threshold; the matching condition is then
(1/k) Σ_{i=1}^{k} K_i < β
The setting of the threshold is related to the actual application scene and the sensitivity requirement, and the threshold setting when the indexes are optimal on the test set can be obtained by traversing different thresholds according to a certain granularity.
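The matching rule and the threshold sweep described above can be sketched as follows; the Euclidean metric is shown (the cosine distance above could be substituted), and the values of k, β and the sweep grid are assumptions.

import numpy as np

def matches_wake_word(query_vec, sample_library, k=5, beta=0.8):
    """Return True when the mean of the k smallest distances to the library is below beta."""
    dists = np.linalg.norm(sample_library - query_vec, axis=1)   # Euclidean distances
    k = min(k, len(dists))
    mean_topk = np.sort(dists)[:k].mean()
    return mean_topk < beta

def pick_threshold(queries, labels, sample_library, grid=np.linspace(0.1, 2.0, 39)):
    """Sweep beta over a grid on a labeled test set and keep the value with the best F1."""
    best_beta, best_score = None, -1.0
    for beta in grid:
        preds = [matches_wake_word(q, sample_library, beta=beta) for q in queries]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_score:
            best_beta, best_score = beta, f1
    return best_beta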
According to the voice wake-up method based on voice similarity matching provided by the embodiment of the invention, an audio feature extraction model is obtained by training a neural network model, and inference is performed on a large amount of wake-up word sample data with large differences using this model to obtain the corresponding content feature vectors, from which a wake-up word sample library is built. The audio stream is monitored in real time, and the non-silent audio data in the stream are input into the audio feature extraction model to extract content feature vectors; the extracted content feature vectors are compared with the wake-up word sample library to obtain a similarity value, which is compared with a preset threshold, and when the value is smaller than the preset threshold a wake-up word detection signal is sent. This simplifies the way the user activates interaction during voice interaction with an intelligent device, allows a wake-up word adapted to the actual application scenario to be selected, and improves user experience through wake-up by a specific word.
Example 2
An embodiment of the present invention further provides a voice wake-up system based on voice similarity matching, as shown in fig. 2, including:
the deep neural network model training module 1 is used for training a preset deep neural network model to obtain an audio characteristic extraction model; this module executes the method described in step S1 in embodiment 1, and details are not repeated here.
The awakening word sample library generating module 2 is configured to collect awakening word sample data with a preset number and a large difference according to a first preset length, input the sample data into the audio feature extraction model, perform inference on the sample data to obtain content feature vectors corresponding to the sample data, and filter a set of content feature vectors of the sample under preset conditions to obtain an awakening word sample library of the current awakening word; this module executes the method described in step S2 in embodiment 1, and is not described herein again.
The real-time audio monitoring module 3 is configured to monitor an audio stream in real time by using a sliding window with a first preset length, and filter silent audio data to obtain non-silent audio data; the module comprises an audio sampling unit and a silence detection unit, so that audio sampling and silence filtering are realized, and non-silence audio is obtained. This module executes the method described in step S3 in embodiment 1, and is not described herein again.
And the audio feature extraction module 4 is configured to input the non-silent audio data into the audio feature extraction model to extract a content feature vector of the audio in the current sliding window. This module executes the method described in step S4 in embodiment 1, which is not described herein again.
And the feature matching module 5 is used for comparing the similarity of the extracted content feature vector with the awakening word sample library, calculating to obtain a feature distance value, comparing the feature distance value with a preset threshold value, successfully matching when the feature distance value is smaller than the preset threshold value, and sending a signal for monitoring the awakening word. This module executes the method described in step S5 in embodiment 1, and is not described herein again.
According to the voice awakening system based on voice similarity matching, the voice awakening function is achieved through the mutual cooperation of the functional modules, the process that a user activates interaction during voice interaction of intelligent equipment is simplified, and the user experience is improved through the awakening mode of specific words.
Example 3
An embodiment of the present invention provides a computer device, as shown in fig. 3, including: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402, where the communication bus 402 is used to enable connection and communication between these components. The communication interface 403 may include a Display and a Keyboard, and optionally the communication interface 403 may also include a standard wired interface and a standard wireless interface. The memory 404 may be a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. The memory 404 may optionally be at least one storage device located remotely from the aforementioned processor 401. The processor 401 may perform the method described in embodiment 1 or embodiment 2. A set of program codes is stored in the memory 404, and the processor 401 calls the program codes stored in the memory 404 to perform the voice wake-up method based on voice similarity matching of embodiment 1.
The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 3, but this does not represent only one bus or one type of bus.
The memory 404 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also comprise a combination of the above kinds of memory. The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof. Optionally, the memory 404 is also used to store program instructions. The processor 401 may invoke the program instructions to implement the voice wake-up method based on voice similarity matching of embodiment 1.
An embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored on the computer-readable storage medium, and the computer-executable instructions may execute the voice wake-up method based on voice similarity matching in embodiment 1. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid-State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A voice wake-up method based on voice similarity matching is characterized by comprising the following steps:
training a preset deep neural network model to obtain an audio characteristic extraction model;
acquiring preset quantity of wake-up word sample data with larger difference according to a first preset length, inputting the sample data into the audio feature extraction model, reasoning the sample data to obtain content feature vectors corresponding to the sample data, and screening the content feature vector set of the sample under preset conditions to obtain a wake-up word sample library of the current wake-up word;
monitoring the audio stream in real time by using a sliding window with a first preset length, and filtering the mute audio data to obtain non-mute audio data;
inputting non-silent audio data into an audio feature extraction model to extract content feature vectors of audio in a current sliding window;
and comparing the similarity of the extracted content feature vector with a wake-up word sample library, calculating to obtain a feature distance value, comparing the feature distance value with a preset threshold value, and sending a signal for monitoring the wake-up word if the matching is successful when the feature distance value is smaller than the preset threshold value.
2. The voice awakening method based on voice similarity matching according to claim 1, wherein the training of the preset deep neural network to obtain the audio feature extraction model comprises the following steps:
constructing a training data set comprising: acquiring a large amount of audio data and original data of content texts corresponding to the audio data, aligning the sound and the characters of the original data, and cutting out a plurality of sections of audio segments according to the alignment relation between an original audio frame and a text sequence and the length of a preset text sequence;
carrying out data preprocessing and data enhancement on the divided multiple sections of audio segments to obtain an amplitude spectrum for subsequent deep neural network reasoning;
migrating the initialization parameters obtained by training on the large data set to a preset deep neural network model, and initializing the parameters of the model;
and inputting the amplitude spectrum into a deep neural network model, adjusting the initialized deep neural network model according to the data labels, and circularly executing training operation by using all training data in the training data set until the model converges to finish training to obtain an audio characteristic extraction model.
3. The voice wake-up method based on voice similarity matching according to claim 1, wherein the process of performing wake-up word listening on the audio stream comprises:
intercepting audio with a specified length from an audio stream through a sliding window with a first preset length to obtain an audio clip;
further sampling the audio clip sampled by the sliding window to obtain a plurality of audio frames, judging whether the sampled audio frames are silent, and judging whether the current audio clip is a silent clip according to the mute judgment condition of the audio frames;
and filtering the mute audio to obtain the non-mute audio.
4. The voice wake-up method based on voice similarity matching according to claim 3,
and when the ratio of the number of the mute audio frames to the total number of the audio frames in the audio clip exceeds a preset ratio threshold, the current clip is considered as the mute clip.
5. The voice awakening method based on voice similarity matching according to claim 1, wherein the steps of comparing the similarity of the extracted content feature vector with an awakening word sample library, calculating to obtain a feature distance numerical value, comparing the feature distance numerical value with a preset threshold value, and successfully matching when the feature distance numerical value is smaller than the preset threshold value comprise:
selecting the smaller value of the vector distance between the k extracted content feature vectors and the sample vectors in the awakening word template base, solving the distance mean value of the smaller value and comparing the distance mean value with a threshold value, and if the average distance is smaller than a preset threshold value, determining that the matching is successful.
6. The voice wake-up method based on voice similarity matching according to claim 2 or 3, wherein the data preprocessing comprises: short-time Fourier transform, extraction of the spectrum magnitude, and amplitude normalization.
7. The voice wake-up method based on voice similarity matching according to claim 2, wherein the data enhancement comprises: noise addition, speed change, pitch shifting and spectrum masking.
8. A voice wake-up system based on voice similarity matching, comprising:
the deep neural network model training module is used for training a preset deep neural network model to obtain an audio characteristic extraction model;
the awakening word sample library generating module is used for collecting awakening word sample data with preset quantity and large difference according to a first preset length, inputting the awakening word sample data into the audio feature extraction model, reasoning the sample data to obtain content feature vectors corresponding to the sample data, and screening the content feature vector set of the sample through preset conditions to obtain the awakening word sample library of the current awakening word;
the real-time audio monitoring module is used for monitoring the audio stream in real time by using a sliding window with a first preset length and filtering the mute audio data to obtain non-mute audio data;
the audio characteristic extraction module is used for inputting the non-silent audio data into the audio characteristic extraction model to extract the content characteristic vector of the audio in the current sliding window;
and the characteristic matching module is used for comparing the similarity of the extracted content characteristic vector with the awakening word sample library, calculating to obtain a characteristic distance numerical value, comparing the characteristic distance numerical value with a preset threshold value, and sending a signal for monitoring the awakening word if the matching is successful when the characteristic distance numerical value is smaller than the preset threshold value.
9. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the voice wake-up method based on voice affinity matching as recited in any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the voice wake-up method based on voice similarity matching according to any one of claims 1-7.
CN202110341328.XA 2021-03-30 2021-03-30 Voice awakening method and system based on voice similarity matching Pending CN115223551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110341328.XA CN115223551A (en) 2021-03-30 2021-03-30 Voice awakening method and system based on voice similarity matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110341328.XA CN115223551A (en) 2021-03-30 2021-03-30 Voice awakening method and system based on voice similarity matching

Publications (1)

Publication Number Publication Date
CN115223551A true CN115223551A (en) 2022-10-21

Family

ID=83605779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110341328.XA Pending CN115223551A (en) 2021-03-30 2021-03-30 Voice awakening method and system based on voice similarity matching

Country Status (1)

Country Link
CN (1) CN115223551A (en)

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
JP6229046B2 (en) Speech data recognition method, device and server for distinguishing local rounds
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN113327626B (en) Voice noise reduction method, device, equipment and storage medium
US9997168B2 (en) Method and apparatus for signal extraction of audio signal
CN112612761A (en) Data cleaning method, device, equipment and storage medium
CN112489625A (en) Voice emotion recognition method, system, mobile terminal and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN110689885A (en) Machine-synthesized speech recognition method, device, storage medium and electronic equipment
CN114842855A (en) Training and awakening method, device, equipment and storage medium of voice awakening model
CN110675858A (en) Terminal control method and device based on emotion recognition
CN115223551A (en) Voice awakening method and system based on voice similarity matching
CN113179250B (en) Method and system for detecting unknown web threats
CN114218428A (en) Audio data clustering method, device, equipment and storage medium
CN114999531A (en) Speech emotion recognition method based on frequency spectrum segmentation and deep learning
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
CN114792518A (en) Voice recognition system based on scheduling domain technology, method thereof and storage medium
CN114937449A (en) Voice keyword recognition method and system
CN112632229A (en) Text clustering method and device
Feng et al. Noise Classification Speech Enhancement Generative Adversarial Network
CN112735470B (en) Audio cutting method, system, equipment and medium based on time delay neural network
CN113838467B (en) Voice processing method and device and electronic equipment
CN113808619B (en) Voice emotion recognition method and device and electronic equipment
CN112735385B (en) Voice endpoint detection method, device, computer equipment and storage medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination