CN116110376A - Keyword detection model training method, electronic equipment and storage medium


Info

Publication number
CN116110376A
CN116110376A
Authority
CN
China
Prior art keywords
audio
keyword
text
acoustic
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310132690.5A
Other languages
Chinese (zh)
Inventor
俞凯
奚彧
杨宝琛
李豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd
Priority to CN202310132690.5A
Publication of CN116110376A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a keyword detection model training method, an electronic device, and a storage medium, wherein the method comprises: constructing matching pairs of audio segments and text keywords from generic data, so that the keyword detection model learns to determine whether an audio segment contains a keyword; and constructing matching pairs of positive-example audio and matching pairs of negative-example audio, and using contrastive learning to pull the positive-example audio pairs closer together in the representation space and push the negative-example audio pairs farther apart. Embodiments of the present application alleviate the problem of insufficient data by building a large number of audio-audio and audio-text pairs from automatic speech recognition (ASR) datasets to learn effective keyword representations. This not only makes full use of a large amount of existing data but also greatly improves the robustness of the model. In addition, the embodiments of the present application support arbitrarily customized wake-up words and learn better wake-up word representations through contrastive learning.

Description

Keyword detection model training method, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of keyword detection model training, and particularly relates to a keyword detection model training method, electronic equipment and a storage medium.
Background
In the related art, wake-up word detection systems are deployed by all major voice providers. Wake-up systems on mobile devices, such as Baidu's "Xiao Du", Xiaomi's "Xiao Ai Tongxue", and Apple's "Hey Siri", are implemented with keyword detection systems: the device responds to the user's commands through a wake-up word detection model that runs continuously around the clock.
Keywords are usually preset at the factory and cannot be modified by the user at will; the scope for do-it-yourself customization is poor, the degree of user personalization could be further improved, and there is still room for improvement in performance.
Thus, while the above efforts on keyword spotting tasks greatly improve performance under certain specific conditions, some unresolved issues limit the versatility of these approaches.
The inventors found in the course of implementing the present application that: 1) conventional keyword spotting pipelines are complex; the distinctive non-neural search process incurs extra overhead and causes a mismatch between training and testing, and a number of tunable hyperparameters are typically required during the detection phase to reduce the false-alarm rate; 2) most methods target preset-keyword scenarios and are not suitable for supporting arbitrarily user-customized keywords; in customizable keyword scenarios, their performance typically drops drastically and they may even become entirely unusable. While some work has been devoted to customizable keyword spotting tasks, practical challenges remain, such as the additional computational cost of adaptive detection, streaming the model, and insufficient data.
Disclosure of Invention
The embodiments of the invention provide a keyword detection model training method, an electronic device, and a storage medium, intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a keyword detection model training method, including: constructing matching pairs of audio segments and text keywords from generic data, so that the keyword detection model learns to determine whether an audio segment contains a keyword; and constructing matching pairs of positive-example audio and matching pairs of negative-example audio, and using contrastive learning to pull the positive-example audio pairs closer together in the representation space and push the negative-example audio pairs farther apart.
In a second aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the keyword detection model training method of any one of the embodiments of the present invention.
In a third aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the keyword detection model training method of any of the embodiments of the present invention.
The method of the embodiments of the present application alleviates the problem of insufficient data by building a large number of audio-audio and audio-text pairs from automatic speech recognition (ASR) datasets to learn effective keyword representations. This not only makes full use of a large amount of existing data but also greatly improves the robustness of the model. In addition, the embodiments of the present application support arbitrarily customized wake-up words and learn better wake-up word representations through contrastive learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a keyword detection model training method according to an embodiment of the present invention;
FIG. 2 is an overview of the overall framework provided by an embodiment of the present invention;
FIG. 3 is a graph showing the results of the baseline models and the proposed model provided by an embodiment of the present invention;
FIG. 4 shows the results of models trained on both audio-audio and audio-text pairs versus audio-text pairs alone, provided by an embodiment of the present invention;
FIG. 5 is a comparison of the standard deviation of recall over different keywords for the baseline and proposed systems provided by an embodiment of the present invention;
FIG. 6 is a comparison of the relative speed acceleration (RSA) of the baseline and proposed systems provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a flowchart of one embodiment of a keyword detection model training method of the present application is shown.
As shown in fig. 1, in step 101, matching pairs of audio segments and text keywords are constructed from generic data, so that the keyword detection model learns to determine whether an audio segment contains a keyword;
in step 102, matching pairs of positive-example audio and matching pairs of negative-example audio are constructed, and contrastive learning is used to pull the positive-example audio pairs closer together in the representation space and push the negative-example audio pairs farther apart.
In the embodiments of the present application, the problem of insufficient data is alleviated by building a large number of audio-audio and audio-text pairs from automatic speech recognition (ASR) datasets to learn effective keyword representations. This not only makes full use of a large amount of existing data but also greatly improves the robustness of the model. In addition, the embodiments of the present application support arbitrarily customized wake-up words and learn better wake-up word representations through contrastive learning.
In some alternative embodiments, the constructing of the matching pairs of positive-example audio and the matching pairs of negative-example audio includes: randomly sampling arbitrary words from the corpus of each mini-batch as the keywords for that mini-batch, wherein the corpus comprises audio and its corresponding text; and cutting several positive-example audio segments from the original audio for each keyword to obtain multiple positive-example matching pairs, wherein audio other than the positive-example audio containing the keyword is classified as negative-example audio.
In a further alternative embodiment, the loss function for contrastive learning is the InfoNCE loss, which computes not only the loss of the audio/text-keyword matching pairs but also the loss of the audio/audio matching pairs.
In a further alternative embodiment, the keyword detection model includes an acoustic model, and the constructing of matching pairs of audio segments and text keywords from generic data includes:
pre-training the acoustic model in a supervised manner to provide frame-level acoustic hidden representations, and freezing the parameters of the pre-trained acoustic model after pre-training is completed.
In a further alternative embodiment, the keyword detection model further includes an acoustic embedding encoder, a keyword sampler, a text embedding encoder, and a similarity calculation module.
In a further alternative embodiment, using contrastive learning includes: obtaining the acoustic hidden features output by the pre-trained acoustic model; encoding the acoustic hidden features into an acoustic embedding with the acoustic embedding encoder; obtaining a phoneme sequence from the text keyword with the keyword sampler and encoding the phoneme sequence into a text embedding with the text embedding encoder; and computing the similarity between the acoustic embedding and the text embedding with the similarity calculation module, determining whether a keyword is present based on the similarity, and training the keyword detection model with the contrastive learning loss function.
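As an illustration of how these modules fit together, the following minimal PyTorch sketch wires a frozen acoustic model's hidden features and a keyword phoneme sequence into a similarity score. All module choices, names, and sizes here (plain LSTM encoders, 128-dimensional embeddings, the projection layers) are assumptions for illustration, not the reference implementation of the embodiment.

import torch
import torch.nn as nn

class KeywordScorer(nn.Module):
    def __init__(self, acoustic_dim=128, phoneme_vocab=70, embed_dim=128):
        super().__init__()
        # Encodes frame-level hidden features from the frozen AM into one vector.
        self.audio_encoder = nn.LSTM(acoustic_dim, embed_dim,
                                     num_layers=3, batch_first=True)
        # Encodes the keyword phoneme sequence into one vector.
        self.phone_embed = nn.Embedding(phoneme_vocab, embed_dim)
        self.text_encoder = nn.LSTM(embed_dim, embed_dim,
                                    num_layers=3, batch_first=True)
        # Two separate projections into the shared contrastive space.
        self.audio_proj = nn.Linear(embed_dim, embed_dim)
        self.text_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, acoustic_hidden, phoneme_ids):
        # acoustic_hidden: (B, T, acoustic_dim) from the frozen acoustic model
        # phoneme_ids:     (B, U) keyword phoneme indices from the sampler
        _, (a_h, _) = self.audio_encoder(acoustic_hidden)
        _, (t_h, _) = self.text_encoder(self.phone_embed(phoneme_ids))
        a = self.audio_proj(a_h[-1])    # (B, embed_dim) acoustic embedding
        t = self.text_proj(t_h[-1])     # (B, embed_dim) text embedding
        # Cosine similarity decides whether the segment matches the keyword.
        return torch.cosine_similarity(a, t, dim=-1)

In this sketch the last hidden state of each encoder serves as the single-vector embedding; the configuration actually used in the experiments is described in Section 3.2 below.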
In some alternative embodiments, the keyword is a wake-up word, and the keyword detection model is a wake-up word detection model.
It should be noted that the above method steps are not limited to the described order of execution; in fact, some steps may be executed simultaneously or in reverse order, which the present application does not restrict.
To aid understanding of the solution of the present application, the following describes some problems the inventors encountered in implementing the invention and one specific embodiment of the finally determined solution.
The inventors found that the above drawbacks are mainly caused by the following: the technology for user-defined wake-word models is not yet mature, the hardware resources available to such models are relatively limited, and wake-up accuracy and false wake-ups are difficult to control.
To address the above deficiencies in the related art, those skilled in the art typically predefine one or a small number of wake words for the user, or customize a set of syllables from which the user can compose a wake word, which still falls short of full customization. The solution of the embodiments of the present application gives the user a higher degree of freedom, which in turn requires the model to have stronger modeling and custom wake-word detection capabilities.
During model training, keywords are randomly sampled for training, so that the model can handle arbitrarily defined wake-up words at deployment. Specifically, we construct a large number of matching pairs of audio segments and text keywords from large-scale generic data during training, allowing the model to learn whether an audio segment contains a given text keyword. Meanwhile, to further strengthen the model's discrimination between different audio segments, a large number of matching pairs of positive-example and negative-example audio are also constructed, so that the model can further distinguish audio segments that contain the keyword from those that do not. In the learning stage, contrastive learning pulls well-matched positive examples closer together in the representation space and pushes poorly matched negative examples as far apart as possible. The result is a model whose wake-up words can be freely defined at test time.
Referring to fig. 2, a block diagram of a keyword detection model training system according to an embodiment of the present application is shown. In the figure, "negative segment" denotes a negative-example segment and "positive segment" denotes a positive-example segment.
The scheme of the embodiments of the present application makes full use of a large amount of existing data and greatly improves the robustness of the model. In addition, the scheme supports arbitrarily customized wake-up words and learns better wake-up word representations through contrastive learning. At the same time, since there is no post-processing stage, our model is completely end-to-end, so its latency is very low; lower latency can be obtained compared with currently mainstream systems.
The following verifies the beneficial effects of the embodiments of the present application over the prior art through specific experiments and experimental data.
Customizable keyword spotting has attracted a great deal of attention in the KWS field. Although systems with preset wake-up words have achieved good enough performance, it is still very challenging to construct a highly customizable keyword spotting system due to the lack of task-specific training data and hardware resource limitations. In the embodiments of the present application, we build COBE-KWS, an end-to-end framework for customizable wake-up words, using contrastive learning and generic training data. By constructing a large number of audio-audio and audio-text matching pairs from the training data of an automatic speech recognition system during training, the problem of insufficient training data is alleviated, so that keyword representations are learned better. Experiments on a keyword-spotting version of the LibriSpeech dataset demonstrate that our proposed framework achieves consistent performance improvements over two different baseline models across different numbers of keywords, especially for less frequent keywords and more complex acoustic environments. At the same time, our proposed framework has a smaller standard deviation across different keywords and a higher relative speed acceleration, which suggests that our model has higher stability and lower latency.
Keyword spotting (KWS) technology, particularly the well-known wake-up word detection (WWD) technology, has been widely used as an important portal of human-machine interaction systems for smart home devices and smart cockpits. Although fixed wake-up word detection systems have achieved satisfactory performance, building a personalizable system remains a significant challenge.
There is a large literature on wake-up word detection in continuous speech. Classical systems are based on the keyword/filler hidden Markov model (HMM), which consists of an acoustic model and a decoding graph; how the graph is built and scored over the keyword and filler paths is critical to system performance. Later, with the development of deep learning, a system composed of an acoustic model and a posterior-processing module was proposed in the related art, simplifying the training and inference procedures. A great deal of follow-up work built on this basis: some modified the neural network structure, replacing DNNs with structures of greater modeling capacity such as CNNs, LSTMs, and attention-based networks; others improved the posterior-processing algorithms, considering the relative order of the keyword label sequence to significantly reduce false alarms (FAs), or implemented new decoding algorithms.
The above methods are designed for predefined wake-up word detection systems and do not allow the user to specify preferred keywords. Furthermore, to achieve high recall and accuracy, a large amount of data containing the specific wake-up word must be collected to train a robust model, which is laborious. In recent years, some work has supported user-defined keywords, which means the model cannot be made more sensitive to one or a few words during training; it must treat every word equally. Therefore, how to use generic ASR data so that the model performs well on keywords not predefined during training has become a key research area. The related art describes end-to-end customizable keyword spotting systems: RNN-T systems have been proposed to recognize arbitrary wake-up words, and a detection module has been designed that formulates the customizable KWS task as a detection task. From these works we can roughly summarize some of the current challenges, listed below.
- Lack of task-specific data. The amount and quality of training data directly determine model performance. However, because a customizable KWS system cannot predict the inference-time keywords, researchers and developers cannot prepare large amounts of keyword-specific data to support model training the way predefined KWS systems do.
- High computational cost and latency. Whether HMM-based, DNN-based, or ASR-based, all three classes of methods require search algorithms, which introduce additional computational cost and runtime latency.
In the embodiments of the present application, inspired by contrastive learning and query-by-example (QbyE) KWS, the inventors propose COBE-KWS, an end-to-end keyword spotting framework based on contrastive learning in which arbitrary keywords can be customized at inference time, to improve on the problems mentioned above. The core contributions of the embodiments of the present application can be summarized as follows.
- Fully exploiting generic ASR data to alleviate the data shortage. We construct training samples from generic audio and its corresponding text. With the help of contrastive learning, the model can learn effective acoustic and text representations.
- A post-processing-agnostic, low-latency framework. Different decoding algorithms do not affect performance, because the system of the embodiments of the present application has no post-processing module. Moreover, apart from the cost of the forward pass, there is little additional computational cost during inference, so the inference process is fast.
2. Methodology
In this section, the overall architecture is presented and the training and inference process of the system of the embodiments of the present application is described in detail. First, we pre-train the acoustic model (AM) in a supervised manner to provide frame-level hidden representations. Subsequently, we build a contrastive-learning-based end-to-end customizable KWS (COBE-KWS) system consisting of a keyword sampler, the aforementioned frame-level pre-trained acoustic model, text and acoustic embedding encoders, and a similarity calculation module. Fig. 2 illustrates an example of the training process of the KWS system.
2.1. Frame level AM training
Before the KWS system is built, the AM is pre-trained to provide hidden acoustic representations. Phonemes are the modeling units, and forced alignment should be completed in advance to prepare the frame-level acoustic targets. The acoustic model is then trained in a supervised manner with the cross-entropy criterion. After pre-training is completed, the acoustic model is frozen, so that it only provides frame-level acoustic hidden features to the KWS system.
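The following sketch illustrates this pre-training and freezing step in PyTorch; the function names and tensor layouts are assumptions for illustration only.

import torch.nn.functional as F

def pretrain_step(acoustic_model, feats, frame_targets, optimizer):
    # feats: (B, T, D) acoustic features; frame_targets: (B, T) phoneme ids
    # obtained by forced alignment (the frame-level acoustic targets).
    logits = acoustic_model(feats)                 # (B, T, num_phonemes)
    loss = F.cross_entropy(logits.transpose(1, 2), frame_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def freeze(acoustic_model):
    # After pre-training, the AM only supplies frame-level hidden features.
    for p in acoustic_model.parameters():
        p.requires_grad = False
    acoustic_model.eval()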
2.2. Building training pairs
As mentioned in the introduction, the lack of training data is inherent to customizable KWS tasks, since the test keyword phrases are not known in advance. We cannot collect a large amount of data for one or a few specific keywords as in a fixed KWS task, so effectively using generic data becomes a fundamental factor in training a powerful customizable KWS model. Inspired by multimodal representation learning and the contrastive learning in QbyE KWS, mining the matching information between audio segments and text keywords in a self-supervised manner is a meaningful way to alleviate the data shortage.
From other related work on contrastive learning and our experimental observations, we know that the construction of positive and negative training samples has a great impact on representation learning. We therefore designed both positive and negative samples, including audio-audio and audio-text pairs, to learn more effective representations.
2.2.1. What is the length of each segment?
Based on the consensus that fewer mismatches between training and inference yield better model performance, we use the same segment-length estimation for training and testing. We compute the average duration $T_{mean}$ of all phonemes in the ASR dataset and then choose the segment length $L_{seg}$ as follows:

$L_{seg} = T_{mean} \cdot N_{phns} + L_{margin}$,  (1)

where $N_{phns}$ is the number of keyword phonemes and $L_{margin}$ is a hyperparameter providing extra length so that the segment can contain keywords whose pronunciation is longer than the average. In our experiments, $T_{mean}$ equals 90 ms and $L_{margin}$ equals 300 ms.
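For concreteness, the estimate of Eq. (1) with the values reported above can be computed as follows (a trivial Python sketch; the millisecond values are those stated in the text):

# Segment-length estimate of Eq. (1) with the reported values.
T_MEAN_MS = 90      # average phoneme duration estimated from the ASR data
L_MARGIN_MS = 300   # extra-margin hyperparameter

def segment_length_ms(n_keyword_phonemes: int) -> int:
    return T_MEAN_MS * n_keyword_phonemes + L_MARGIN_MS

# e.g. a 6-phoneme keyword gets a 90 * 6 + 300 = 840 ms window
assert segment_length_ms(6) == 840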
2.2.2. Audio-audio pairs
As shown in fig. 2, we randomly sample arbitrary words from the corpus of each mini-batch as the keywords for that mini-batch. Based on the overlap ratio between the estimated window and the actual keyword position, we cut several positive and negative audio segments from the original audio for each keyword.

For any mini-batch, we denote the $j$-th sampled text keyword in the $i$-th corpus as $W_{i,j}$, its $N$ corresponding positive audio segments as $A^{p_1}_{i,j}, A^{p_2}_{i,j}, \dots, A^{p_N}_{i,j}$, and its $M$ corresponding negative audio segments as $A^{n_1}_{i,j}, A^{n_2}_{i,j}, \dots, A^{n_M}_{i,j}$.

Any two segments $A^{p_k}_{i,j}$ and $A^{p_l}_{i,j}$ ($k \neq l$) are regarded as a positive pair, and any $A^{p_k}_{i,j}$ together with any $A^{n_x}_{i,j}$ is regarded as a negative pair. Since each sampled keyword yields $C_N^2 = N(N-1)/2$ positive pairs, the total number of audio-audio pairs is enormous.
2.2.3. Audio-text pairs
When constructing audio-text pairs, we consider only the positive audio segments $A^{p_k}_{i,j}$ and the keyword $W_{i,j}$, ignoring the negative audio segments $A^{n_x}_{i,j}$. Each positive audio-text pair consists of an audio segment $A^{p_k}_{i,j}$ and its corresponding keyword $W_{i,j}$. Similar to other contrastive learning work, a positive audio sample $A^{p_k}_{i,j}$ paired with any text keyword in the mini-batch other than its own keyword $W_{i,j}$ is regarded as a negative audio-text pair.
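A small Python sketch of the pair construction in Sections 2.2.2 and 2.2.3 is given below; the argument names are illustrative assumptions, and segments are treated as opaque objects:

from itertools import combinations

def build_training_pairs(pos_segments, neg_segments, keyword, other_keywords):
    # pos_segments: the N positive segments of one sampled keyword (A^p_{i,j})
    # neg_segments: the M negative segments cut around it (A^n_{i,j})
    # other_keywords: the remaining text keywords sampled in the mini-batch
    audio_audio_pos = list(combinations(pos_segments, 2))   # C(N,2) pairs
    audio_audio_neg = [(p, n) for p in pos_segments for n in neg_segments]
    audio_text_pos = [(p, keyword) for p in pos_segments]
    audio_text_neg = [(p, w) for p in pos_segments for w in other_keywords]
    return audio_audio_pos, audio_audio_neg, audio_text_pos, audio_text_neg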
2.3. Training strategy
2.3.1. InfoNCE loss
The InfoNCE loss is widely used in contrastive learning tasks; it forces positive pairs closer together in the contrastive space while pushing negative pairs farther apart. The loss for a positive sample $p_i$ is expressed as:

$\mathcal{L}(p_i) = -\log \dfrac{\exp(\mathrm{sim}(p_i, p_i^+)/\tau)}{\exp(\mathrm{sim}(p_i, p_i^+)/\tau) + \sum_j \exp(\mathrm{sim}(p_i, n_j^-)/\tau)}$,  (2)

where $p_i$ and $p_i^+$ form the positive pair, and $p_i$ with each negative sample $n_j^-$ in the set $\{n^-_i\}$ forms a different negative pair. $\mathrm{sim}$ is a similarity calculation function, and $\tau$ is a temperature hyperparameter controlling the concentration of features in the representation space.
2.3.2. InfoNCE loss in the proposed model
As described in Section 2.2, two types of pairs effectively aid representation learning, so we apply the InfoNCE loss to audio-audio pairs and audio-text pairs separately. For a mini-batch, the loss over audio-audio pairs is the loss of Eq. (2) averaged over all positive audio-audio pairs, computed with temperature $\tau_{aa}$:

$\mathcal{L}_{aa} = \dfrac{1}{|\mathcal{P}_{aa}|} \sum_{p_i \in \mathcal{P}_{aa}} \mathcal{L}(p_i)$,  (3)

where $aa$ abbreviates audio-audio pair and $\mathcal{P}_{aa}$ denotes the positive audio-audio pairs in the mini-batch.

The loss over audio-text pairs is defined in the same manner with temperature $\tau_{at}$:

$\mathcal{L}_{at} = \dfrac{1}{|\mathcal{P}_{at}|} \sum_{p_i \in \mathcal{P}_{at}} \mathcal{L}(p_i)$,  (4)

where $at$ abbreviates audio-text pair.

The final training loss is defined as:

$\mathcal{L} = \alpha \mathcal{L}_{aa} + \beta \mathcal{L}_{at}$,  (5)

where $\alpha$ and $\beta$ are two hyperparameters balancing the two parts. Empirically, we set the temperature $\tau_{aa} = 0.07$ in Eq. (3), the temperature $\tau_{at} = 0.12$ in Eq. (4), and $\alpha = 0.15$, $\beta = 1.0$ in Eq. (5).
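The following PyTorch sketch shows one way to compute the InfoNCE loss of Eq. (2) and combine the two parts as in Eq. (5); the tensor layouts and the per-anchor call pattern are assumptions for illustration, not the embodiment's reference code:

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau):
    # anchor, positive: embeddings of shape (D,); negatives: (K, D).
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau
    logits = torch.cat([pos.view(1), neg])
    # Eq. (2): negative log-probability of the positive pair.
    return -F.log_softmax(logits, dim=0)[0]

def total_loss(loss_aa, loss_at, alpha=0.15, beta=1.0):
    # Eq. (5): weighted sum of the audio-audio and audio-text losses.
    return alpha * loss_aa + beta * loss_at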
2.4. Inference phase
The test phase is similar to the traditional inference of deep QbyE KWS: the model decides whether the keyword is present from the similarity between the audio-segment embedding and the text-keyword embedding.
In the inference phase, we cut the test audio into shorter segments with a window whose length is estimated from the number of keyword phonemes, consistent with the training phase described earlier. The audio segments and the keyword phonemes are then fed into the model to determine whether each segment contains the keyword. The system is streaming, with a delay of one hop length between two adjacent audio segments.
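A minimal sketch of this sliding-window, streaming inference loop is shown below; score_fn stands for the model forward pass described above, and all names are illustrative:

def detect_keyword(score_fn, utterance_ms, win_len_ms, hop_ms, threshold):
    # score_fn(start_ms, end_ms) -> similarity between the audio segment
    # in [start_ms, end_ms) and the keyword embedding (one forward pass).
    hits = []
    start = 0
    while start + win_len_ms <= utterance_ms:
        if score_fn(start, start + win_len_ms) >= threshold:
            hits.append(start)      # keyword detected at this offset
        start += hop_ms             # the streaming delay is one hop length
    return hits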
3. Experimental configuration
3.1. Data set
High-quality public datasets are of great significance to the development of a research field, but unfortunately, standard datasets for continuous-speech customizable KWS tasks are scarce. We therefore built a keyword-spotting version of the LibriSpeech dataset to evaluate performance on a continuous-speech customizable KWS task. Keywords are selected based on word frequency and phoneme count, to be consistent with realistic application scenarios: we selected the 5 to 50 most frequent words from the test-clean and test-other test sets, respectively, each word having at least six phonemes. In addition, audio whose phoneme sequence entirely contains that of a keyword is also regarded as a positive sample; for example, audio whose phoneme sequence fully contains the keyword "together" is considered a positive sample of "together".
All audio in a test dataset other than the keyword samples is assigned to the corresponding false-alarm dataset, each of which is about 3 hours long.
3.2. Details of implementation
We compare our proposed model against two widely used streaming KWS baselines. The first is the well-known HMM-based method, which contains an acoustic model and keyword/filler HMMs; during inference, keyword hypotheses are searched by the Viterbi algorithm in the decoding graph. The second system, first proposed in the related art (referred to below as HMM-free), comprises an acoustic model and a post-processing module; we use a post-processing algorithm proposed in the related art instead of the original one to reduce false alarms. The acoustic models in both baselines are identical.
The proposed COBE-KWS system uses the same frame-level acoustic model as the baselines to output acoustic bottleneck embeddings. The attention-based LSTM encoders have three layers each, and the acoustic embedding and the phoneme embedding are each represented as a single vector. These vectors are mapped into the contrastive embedding space by two separate fully connected (FC) layers, and cosine similarity is applied to obtain the final similarity score between the two contrastive embeddings. The acoustic, text, hidden, and contrastive embeddings are all 128-dimensional vectors. The models are trained with a batch size of 12288 frames and an SGD optimizer.
3.3. Evaluation metrics
We use micro recall (short for micro-average recall) to evaluate the average performance of the model over all keywords at a specific ultra-low number of false alarms, namely 2 (less than 1 per hour). Micro recall measures recall over the aggregated contributions of all keywords. We also computed macro statistics, such as macro recall and macro F1 score, to prevent keywords with huge sample counts from dominating the micro statistics. However, the micro and macro results showed similar trends during our experiments, so for simplicity we report only micro recall.
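The distinction between the two statistics can be sketched as follows (illustrative helper functions, not part of the embodiment):

def micro_recall(tp_per_kw, pos_per_kw):
    # Pools true positives and positives over all keywords before dividing.
    return sum(tp_per_kw) / sum(pos_per_kw)

def macro_recall(tp_per_kw, pos_per_kw):
    # Averages each keyword's own recall, so every keyword weighs equally.
    recalls = [tp / p for tp, p in zip(tp_per_kw, pos_per_kw)]
    return sum(recalls) / len(recalls)

# e.g. two keywords with (tp, positives) = (90, 100) and (1, 10):
# micro = 91/110 = 0.83, macro = (0.9 + 0.1)/2 = 0.5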
Fig. 3 shows the results of the baseline models and the proposed model. We report micro recall for 5, 10, 15, 20, 30, 40, and 50 keywords, given that the number of false alarms is 2 on the corresponding false-alarm dataset.
4. Results and analysis
4.1. Results
The experimental results are presented in fig. 3. When the test dataset is relatively simple, the HMM/Filler baseline performs better on high-frequency keywords, while its performance drops dramatically in complex acoustic environments or on relatively rare keywords. Except for the settings with relatively few keywords (5, 10, 15) on the test-clean dataset against the HMM-based baseline, our model achieves consistent gains from 5 to 50 words on both the simple test-clean dataset and the challenging test-other dataset compared with the two baseline models.
When the number of keywords is 5 or 10, the results mainly reflect the ability to model high-frequency words. In contrast, when the number of keywords is relatively large, such as 40 or 50, the results better reflect the model's ability to generalize to any given keyword. For customizable KWS systems, both aspects are critical to researchers and users: good results on numerous keywords give the user the ability to choose keywords arbitrarily, while good results on high-frequency words provide the possibility of further customization after the keyword is determined.
Fig. 4 shows the results of models trained on both audio-audio and audio-text pairs versus audio-text pairs alone. "Clean" denotes results on the test-clean dataset and "Other" denotes results on the test-other dataset; "aa" denotes audio-audio pairs and "at" denotes audio-text pairs.
4.2. Importance of audio-audio pairs
To explore the effectiveness of the proposed audio-audio pair construction, we performed an ablation experiment using only audio-text pairs, i.e., seeking matching information only between text keywords and acoustic segments; the results are shown in fig. 4. The degradation in performance is dramatic. This is not surprising, since audio-audio pairs are critical to the model's performance: they introduce discrimination between positive segments and locally similar negative segments, greatly reducing false alarms.
Fig. 5 shows a comparison of the standard deviation of recall over different keywords for the baseline and proposed systems. The lower the standard deviation across keywords, the higher the stability of the system.
4.3. Stability and generalization
We list the standard deviation (STD) of recall over all test keywords in fig. 5. Compared with the baseline models, the STD of our proposed system on the test-clean dataset is much lower, except against the 10- and 15-keyword results of the HMM/Filler model. Even when the number of keywords is large, the performance of the proposed model remains stable. On the test-other dataset the gap is less obvious, though when the number of keywords is below 20 the STD of the proposed model is still much lower. When the number of keywords rises above 30 the STDs are very close, which suggests that in complex acoustic environments with low-frequency customized keywords our model also exhibits significant performance differences between keywords. This points to a direction for further improving our system.
4.4. Inference speed
In this section, we use the relative speed acceleration (RSA) to explore inference latency. RSA is defined as the ratio of the execution time of the baseline to that of the model. As shown in fig. 6, the proposed COBE-KWS system achieves a breakthrough in inference speed compared with the baselines, which is critical for KWS tasks.
Fig. 6 shows a comparison of the relative speed acceleration (RSA) of the baseline and proposed systems. The higher the RSA, the faster the inference.
5. Conclusion
The embodiments of the present application provide an end-to-end framework based on contrastive learning for customizable keyword spotting tasks. We described how to construct huge numbers of audio-audio pairs and audio-text pairs to alleviate the data shortage in this task. With the help of contrastive learning, the model effectively learns representations of different text keywords and audio segments. The similarity between a keyword and an audio segment is taken as a confidence score that determines whether the audio contains the keyword. Our approach outperforms the baseline models on both simple and complex test sets, from a few randomly selected keywords to a large number of keywords. In addition, the proposed system has more stable performance and faster inference. In future work, we will explore how to further exploit generic ASR data to improve the performance of customizable KWS systems in complex acoustic environments.
In other embodiments, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer executable instructions that can perform the keyword detection model training method in any of the above method embodiments;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
constructing matching pairs of audio segments and text keywords from generic data, so that the keyword detection model learns to determine whether an audio segment contains a keyword; and
constructing matching pairs of positive-example audio and matching pairs of negative-example audio, and using contrastive learning to pull the positive-example audio pairs closer together in the representation space and push the negative-example audio pairs farther apart.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from the use of the keyword detection model training system, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located with respect to the processor, the remote memory being connectable to the keyword detection model training system through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the keyword detection model training methods described above.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the device includes one or more processors 710 and a memory 720 (one processor 710 is illustrated in fig. 7). The device for the keyword detection model training method and system may further include an input device 730 and an output device 740. The processor 710, memory 720, input device 730, and output device 740 may be connected by a bus or other means; a bus connection is illustrated in fig. 7. The memory 720 is the non-volatile computer-readable storage medium described above. The processor 710 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 720, i.e., implements the keyword detection model training method and system described above. The input device 730 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the device. The output device 740 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As one implementation mode, the electronic device is applied to a keyword detection model training system, and comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
constructing a matching pair of the audio fragment and the text keyword by utilizing the general data so as to enable the keyword detection model to learn the capability of whether the audio fragment contains the keyword or not;
and constructing a matching pair of the positive example audio and a matching pair of the negative example audio, enabling the matching pair of the positive example audio to be closer in a characterization space by using a comparison learning mode, and enabling the matching pair of the negative example audio to be farther in the characterization space.
The electronic device of the embodiments of the present application exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include: smart phones, multimedia phones, functional phones, low-end phones, etc.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) Portable entertainment device: such devices can display and play multimedia content. They include: audio and video players, handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) And (3) a server: the configuration of the server includes a processor, a hard disk, a memory, a system bus, and the like, and the server is similar to a general computer architecture, but is required to provide highly reliable services, and thus has high requirements in terms of processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or methods of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A keyword detection model training method, comprising:
constructing matching pairs of audio segments and text keywords from generic data, so that a keyword detection model learns to determine whether an audio segment contains a keyword; and
constructing matching pairs of positive-example audio and matching pairs of negative-example audio, and using contrastive learning to pull the positive-example audio pairs closer together in the representation space and push the negative-example audio pairs farther apart.
2. The method of claim 1, wherein the constructing of the matching pairs of positive-example audio and the matching pairs of negative-example audio comprises:
randomly sampling arbitrary words from the corpus of each mini-batch as the keywords for that mini-batch, wherein the corpus comprises audio and its corresponding text; and
cutting several positive-example audio segments from the original audio for each keyword to obtain multiple positive-example matching pairs, wherein audio other than the positive-example audio containing the keyword is classified as negative-example audio.
3. The method of claim 2, wherein the loss function for the contrastive learning is the InfoNCE loss, which computes not only the loss of the audio/text-keyword matching pairs but also the loss of the audio/audio matching pairs.
4. The method of claim 3, wherein the keyword detection model comprises an acoustic model, and the constructing of matching pairs of audio segments and text keywords from generic data comprises:
pre-training the acoustic model in a supervised manner to provide frame-level acoustic hidden representations, and freezing the parameters of the pre-trained acoustic model after pre-training is completed.
5. The method of claim 4, wherein the keyword detection model further comprises an acoustic embedding encoder, a keyword sampler, a text embedding encoder, and a similarity calculation module.
6. The method of claim 5, wherein the using of contrastive learning comprises:
obtaining the acoustic hidden features output by the pre-trained acoustic model;
encoding the acoustic hidden features into an acoustic embedding with the acoustic embedding encoder;
obtaining a phoneme sequence from the text keyword with the keyword sampler, and encoding the phoneme sequence into a text embedding with the text embedding encoder; and
computing the similarity between the acoustic embedding and the text embedding with the similarity calculation module, determining whether the keyword is present based on the similarity, and training the keyword detection model based on the contrastive learning loss function.
7. The method of any of claims 1-6, wherein the keyword is a wake-up word and the keyword detection model is a wake-up word detection model.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 7.
9. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 7.
CN202310132690.5A 2023-02-17 2023-02-17 Keyword detection model training method, electronic equipment and storage medium Pending CN116110376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310132690.5A CN116110376A (en) 2023-02-17 2023-02-17 Keyword detection model training method, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116110376A (en) 2023-05-12

Family

ID=86259675

Country Status (1)

Country Link
CN (1) CN116110376A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination