CN112509560A - Voice recognition self-adaption method and system based on cache language model - Google Patents


Info

Publication number
CN112509560A
Authority
CN
China
Prior art keywords
recognition
language model
model
module
cache
Prior art date
Legal status
Granted
Application number
CN202011332443.2A
Other languages
Chinese (zh)
Other versions
CN112509560B (en)
Inventor
黄俊杰
Current Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202011332443.2A priority Critical patent/CN112509560B/en
Publication of CN112509560A publication Critical patent/CN112509560A/en
Application granted granted Critical
Publication of CN112509560B publication Critical patent/CN112509560B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses a speech recognition self-adaption method and system based on a cache language model, belonging to the field of speech recognition. The method comprises: receiving a continuous voice signal input by a user; segmenting the continuous voice signal into a plurality of short voices based on the voice activity detection (VAD) technique; recognizing the short voices in sequence based on a general language model, generating a corresponding recognition result for each short voice; searching by keyword to obtain associated words; processing the associated words through a cache model to obtain a language model adapted to local changes in the distribution of the historical recognition text; and continuing to recognize subsequent short voices based on the modified language model. After local modification, the language model better matches the historical recognition content, improving the accuracy of continuous long-speech recognition. In addition, the user can actively correct mis-recognized low-frequency words during recognition, improving the subsequent recognition accuracy of those words.

Description

Voice recognition self-adaption method and system based on cache language model
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition self-adaption method and system based on a cache language model.
Background
After decades of development, speech recognition technology has matured; in practical applications, systems such as Siri and Cortana achieve high recognition accuracy under ideal conditions.
The performance of a speech recognition system depends to a large extent on the similarity between the Language Model (LM) used and the task to be processed. This similarity is particularly important in cases where the statistical properties of the language change over time, such as in application scenarios involving spontaneous and multi-domain speech. Topic Identification (TI) based on information retrieval is a key technology, and a topic under discussion is obtained through semantic analysis of a historical identification result, so that a language model is adjusted, and dynamic self-adaptation is realized.
A problem with topic identification is that individual low-frequency words may cause unnecessary changes to the language model because of the apparent domain characteristics they carry. On the speech-signal-processing side, current speech recognition systems mainly adopt single-sentence task recognition: no matter how long the input speech is, the system recognizes each sentence, delimited by the voice activity detection (VAD) result, as an independent task. The advantage is better recognition timeliness and, to some extent, reduced system overhead.
For scenarios with strong contextual relations or a professional domain, such as academic conferences and interview transcripts, a single-sentence recognition system ignores the context, repeatedly makes the same mistakes on inaccurately recognized words, and cannot exploit domain information to recognize low-frequency words. On the other hand, for a speech recognition system configured with multiple domain language models, the domain model generally must be specified manually before recognition starts, or the outputs of multiple domain models must be selected by perplexity, which adds unnecessary steps and leaves the recognition system insufficiently intelligent.
Disclosure of Invention
The invention provides a speech recognition self-adaptive method and system based on a cache language model, aiming to solve the problem that an existing single-sentence-task speech recognition system cannot adaptively recognize low-frequency words from domain information, resulting either in low recognition accuracy for low-frequency words or in an overly complex recognition system. The method comprises the following steps: receiving a continuous voice signal input by a user; segmenting the continuous voice signal into a plurality of short voices based on voice activity detection (VAD); recognizing the short voices in sequence based on a general language model or a preset domain language model and generating recognition results; obtaining words associated with the recognition results by keyword search; and extending a cache language model with a nonparametric method based on a recurrent network (RNN) to obtain a new probability distribution over the historical recognized words and their associated words, yielding new word statistics. The invention uses a dynamic cache of historical recognition results to modify the probabilities of the language model, so that the speech recognition system adapts to recognition tasks with coherent domains while avoiding unnecessary changes to the domain language model.
In order to achieve the above object, the present invention adopts a speech recognition adaptive method based on a cache language model, which comprises the following steps.
Step 1: for a segment of continuous long speech, first obtain a plurality of short voices by segmentation, and form a task queue in time order.
Step 2: take the first short voice in the task queue as the input of an automatic speech recognition system, obtain a recognized text, and delete that short voice from the task queue. The automatic speech recognition system contains a dynamic language model, initialized from a preset language model.
Step 3: establish a cache model and judge in real time, from the recognized text of each short voice, whether probability correction is needed. If not, return to step 2 until the task queue is empty and the recognition task is complete. If so, perform a keyword search against the preset associated word list to obtain associated phrases, store them in a cache region of the cache model, calculate the local vocabulary probability distribution, and construct a local language model.
Step 4: interpolate the local language model constructed in step 3 with the dynamic language model in the automatic speech recognition system to obtain an updated dynamic language model; return to step 2 until the task queue is empty and the recognition task is complete.
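The control flow of steps 1-4 can be sketched as a short loop. This is an illustrative sketch only: the callables passed in (`asr.recognize`, `cache_model.store`, `needs_correction`, and so on) are hypothetical stand-ins for the patent's modules, not its actual implementation.

```python
from collections import deque

def adaptive_recognize(short_voices, asr, cache_model, needs_correction, search_related):
    """Sketch of steps 1-4: recognize queued short voices, updating the
    dynamic language model from cached associated phrases as needed."""
    queue = deque(short_voices)                     # step 1: task queue in time order
    results = []
    while queue:                                    # until the task queue is empty
        text = asr.recognize(queue.popleft())       # step 2: decode the first item
        results.append(text)
        if needs_correction(text):                  # step 3: low-frequency word found?
            phrases = search_related(text)          # keyword search in the word list
            cache_model.store(phrases)              # fill the cache region
            local_lm = cache_model.build_local_lm() # local 3-gram model
            asr.interpolate_lm(local_lm)            # step 4: update the dynamic LM
    return results
```

The dynamic language model lives inside the ASR object, so the update in step 4 takes effect for every subsequent iteration of the loop.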
Further, the recognized text output by the automatic speech recognition system in step 2 is manually corrected.
Furthermore, the preset language model is a 3-gram language model built on a large text corpus.
Further, the step 3 specifically comprises:
step 3.1: establishing a cache model which comprises a plurality of cache regions;
step 3.2: judge in real time, from the recognized text of each short voice, whether low-frequency words or strong-domain feature words exist; if so, probability correction is performed, go to step 3.3; if not, no probability correction is performed, return to step 2 until the task queue is empty and the recognition task is complete;
step 3.3: and searching keywords according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies.
Step 3.4: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
Further, when the number of associated phrases stored in the cache region of the cache model reaches a threshold on the order of one million, the formula for the local vocabulary probability distribution in step 3.3 is replaced by an approximate one.
Further, the step 4 specifically includes:
The local language model is interpolated with the dynamic language model in the automatic speech recognition system, correcting the probabilities of local words; the updated dynamic language model is obtained from the corrected word probabilities. Return to step 2 until the task queue is empty, completing the recognition task.
Another objective of the present invention is to provide a speech recognition adaptive system based on a cached language model, for implementing the above speech recognition adaptive method, including:
the voice signal receiving module: for receiving a continuous speech signal to be recognized;
the voice signal segmentation module: the voice processing device is used for cutting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
a storage module: used for storing the task queue to be recognized, the recognized text results, and the related-word corpus, wherein the task queue is formed from the plurality of short voices in time order;
an ASR decoding module: the system is provided with a task reading unit, an automatic voice recognition model and a recognition result output unit, wherein the automatic voice recognition model comprises a dynamic language sub-model; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
a probability correction judging module: the ASR decoding module is used for judging whether low-frequency words or strong-field characteristic words exist in the recognition text output by the ASR decoding module, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding module to start the recognition of the next short voice;
a related word search module: used for performing keyword search in the related-word corpus stored in the storage module to obtain the associated phrases of the keywords;
a language model modification module: used for updating the parameters of the dynamic language submodel in the ASR decoding module according to the real-time associated phrases and historical associated phrases obtained by the related word search module.
The invention has the following beneficial effects:
(1) A traditional speech recognition system demands large data volumes and computing resources during training, and users usually cannot retrain the model for their actual usage scenario, so the similarity between the language model and the actual recognition task is poor. The present method uses cached historical information: an extended cache model based on a recurrent network (RNN) processes historical recognition results so that the parametric model adapts to local changes in the distribution, and the empirical parameters optimized in advance on large data are interpolated with the original language model. By the locality principle, historical information has a higher probability of appearing in subsequent recognition tasks; that is, constructing the cache language model gradually improves the similarity between the language model and the actual recognition task, and thus the recognition accuracy of the speech recognition system.
(2) For recognition tasks with high domain relevance, a fixed language model cannot adapt to the diversity of the task corpus domain. The invention uses a plurality of domain text corpora prepared in advance to construct a relational word list. And after obtaining the historical cache information, searching in the relation word list to obtain an associated word group, constructing a cache language model according to the associated word group, and interpolating and combining the cache language model and the original language model for subsequent identification. The addition of the associated phrases improves the domain correlation of the language model and is beneficial to accurately identifying the identification task with stronger domain.
(3) For a long continuous recording and transcription recognition task, a traditional voice recognition system cannot interfere recognition results in recognition, and repeated errors of certain low-frequency words are caused. The manual correction module designed by the invention allows a user to manually correct the historical recognition result, improves the probability of the corresponding low-frequency words in the cache language model, avoids repeated mistakes of the low-frequency words and improves the accuracy of subsequent recognition.
Drawings
FIG. 1 is a flow diagram of a method of building a cached language model according to one embodiment of the invention.
Fig. 2 is a schematic structural diagram of a continuous speech recognition adaptive system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the invention, and are not to be construed as limiting the invention.
The following describes a method and a system for continuous speech recognition adaptation based on a cache language model according to an embodiment of the present invention with reference to the drawings.
FIG. 1 is a flow diagram of a method of building a cached language model according to one embodiment of the invention.
The method for constructing the cache language model comprises: receiving a continuous voice signal input by a user; segmenting it into a plurality of short voices based on the voice activity detection (VAD) technique; recognizing the short voices in sequence based on a general language model, generating a corresponding recognition result for each; searching the associated word list by keyword; processing the associated words with an extended cache model based on a recurrent network (RNN) to obtain a parametric language model adapted to local changes in the distribution of the historical recognition text; and continuing to recognize subsequent short voices with the modified parametric language model. After local modification, the language model better matches the historical recognition content, improving the accuracy of continuous long-speech recognition. In addition, the user can actively correct mis-recognized low-frequency words during recognition, improving the subsequent recognition accuracy of those words.
As shown in fig. 1, the method for constructing the cache language model includes:
in the present embodiment, S100 to S105 are basic speech recognition function flows.
And S100, receiving a continuous voice signal input by a user.
Specifically, a continuous speech signal input by a user may be received. In the existing implementation, a user inputs voice by uploading off-line recording or clicking a recording button through an identification entrance, and then clicks the end of recording and finishes inputting. The system performs subsequent recognition processing on the received voice signal.
Based on the voice activity detection VAD, the continuous speech signal is segmented into a plurality of short voices S101.
Specifically, each frame of the continuous voice signal is classified by a deep learning algorithm against a preset silence model to identify silent frames; runs of silence reaching a preset long-silence threshold serve as segmentation points, cutting the continuous voice signal into a plurality of effective speech segments. The speech is thus segmented, and the short voices are recognized separately.
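The segmentation logic can be sketched as follows. The per-frame silence decisions are assumed to come from the upstream deep-learning silence model, which is not reproduced here; the function name and frame-count threshold are illustrative.

```python
def segment_by_silence(frame_is_silent, min_silence_frames=30):
    """Split a sequence of per-frame silence decisions (True = silent)
    into (start, end) frame spans of effective speech.  A run of at
    least `min_silence_frames` silent frames acts as a segmentation
    point, as described for the long-silence threshold above."""
    segments, start, silent_run = [], None, 0
    for i, silent in enumerate(frame_is_silent):
        if silent:
            silent_run += 1
            # close the current speech span once silence is long enough
            if start is not None and silent_run >= min_silence_frames:
                segments.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i                # a new speech span begins
            silent_run = 0
    if start is not None:                # trailing speech with no final silence
        segments.append((start, len(frame_is_silent)))
    return segments
```

Each returned span corresponds to one short voice to be queued for recognition.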
S102, queue the short-voice tasks to be processed.
Specifically, the short-voice tasks segmented by VAD are received and added to a queue; if the task queue is not empty, the tasks are processed in order, otherwise the process ends.
S103, recognizing the cut short voice based on the language model and generating a corresponding recognition result.
Specifically, the first recognition uses the general language model or a preset domain model; the recognition result is fed into a nonparametric cache model, and once the cache model has processed the existing historical recognition records and associated words, the corrected language model is used for subsequent recognition.
And S104, presetting a general language model or a domain model.
Specifically, a general language model constructed on a big data text is used as a background language model, basic function support is provided for a speech recognition system, or a corresponding domain language model is directly used when a speech domain can be predicted.
And S105, optionally manually correcting.
Specifically, depending on the usage scenario, in some scenes (such as recording transcription) the user can correct mis-recognized homophones and low-frequency words during recognition; this cooperates with the subsequent local adjustment of the language model to improve the accuracy of later recognition.
In this embodiment, S200 to S202 are flows for obtaining related phrase functions.
And S200, judging whether probability correction is needed to be carried out on the recognition text.
Specifically, the method of the present invention targets low-frequency words and strong-domain feature words in the historical text: for the obtained historical recognition text, if its observation probability under the language model is above a threshold, no subsequent probability correction is performed.
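The threshold test can be sketched as below. The unigram log-probability table, the threshold value, and the treatment of unseen words as maximally unlikely are all illustrative assumptions, not details fixed by the patent.

```python
import math

def words_needing_correction(tokens, unigram_logprob, threshold_logprob=math.log(1e-5)):
    """Flag tokens whose observation probability under the current
    language model falls below a threshold, i.e. candidate low-frequency
    or strong-domain words, as in step S200.  `unigram_logprob` is a
    hypothetical dict of per-word log probabilities; words absent from
    it are treated as maximally unlikely."""
    flagged = []
    for tok in tokens:
        lp = unigram_logprob.get(tok, float("-inf"))
        if lp < threshold_logprob:
            flagged.append(tok)
    return flagged
```

Only the flagged words go on to the keyword search of step S201; the rest of the text triggers no probability correction.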
S201, searching keywords.
Specifically, the words judged in S200 to need probability correction are used as keywords, and the associated phrases are obtained by searching the preset associated word list. To keep the number of cached associated phrases from growing too quickly over a large number of recognition tasks, the associated phrases are pruned before being input into the cache model.
S202, presetting an associated word list.
Specifically, a related word list is constructed from pronunciation similarity, semantic-domain similarity and logical correlation, and an inverted index and product quantization are built for the word list during preprocessing. The inverted index records all words in the corpus, keyed by word or domain topic, with each key pointing to a set of associated phrases, enabling fast keyword retrieval with low memory occupation.
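A minimal sketch of the inverted index is given below. The corpus format (pairs of a keyword and its associated phrases) is an assumption for illustration, and the product-quantization step is omitted entirely.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Build an inverted index over a related-word corpus: each keyword
    (word or domain topic) maps to the set of associated phrases it
    points to, enabling the fast keyword retrieval described above."""
    index = defaultdict(set)
    for keyword, phrases in corpus:
        for phrase in phrases:
            index[keyword].add(phrase)
    return index

def lookup(index, keywords):
    """Return the union of associated phrases for all given keywords."""
    result = set()
    for kw in keywords:
        result |= index.get(kw, set())
    return sorted(result)
```

At query time, the flagged low-frequency words from S200 serve as the `keywords` argument, and the returned phrases are what gets cached.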
In the present embodiment, S301 to S302 are modified language model flows.
S301, extending the cache model based on a recurrent network (RNN).
In particular, by the locality principle, a word that has been used recently has a higher probability of being used again, and words correlated with recently used words also have a higher probability of being used. The present invention uses a recurrent network designed for sequence modeling: at each time step, the network encodes the historical recognition information into a time-dependent hidden vector $h_t \in \mathbb{R}^d$. The subsequent prediction probability based on it is

$$P(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \exp(o_{\omega_t}^{\top} h_{t-1})$$

where $o_{\omega}$ is the row of the output coefficient matrix for word $\omega$, and $\propto$ denotes proportionality.
The hidden vector $h_t$ is updated by the following recursion:

$$h_t = \Phi(x_t, h_{t-1})$$

The update function $\Phi$ takes different forms in different network structures; the present invention uses the Elman network, which defines it as

$$h_t = \sigma(L x_t + R h_{t-1})$$

where $\sigma$ is the nonlinear function tanh, $L \in \mathbb{R}^{d \times |V|}$ is the word embedding matrix, and $R \in \mathbb{R}^{d \times d}$ is the recurrent matrix.
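One Elman update step can be written out directly from the formula above. Plain lists stand in for matrices and the dimensions are tiny; this is an illustration of the recursion, not of a production implementation.

```python
import math

def elman_step(x_t, h_prev, L, R):
    """One Elman-network update h_t = tanh(L x_t + R h_{t-1}).
    L has shape d x |V| (word embedding), R has shape d x d (recurrent),
    x_t is the current input vector, h_prev the previous hidden vector."""
    d = len(h_prev)
    h_t = []
    for i in range(d):
        s = sum(L[i][j] * x_t[j] for j in range(len(x_t)))      # L x_t
        s += sum(R[i][j] * h_prev[j] for j in range(d))         # R h_{t-1}
        h_t.append(math.tanh(s))                                # sigma = tanh
    return h_t
```

Iterating this step over the recognized word sequence produces the hidden vectors $h_i$ that the cache component stores.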
On top of this representation of the history, the cache model stores historical information as pairs $(h_i, \omega_{i+1})$ of a hidden vector and the word observed after it, and obtains the word probability up to time $t$ in the cache component with the following kernel density estimator:

$$P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{\omega_t = \omega_i\}\, K(h_t, h_i), \qquad K(h_t, h_i) = \exp\!\left(-\frac{\|h_t - h_i\|^2}{\theta}\right)$$

In the formula, $\omega_1, \ldots, \omega_{t-1}$ is the historical recognized text sequence, $\omega_i$ the $i$-th word of that sequence, and $\omega_t$ the word at the current time; $P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$ is the probability, based on the cached history, that the current word is $\omega_t$; $K$ denotes the kernel function, here Gaussian; $h_t$ is the hidden vector corresponding to $\omega_t$ and $h_i$ the one corresponding to $\omega_i$; $\theta$ is a bandwidth parameter on the Euclidean distance, $\|\cdot\|$ denotes the norm, and $\propto$ denotes proportionality.
If the cache contents are never cleared and the stored associated phrases reach the million level, the system no longer performs exact search; instead it estimates the word distribution with the following approximate k-nearest-neighbor algorithm, also called a variable kernel density estimator:

$$P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \sum_{h_i \in \mathcal{N}_k(h_t)} \mathbb{1}\{\omega_t = \omega_i\}\, K\!\left(\frac{\|h_t - h_i\|}{\theta(h_t)}\right)$$

where $\omega_t$ is the current word, $\omega_1, \ldots, \omega_{t-1}$ the historical recognized sequence, $\mathcal{N}_k(h_t)$ the set of the $k$ cached hidden vectors $h_i$ with the shortest Euclidean distance to $h_t$, $K$ a kernel function (often chosen as the Gaussian kernel), and $\theta(h_t)$ the Euclidean distance from $h_t$ to its $k$ nearest words.
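The variable kernel density estimator can be sketched as follows. For clarity an exhaustive sort stands in for a real approximate nearest-neighbor index (which the million-entry cache would require in practice), and the `or 1.0` bandwidth fallback for coincident vectors is an added guard, not part of the formula.

```python
import math
from collections import defaultdict

def knn_cache_probs(h_t, cache, k):
    """Approximate k-NN cache distribution: keep the k cached
    (hidden vector, word) pairs closest to h_t in Euclidean distance,
    weight them by a Gaussian kernel with bandwidth theta(h_t) equal to
    the k-th neighbour's distance, and normalize to a distribution."""
    dists = sorted(
        (math.dist(h_t, h_i), w_i) for h_i, w_i in cache
    )[:k]                                       # the k nearest neighbours
    theta = dists[-1][0] or 1.0                 # bandwidth: k-th neighbour distance
    scores = defaultdict(float)
    for d, w in dists:
        scores[w] += math.exp(-(d / theta) ** 2)  # Gaussian kernel
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

Words whose cached hidden states cluster near the current state $h_t$ receive most of the cache probability mass, which is exactly the locality effect the estimator is meant to capture.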
According to the above, a local language model based on 3-grams is constructed.
And S302, carrying out linear interpolation to obtain the corrected language model.
Specifically, the local word probability is modified using the following linear interpolation formula:

$$P(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) = (1 - \lambda)\, P_{model}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) + \lambda\, P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

in which $\lambda$ is a tuning parameter and $P_{model}$ is the probability under the existing language model that the current word is $\omega_t$ given the preceding words $\omega_1, \ldots, \omega_{t-1}$. $\lambda$ is adjusted according to experimental results, finally yielding a corrected language model that contains the historical recognition text information and is used to recognize the remaining speech.
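Over a finite vocabulary the interpolation is a one-liner per word; the sketch below applies it to two probability dicts, with words absent from the cache simply receiving zero cache mass. The dict-based representation is an illustrative simplification of a real n-gram model's storage.

```python
def interpolate(p_model, p_cache, lam):
    """Linear interpolation P = (1 - lambda) * P_model + lambda * P_cache
    over the union vocabulary of the two distributions."""
    vocab = set(p_model) | set(p_cache)
    return {w: (1 - lam) * p_model.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in vocab}
```

Because both inputs are proper distributions over the same event, the interpolated result still sums to one for any $\lambda \in [0, 1]$.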
In order to solve the problem that the single sentence task type recognition system cannot utilize the historical recognition information to a certain extent, the invention also provides a continuous voice recognition self-adaptive system by combining the method.
Fig. 2 is a schematic structural diagram of a continuous speech recognition adaptive system according to an embodiment of the present invention.
As shown in fig. 2, the continuous speech recognition adaptive system may include: a speech signal receiving module 100, a speech signal segmentation module 101, an ASR decoding module 102, an artificial correction module (optional) 103, a related word searching module 200, a probability correction judgment module, a language model correction module 300, and a storage module.
The voice signal receiving module 100 is configured to receive a continuous voice signal input by a user. Specifically, the user uploads an offline audio file through the client, or clicks a recording button, inputs voice through the recording device, and clicks recording to finish recording. In the recording process, the system can automatically perform segmentation recognition according to the recorded voice signal and return a recognition result in real time.
The voice signal segmentation module 101 segments the continuous voice signal into a plurality of short voices based on the voice activity detection VAD. Specifically, a silence model is established by using a pre-labeled corpus, a silence frame recognition is performed on each frame in the continuous voice signal, the frame reaching a preset long silence threshold value is used as a voice signal segmentation point, the continuous voice signal is segmented into a plurality of effective voice segments, and therefore a voice task group to be recognized is obtained, and voice recognition is performed in sequence.
The ASR decoding module 102 is configured with a task reading unit, an automatic speech recognition model and a recognition result output unit, wherein the automatic speech recognition model comprises a dynamic language submodel; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
At the first recognition, or while the number of associated words from the historical recognition text is not yet enough to construct a local language model, the ASR decoding module decodes with the general language model or a predetermined domain language model; after the local language model is constructed, the ASR decoding module decodes with the interpolation-corrected language model.
A probability correction judging module: the short speech recognition method comprises the steps of judging whether low-frequency words or strong-field characteristic words exist in a recognition text output by the ASR decoding unit, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding unit to start recognition of the next short speech.
And an (optional) manual modification module 103, configured to manually correct the recognized text output by the ASR decoding unit. Specifically, mis-recognized low-frequency words are corrected; manual correction prevents the system from introducing wrong associated words and further improves recognition of the continuous voice signal. Because it introduces extra interactive operations, this module is better suited to offline transcription of voice signals, trading some real-time performance for recognition accuracy.
The related word searching module 200 is configured to perform keyword search according to a related word corpus preset in the storage module, obtain a related word group of the history recognition word, and input the related word group into the cache model to obtain local observation probability information of the related word group. The association relation in the preset association word list is pronunciation approximation, semantic field approximation and logic correlation, the word groups with the relation form the association word list, and an inverted index is established on the association word list to carry out rapid keyword retrieval.
And the language model modification module 300 is configured to update the parameters of the dynamic language submodel in the ASR decoding unit according to the real-time and historical associated phrases obtained by the related word search module. Specifically, an extended cache model based on a recurrent network (RNN) processes the associated phrases of the historical recognition text; once the number of associated words reaches a threshold, the local observation probability is calculated, a local language model is built on it, and the local language model and the general language model are interpolated in the language model correction unit, finally yielding a corrected language model containing the local information of the continuous voice signal for the subsequent speech recognition process.
Wherein, the language model modification module comprises:
storage area: the related word group is used for storing the related word group output by the related word searching module;
a local probability calculation unit: used for calculating the local vocabulary probability distribution corresponding to the most recent associated phrases in the storage area.
A language model modification unit: the automatic speech recognition model is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and a dynamic language sub-model in the ASR decoding unit, and updating the automatic speech recognition model.
And sending a signal to the ASR decoding unit after updating, and starting the recognition of the next short voice.
The units/modules employed in the system of the present invention may all employ the methods described in the above embodiments.
The continuous speech recognition adaptive system of this embodiment of the invention achieves adaptation of the language model to the continuous speech content by processing the recognition text of historical speech signals, and is particularly effective when the contextual relevance, domain relevance, and topical consistency of the speech content are strong. In addition, the user is allowed to actively correct intermediate recognition results during recognition, further improving the user experience in specific task scenarios.
Examples
To demonstrate the experimental effect of the present invention, this embodiment provides a comparative experiment. The experimental method is the same as the process described above; only the specific implementation details are given here, and repeated processes are not described again.
This implementation was trained and tested using the following datasets:
LibriSpeech: a well-known open-source English dataset (LibriSpeech: an ASR corpus based on public domain audio books, Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015), comprising about 1000 hours of 16 kHz read English speech. The data are taken from audiobooks of the LibriVox project and have been carefully segmented and aligned.
News Crawl: a dataset of news articles covering 2007-2013. In this embodiment, the cache language model is tested on the News Crawl 2007-2011 sub-datasets, whose data distribution changes over time.
WikiText: an English dataset derived from Wikipedia articles, developed by Merity et al. (S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In ICLR, 2017.); the basic recognition model is trained on the original annotated-format data.
The Book Corpus: a Project Gutenberg dataset (S. Lahiri. Complexity of word collocation networks: A preliminary structural analysis. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.) containing the text corpora of 3,036 English books.
Phone-record: real telephone-scene recordings, containing about 200 hours of 16 kHz speech corpora with good-quality transcription text.
The implementation details are as follows:
In the basic speech recognition system, 39-dimensional MFCC features are extracted on LibriSpeech with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; a kaldi-chain acoustic model is trained, and recognition uses the standard nnet3-latgen-faster decoding flow. In the modeling setup, the GMM stage uses 250,000 final phoneme Gaussians and 18,000 leaf nodes; the final GMM-stage alignment is used as the initial alignment for chain-model training; the chain network comprises 3 LSTM layers, each with 1024 LSTM units. The initial language model is a 3-gram model built from the LibriSpeech, WikiText, and Book Corpus text corpora;
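As a small arithmetic illustration of the 25 ms window / 10 ms shift framing above (a sketch of the framing arithmetic only, not kaldi's actual feature pipeline):

```python
def num_frames(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Count analysis frames for fixed-length framing with a fixed shift."""
    frame_len = sample_rate * frame_ms // 1000    # 400 samples at 16 kHz
    frame_shift = sample_rate * shift_ms // 1000  # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift
```

One second of 16 kHz audio therefore yields 98 frames, each producing one 39-dimensional MFCC vector.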
For the associated word list, a MySQL relational database is built over the News Crawl, WikiText, and Book Corpus texts, multidimensional tags including topic and domain information are set, and an inverted index is established. In the related-word search process, the low-frequency-word threshold is -4.75; words above this threshold are not processed and are not added to the cache model. The local language model interpolation coefficient λ is set to 0.25;
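The low-frequency-word test can be sketched as follows; treating the -4.75 threshold as a log10 unigram probability and using add-one smoothing are both assumptions made for illustration, not details given in the patent:

```python
import math

LOW_FREQ_LOG_THRESHOLD = -4.75  # threshold from the embodiment; log10 scale assumed

def is_low_frequency(word, unigram_counts, total_tokens):
    """Flag a word for cache-model processing when its estimated log10
    unigram probability falls below the threshold."""
    vocab_size = len(unigram_counts)
    # Add-one smoothing (an illustrative assumption) so unseen words get a
    # small nonzero probability instead of breaking the log.
    p = (unigram_counts.get(word, 0) + 1) / (total_tokens + vocab_size)
    return math.log10(p) < LOW_FREQ_LOG_THRESHOLD
```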
performance evaluation:
for cached language model performance, the recognition accuracy is evaluated using Word Error Rate (WER), and the calculation formula is as follows:
Figure BDA0002796208390000111
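The WER defined here can be computed with a word-level edit distance; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed via word-level Levenshtein distance
    against the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.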
in this implementation, the recognition performance of four schemes is evaluated:
table 1: caching language model comparison test results
Wherein, Base: only the basic speech recognition system is used, without modifying the initial language model;
Local cache: the baseline proposed by Grave et al. (E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. In ICLR, 2017.), a relatively common method of using historical recognition information;
Cache LM: the cache language model used by the invention, with the word error rate calculated over the continuous recognition results;
Final LM: each test dataset is re-recognized using the final language model, which is not further modified by the cache model during this recognition.
According to Table 1, the method of the invention improves performance over the control schemes on all three test sets. Where the training and test sets are similar and low-frequency words are few, the gain from the cache language model is limited; on the other two test sets, particularly the strongly domain-specific Phone-record, the effect is significant.
It is worth noting that, as an evaluation of the final language-model modification effect, Final LM achieves the best accuracy on all three test sets; that is, the modified language model describes the textual relations of the test data better than the original language model, which fully demonstrates that modifying the language model with historical recognition information is effective and has a positive influence on context-associated speech recognition tasks.
For the manual modification module, this embodiment performs a test on a single one-hour conference recording; the results are shown in Table 2:
Table 2: manual correction module test; the data are error rates of specific words
In the conference recording, two low-frequency words that do not appear, or appear only rarely, in the training data were selected. The system without manual correction (base) recognizes them poorly; after manual correction is applied (modified), the recognition accuracy of the system improves markedly, fully demonstrating the benefit of the manual correction module for long-speech transcription tasks.
While the present invention has been described with reference to the above embodiments, it is not intended to limit the invention. Various modifications and alterations may be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the protection scope of the invention is subject to the claims of the invention.

Claims (8)

1. A speech recognition adaptive method based on a cache language model is characterized by comprising the following steps:
step 1: aiming at a section of continuous long voice, firstly, a plurality of short voices are obtained through segmentation, and a task queue is formed according to a time sequence;
step 2: taking a first short voice in the task queue as the input of an automatic voice recognition system, obtaining a recognition text, and deleting the short voice from the task queue; the automatic speech recognition system comprises a dynamic language model, and a preset language model is used as an initialized dynamic language model;
step 3: establishing a cache model, and judging in real time, according to the recognition text of each short voice, whether probability correction is needed; if not, returning to step 2 until the task queue is empty, completing the recognition task; if yes, performing a keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of the cache model, calculating a local vocabulary probability distribution, and constructing a local language model;
step 4: interpolating and merging the local language model constructed in step 3 with the dynamic language model in the automatic speech recognition system to obtain an updated dynamic language model, and returning to step 2 until the task queue is empty, completing the recognition task.
2. The adaptive speech recognition method based on cached language model according to claim 1, wherein the recognized text output by the automatic speech recognition system in step 2 is modified manually.
3. The speech recognition adaptive method based on the cache language model according to claim 1, wherein the step 3 specifically comprises:
step 3.1: establishing a cache model which comprises a plurality of cache regions;
step 3.2: judging in real time, according to the recognition text of each short voice, whether low-frequency words or strong domain-feature words exist; if so, performing probability correction and entering step 3.3; if not, performing no probability correction and returning to step 2 until the task queue is empty, completing the recognition task;
step 3.3: performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
Pcache(ωt|ω1,...,ωt-1) ∝ Σi=1..t-1 1{ωt=ωi}·K(||ht-hi||/θ)
in the formula, ω1,...,ωt-1 denotes the historically recognized text sequence, ωi is the i-th word of that sequence, and ωt is the word at the current time; Pcache(ωt|ω1,...,ωt-1) denotes the probability, based on the cached history, that the current word of the text sequence is ωt; K denotes a kernel function, ht is the hidden vector corresponding to ωt, hi is the hidden vector corresponding to ωi, θ denotes a Euclidean-distance bandwidth, ||·|| denotes the vector norm, and ∝ denotes proportionality;
step 3.4: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
4. The adaptive speech recognition method according to claim 3, wherein, when the associated phrases stored in the cache region of the cache model reach a million-scale threshold, the local vocabulary probability distribution formula in step 3.3 is replaced with the following formula:
Pcache(ωt|ω1,...,ωt-1) ∝ Σhi∈N(ht) 1{ωt=ωi}·K(||ht-hi||/θ(ht))
in the formula, N(ht) is the set of hidden vectors hi with the shortest Euclidean distance to ht, and θ(ht) is the Euclidean distance from ht to its neighboring words.
5. The speech recognition adaptive method based on the cache language model according to claim 4, wherein the step 4 specifically comprises:
the local language model and the dynamic language model in the automatic speech recognition system are merged by interpolation, and the probability of the local word is corrected, wherein the formula is as follows:
P(ωt|ω1,...,ωt-1) = (1-λ)·Pmodel(ωt|ω1,...,ωt-1) + λ·Pcache(ωt|ω1,...,ωt-1)
in which λ is a tuning parameter, and Pmodel(ωt|ω1,...,ωt-1) denotes the probability, under the existing language model, that the current word is ωt given the preceding words ω1,...,ωt-1;
and (5) obtaining an updated dynamic language model through the corrected word probability, returning to the step (2) until the task queue is empty, and completing the recognition task.
6. A speech recognition adaptive system based on a cache language model, for implementing the speech recognition adaptive method of any one of claims 1-5, comprising:
the voice signal receiving module: for receiving a continuous speech signal to be recognized;
the voice signal segmentation module: the voice processing device is used for cutting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
a storage module: used for storing the task queue to be recognized, the recognized text results, and the related-word corpus, wherein the task queue is formed by the plurality of short voices in time order;
an ASR decoding module: the system is provided with a task reading unit, an automatic voice recognition model and a recognition result output unit, wherein the automatic voice recognition model comprises a dynamic language sub-model; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
a probability correction judging module: used for judging whether low-frequency words or strong domain-feature words exist in the recognition text output by the ASR decoding module; if so, sending a start signal to the related word search module; if not, sending a signal to the ASR decoding module to start recognition of the next short voice;
a related word search module: used for performing a keyword search according to the related-word corpus preset in the storage module, obtaining the related word group of a historically recognized word, and inputting the related word group into the cache model to obtain local observation probability information of the related word group;
a language model modification module: and the parameter updating module is used for updating the parameters of the dynamic language submodel in the ASR decoding module according to the real-time associated phrases and the historical associated phrases obtained by the associated word searching module.
7. The adaptive speech recognition system according to claim 6, wherein the language model modification module comprises:
storage area: the related word group is used for storing the related word group output by the related word searching module;
a local probability calculation unit: the local vocabulary probability distribution corresponding to the associated phrases at the latest moment of the storage area is calculated;
a language model modification unit: the automatic speech recognition model is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and a dynamic language submodel in an ASR decoding module, and updating the automatic speech recognition model;
and after updating, sending a signal to an ASR decoding module to start the recognition of the next short voice.
8. The cache language model-based speech recognition adaptation system of claim 6, further comprising:
a manual correction module: for manually modifying the recognized text output by the ASR decoding module.
CN202011332443.2A 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model Active CN112509560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011332443.2A CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011332443.2A CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Publications (2)

Publication Number Publication Date
CN112509560A true CN112509560A (en) 2021-03-16
CN112509560B CN112509560B (en) 2021-09-03

Family

ID=74958319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332443.2A Active CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Country Status (1)

Country Link
CN (1) CN112509560B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767921A (en) * 2021-01-07 2021-05-07 国网浙江省电力有限公司 Voice recognition self-adaption method and system based on cache language model
CN113421553A (en) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 Audio selection method and device, electronic equipment and readable storage medium
CN113741783A (en) * 2021-07-30 2021-12-03 北京搜狗科技发展有限公司 Key identification method and device for identifying key

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030083863A1 (en) * 2000-09-08 2003-05-01 Ringger Eric K. Augmented-word language model
CN102880611A (en) * 2011-07-14 2013-01-16 腾讯科技(深圳)有限公司 Language modeling method and language modeling device
US20150332670A1 (en) * 2014-05-15 2015-11-19 Microsoft Corporation Language Modeling For Conversational Understanding Domains Using Semantic Web Resources
US20180218728A1 (en) * 2017-02-02 2018-08-02 Adobe Systems Incorporated Domain-Specific Speech Recognizers in a Digital Medium Environment
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110930993A (en) * 2018-09-20 2020-03-27 蔚来汽车有限公司 Specific field language model generation method and voice data labeling system
CN111276124A (en) * 2020-01-22 2020-06-12 苏州科达科技股份有限公司 Keyword identification method, device and equipment and readable storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767921A (en) * 2021-01-07 2021-05-07 国网浙江省电力有限公司 Voice recognition self-adaption method and system based on cache language model
CN113421553A (en) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 Audio selection method and device, electronic equipment and readable storage medium
CN113421553B (en) * 2021-06-15 2023-10-20 北京捷通数智科技有限公司 Audio selection method, device, electronic equipment and readable storage medium
CN113741783A (en) * 2021-07-30 2021-12-03 北京搜狗科技发展有限公司 Key identification method and device for identifying key

Also Published As

Publication number Publication date
CN112509560B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN111480197B (en) Speech recognition system
Asami et al. Domain adaptation of dnn acoustic models using knowledge distillation
CN106683677B (en) Voice recognition method and device
US8280733B2 (en) Automatic speech recognition learning using categorization and selective incorporation of user-initiated corrections
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
US20210312914A1 (en) Speech recognition using dialog history
Hori et al. Speech recognition algorithms using weighted finite-state transducers
EP1465154B1 (en) Method of speech recognition using variational inference with switching state space models
CN112927682B (en) Speech recognition method and system based on deep neural network acoustic model
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
Bacchiani et al. Joint lexicon, acoustic unit inventory and model design
Kala et al. Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection
Demuynck Extracting, modelling and combining information in speech recognition
WO2010100853A1 (en) Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium
CN112542170A (en) Dialogue system, dialogue processing method, and electronic device
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
Becerra et al. A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish
Tabibian A survey on structured discriminative spoken keyword spotting
Norouzian et al. An approach for efficient open vocabulary spoken term detection
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
JPH09134192A (en) Statistical language model forming device and speech recognition device
Huda et al. A variable initialization approach to the EM algorithm for better estimation of the parameters of hidden markov model based acoustic modeling of speech signals
CN113593560B (en) Customizable low-delay command word recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant