CN112509560A - Voice recognition self-adaption method and system based on cache language model - Google Patents


Info

Publication number
CN112509560A
Authority
CN
China
Prior art keywords
recognition
language model
model
module
cache
Prior art date
Legal status
Granted
Application number
CN202011332443.2A
Other languages
Chinese (zh)
Other versions
CN112509560B (en)
Inventor
黄俊杰
Current Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202011332443.2A priority Critical patent/CN112509560B/en
Publication of CN112509560A publication Critical patent/CN112509560A/en
Application granted granted Critical
Publication of CN112509560B publication Critical patent/CN112509560B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses a speech recognition self-adaption method and system based on a cache language model, belonging to the field of speech recognition. The method comprises: receiving a continuous voice signal input by a user; segmenting the continuous voice signal into a plurality of short voices based on the voice activity detection (VAD) technique; recognizing the short voices in sequence based on a general language model, generating a corresponding recognition result for each short voice; searching by keyword to obtain associated words; processing the associated words through a cache model to obtain a language model adapted to local changes in the distribution of the historical recognition text; and continuing to recognize subsequent short voices based on the modified language model. After local modification, the language model better matches the historical recognition content, improving the accuracy of continuous long-speech recognition. In addition, the user can actively correct mis-recognized low-frequency words during recognition, improving the subsequent recognition accuracy of those words.

Description

Voice recognition self-adaption method and system based on cache language model
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition self-adaption method and system based on a cache language model.
Background
After decades of development, speech recognition technology has matured; in practical applications, systems such as Siri and Cortana achieve high recognition accuracy under ideal conditions.
The performance of a speech recognition system depends to a large extent on the similarity between the Language Model (LM) used and the task to be processed. This similarity is particularly important in cases where the statistical properties of the language change over time, such as in application scenarios involving spontaneous and multi-domain speech. Topic Identification (TI) based on information retrieval is a key technology, and a topic under discussion is obtained through semantic analysis of a historical identification result, so that a language model is adjusted, and dynamic self-adaptation is realized.
A problem with topic identification is that individual low-frequency words may cause unnecessary changes to the language model because of the apparent domain characteristics they carry. On the speech-signal-processing side, current speech recognition systems mainly adopt single-sentence task recognition: no matter how long the input speech is, the system recognizes each sentence, delimited by the voice activity detection (VAD) result, as an independent task. The advantage is better recognition timeliness and, to some extent, reduced system overhead.
For scenarios with strong contextual relations or a professional domain, such as academic conferences and interview transcripts, a single-sentence recognition system ignores the context, repeatedly makes the same mistakes on inaccurately recognized words, and cannot exploit domain information to recognize low-frequency words. On the other hand, for a speech recognition system configured with multiple domain language models, the domain model generally must be specified manually before recognition starts, or the outputs of multiple domain models must be selected by perplexity, which adds unnecessary steps and leaves the recognition system insufficiently intelligent.
Disclosure of Invention
The invention provides a speech recognition self-adaptive method and system based on a cache language model, aiming to solve the problem that an existing single-sentence-task speech recognition system cannot adaptively recognize low-frequency words from domain information, resulting either in low recognition accuracy for low-frequency words or in an overly complex recognition system. The method comprises the following steps: receiving a continuous voice signal input by a user; segmenting the continuous voice signal into a plurality of short voices based on voice activity detection (VAD); recognizing the short voices in sequence based on a general language model or a preset domain language model and generating recognition results; obtaining words associated with the recognition results by keyword search; and extending a cache language model with a nonparametric method based on a recurrent network (RNN) to obtain a new probability distribution over the historical recognized words and their associated words, yielding new word statistics. The invention uses a dynamic cache of historical recognition results to modify the probabilities of the language model, so that the speech recognition system adapts to recognition tasks with coherent domains while avoiding unnecessary changes to the domain language model.
In order to achieve the above object, the present invention adopts a speech recognition adaptive method based on a cache language model, which comprises the following steps.
Step 1: for a segment of continuous long speech, first obtain a plurality of short voices by segmentation, and form a task queue in time order.
Step 2: take the first short voice in the task queue as the input of an automatic speech recognition system, obtain a recognized text, and delete that short voice from the task queue. The automatic speech recognition system contains a dynamic language model, initialized from a preset language model.
Step 3: establish a cache model and judge in real time, from the recognized text of each short voice, whether probability correction is needed. If not, return to step 2 until the task queue is empty and the recognition task is complete. If so, perform a keyword search against the preset associated word list to obtain associated phrases, store them in a cache region of the cache model, calculate the local vocabulary probability distribution, and construct a local language model.
Step 4: interpolate the local language model constructed in step 3 with the dynamic language model in the automatic speech recognition system to obtain an updated dynamic language model; return to step 2 until the task queue is empty and the recognition task is complete.
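The control flow of steps 1-4 can be sketched as a short loop. This is an illustrative sketch only: the callables passed in (`asr.recognize`, `cache_model.store`, `needs_correction`, and so on) are hypothetical stand-ins for the patent's modules, not its actual implementation.

```python
from collections import deque

def adaptive_recognize(short_voices, asr, cache_model, needs_correction, search_related):
    """Sketch of steps 1-4: recognize queued short voices, updating the
    dynamic language model from cached associated phrases as needed."""
    queue = deque(short_voices)                     # step 1: task queue in time order
    results = []
    while queue:                                    # until the task queue is empty
        text = asr.recognize(queue.popleft())       # step 2: decode the first item
        results.append(text)
        if needs_correction(text):                  # step 3: low-frequency word found?
            phrases = search_related(text)          # keyword search in the word list
            cache_model.store(phrases)              # fill the cache region
            local_lm = cache_model.build_local_lm() # local 3-gram model
            asr.interpolate_lm(local_lm)            # step 4: update the dynamic LM
    return results
```

The dynamic language model lives inside the ASR object, so the update in step 4 takes effect for every subsequent iteration of the loop.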
Further, the recognized text output by the automatic speech recognition system in step 2 is manually corrected.
Furthermore, the preset language model is a 3-gram language model built on a large text corpus.
Further, the step 3 specifically comprises:
step 3.1: establishing a cache model which comprises a plurality of cache regions;
step 3.2: judge in real time, from the recognized text of each short voice, whether low-frequency words or strong-domain feature words exist; if so, probability correction is performed, go to step 3.3; if not, no probability correction is performed, return to step 2 until the task queue is empty and the recognition task is complete;
step 3.3: and searching keywords according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies.
Step 3.4: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
Further, when the number of associated phrases stored in the cache region of the cache model reaches a threshold on the order of one million, the formula for the local vocabulary probability distribution in step 3.3 is replaced by an approximate one.
Further, the step 4 specifically includes:
The local language model is interpolated with the dynamic language model in the automatic speech recognition system, correcting the probabilities of local words; the updated dynamic language model is obtained from the corrected word probabilities. Return to step 2 until the task queue is empty, completing the recognition task.
Another objective of the present invention is to provide a speech recognition adaptive system based on a cached language model, for implementing the above speech recognition adaptive method, including:
the voice signal receiving module: for receiving a continuous speech signal to be recognized;
the voice signal segmentation module: the voice processing device is used for cutting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
a storage module: used for storing the task queue to be recognized, the recognized text results, and the related-word corpus, wherein the task queue is formed from the plurality of short voices in time order;
an ASR decoding module: the system is provided with a task reading unit, an automatic voice recognition model and a recognition result output unit, wherein the automatic voice recognition model comprises a dynamic language sub-model; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
a probability correction judging module: the ASR decoding module is used for judging whether low-frequency words or strong-field characteristic words exist in the recognition text output by the ASR decoding module, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding module to start the recognition of the next short voice;
a related word search module: used for performing keyword search in the related-word corpus stored in the storage module to obtain the associated phrases of the keywords;
a language model modification module: used for updating the parameters of the dynamic language submodel in the ASR decoding module according to the real-time associated phrases and historical associated phrases obtained by the related word search module.
The invention has the following beneficial effects:
(1) A traditional speech recognition system demands large data volumes and computing resources during training, and users usually cannot retrain the model for their actual usage scenario, so the similarity between the language model and the actual recognition task is poor. The present method uses cached historical information: an extended cache model based on a recurrent network (RNN) processes historical recognition results so that the parametric model adapts to local changes in the distribution, and the empirical parameters optimized in advance on large data are interpolated with the original language model. By the locality principle, historical information has a higher probability of appearing in subsequent recognition tasks; that is, constructing the cache language model gradually improves the similarity between the language model and the actual recognition task, and thus the recognition accuracy of the speech recognition system.
(2) For recognition tasks with high domain relevance, a fixed language model cannot adapt to the diversity of the task corpus domain. The invention uses a plurality of domain text corpora prepared in advance to construct a relational word list. And after obtaining the historical cache information, searching in the relation word list to obtain an associated word group, constructing a cache language model according to the associated word group, and interpolating and combining the cache language model and the original language model for subsequent identification. The addition of the associated phrases improves the domain correlation of the language model and is beneficial to accurately identifying the identification task with stronger domain.
(3) For a long continuous recording and transcription recognition task, a traditional voice recognition system cannot interfere recognition results in recognition, and repeated errors of certain low-frequency words are caused. The manual correction module designed by the invention allows a user to manually correct the historical recognition result, improves the probability of the corresponding low-frequency words in the cache language model, avoids repeated mistakes of the low-frequency words and improves the accuracy of subsequent recognition.
Drawings
FIG. 1 is a flow diagram of a method of building a cached language model according to one embodiment of the invention.
Fig. 2 is a schematic structural diagram of a continuous speech recognition adaptive system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the invention, and are not to be construed as limiting the invention.
The following describes a method and a system for continuous speech recognition adaptation based on a cache language model according to an embodiment of the present invention with reference to the drawings.
FIG. 1 is a flow diagram of a method of building a cached language model according to one embodiment of the invention.
The method for constructing the cache language model comprises: receiving a continuous voice signal input by a user; segmenting it into a plurality of short voices based on the voice activity detection (VAD) technique; recognizing the short voices in sequence based on a general language model, generating a corresponding recognition result for each; searching the associated word list by keyword; processing the associated words with an extended cache model based on a recurrent network (RNN) to obtain a parametric language model adapted to local changes in the distribution of the historical recognition text; and continuing to recognize subsequent short voices with the modified parametric language model. After local modification, the language model better matches the historical recognition content, improving the accuracy of continuous long-speech recognition. In addition, the user can actively correct mis-recognized low-frequency words during recognition, improving the subsequent recognition accuracy of those words.
As shown in fig. 1, the method for constructing the cache language model includes:
in the present embodiment, S100 to S105 are basic speech recognition function flows.
And S100, receiving a continuous voice signal input by a user.
Specifically, a continuous speech signal input by a user may be received. In the existing implementation, a user inputs voice by uploading off-line recording or clicking a recording button through an identification entrance, and then clicks the end of recording and finishes inputting. The system performs subsequent recognition processing on the received voice signal.
Based on the voice activity detection VAD, the continuous speech signal is segmented into a plurality of short voices S101.
Specifically, each frame of the continuous voice signal is classified by a deep learning algorithm against a preset silence model to identify silent frames; runs of silence reaching a preset long-silence threshold serve as segmentation points, cutting the continuous voice signal into a plurality of effective speech segments. The speech is thus segmented, and the short voices are recognized separately.
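The segmentation logic can be sketched as follows. The per-frame silence decisions are assumed to come from the upstream deep-learning silence model, which is not reproduced here; the function name and frame-count threshold are illustrative.

```python
def segment_by_silence(frame_is_silent, min_silence_frames=30):
    """Split a sequence of per-frame silence decisions (True = silent)
    into (start, end) frame spans of effective speech.  A run of at
    least `min_silence_frames` silent frames acts as a segmentation
    point, as described for the long-silence threshold above."""
    segments, start, silent_run = [], None, 0
    for i, silent in enumerate(frame_is_silent):
        if silent:
            silent_run += 1
            # close the current speech span once silence is long enough
            if start is not None and silent_run >= min_silence_frames:
                segments.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i                # a new speech span begins
            silent_run = 0
    if start is not None:                # trailing speech with no final silence
        segments.append((start, len(frame_is_silent)))
    return segments
```

Each returned span corresponds to one short voice to be queued for recognition.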
S102, queue the short-voice tasks to be processed.
Specifically, the short-voice tasks segmented by VAD are received and added to a queue; if the task queue is not empty, the tasks are processed in order, otherwise the process ends.
S103, recognizing the cut short voice based on the language model and generating a corresponding recognition result.
Specifically, the first recognition uses the general language model or a preset domain model; the recognition result is fed into a nonparametric cache model, and once the cache model has processed the existing historical recognition records and associated words, the corrected language model is used for subsequent recognition.
And S104, presetting a general language model or a domain model.
Specifically, a general language model constructed on a big data text is used as a background language model, basic function support is provided for a speech recognition system, or a corresponding domain language model is directly used when a speech domain can be predicted.
And S105, optionally manually correcting.
Specifically, depending on the usage scenario, in some scenes (such as recording transcription) the user can correct mis-recognized homophones and low-frequency words during recognition; this cooperates with the subsequent local adjustment of the language model to improve the accuracy of later recognition.
In this embodiment, S200 to S202 are flows for obtaining related phrase functions.
And S200, judging whether probability correction is needed to be carried out on the recognition text.
Specifically, the method of the present invention targets low-frequency words and strong-domain feature words in the historical text: for the obtained historical recognition text, if its observation probability under the language model is above a threshold, no subsequent probability correction is performed.
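The threshold test can be sketched as below. The unigram log-probability table, the threshold value, and the treatment of unseen words as maximally unlikely are all illustrative assumptions, not details fixed by the patent.

```python
import math

def words_needing_correction(tokens, unigram_logprob, threshold_logprob=math.log(1e-5)):
    """Flag tokens whose observation probability under the current
    language model falls below a threshold, i.e. candidate low-frequency
    or strong-domain words, as in step S200.  `unigram_logprob` is a
    hypothetical dict of per-word log probabilities; words absent from
    it are treated as maximally unlikely."""
    flagged = []
    for tok in tokens:
        lp = unigram_logprob.get(tok, float("-inf"))
        if lp < threshold_logprob:
            flagged.append(tok)
    return flagged
```

Only the flagged words go on to the keyword search of step S201; the rest of the text triggers no probability correction.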
S201, searching keywords.
Specifically, the words judged in S200 to need probability correction are used as keywords, and the associated phrases are obtained by searching the preset associated word list. To keep the number of cached associated phrases from growing too quickly over a large number of recognition tasks, the associated phrases are pruned before being input into the cache model.
S202, presetting an associated word list.
Specifically, a related word list is constructed from pronunciation similarity, semantic-domain similarity and logical correlation, and an inverted index and product quantization are built for the word list during preprocessing. The inverted index records all words in the corpus, keyed by word or domain topic, with each key pointing to a set of associated phrases, enabling fast keyword retrieval with low memory occupation.
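A minimal sketch of the inverted index is given below. The corpus format (pairs of a keyword and its associated phrases) is an assumption for illustration, and the product-quantization step is omitted entirely.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Build an inverted index over a related-word corpus: each keyword
    (word or domain topic) maps to the set of associated phrases it
    points to, enabling the fast keyword retrieval described above."""
    index = defaultdict(set)
    for keyword, phrases in corpus:
        for phrase in phrases:
            index[keyword].add(phrase)
    return index

def lookup(index, keywords):
    """Return the union of associated phrases for all given keywords."""
    result = set()
    for kw in keywords:
        result |= index.get(kw, set())
    return sorted(result)
```

At query time, the flagged low-frequency words from S200 serve as the `keywords` argument, and the returned phrases are what gets cached.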
In the present embodiment, S301 to S302 are modified language model flows.
S301, extending the cache model based on a recurrent network (RNN).
In particular, by the locality principle, a word that has been used recently has a higher probability of being used again, and words correlated with recently used words also have a higher probability of being used. The present invention uses a recurrent network designed for sequence modeling: at each time step, the network encodes the historical recognition information into a time-dependent hidden vector $h_t \in \mathbb{R}^d$. The subsequent prediction probability based on it is

$$P(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \exp(o_{\omega_t}^{\top} h_{t-1})$$

where $o_{\omega}$ is the row of the output coefficient matrix for word $\omega$, and $\propto$ denotes proportionality.
The hidden vector $h_t$ is updated by the following recursion:

$$h_t = \Phi(x_t, h_{t-1})$$

The update function $\Phi$ takes different forms in different network structures; the present invention uses the Elman network, which defines it as

$$h_t = \sigma(L x_t + R h_{t-1})$$

where $\sigma$ is the nonlinear function tanh, $L \in \mathbb{R}^{d \times |V|}$ is the word embedding matrix, and $R \in \mathbb{R}^{d \times d}$ is the recurrent matrix.
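One Elman update step can be written out directly from the formula above. Plain lists stand in for matrices and the dimensions are tiny; this is an illustration of the recursion, not of a production implementation.

```python
import math

def elman_step(x_t, h_prev, L, R):
    """One Elman-network update h_t = tanh(L x_t + R h_{t-1}).
    L has shape d x |V| (word embedding), R has shape d x d (recurrent),
    x_t is the current input vector, h_prev the previous hidden vector."""
    d = len(h_prev)
    h_t = []
    for i in range(d):
        s = sum(L[i][j] * x_t[j] for j in range(len(x_t)))      # L x_t
        s += sum(R[i][j] * h_prev[j] for j in range(d))         # R h_{t-1}
        h_t.append(math.tanh(s))                                # sigma = tanh
    return h_t
```

Iterating this step over the recognized word sequence produces the hidden vectors $h_i$ that the cache component stores.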
On top of this representation of the history, the cache model stores historical information as pairs $(h_i, \omega_{i+1})$ of a hidden vector and the word observed after it, and obtains the word probability up to time $t$ in the cache component with the following kernel density estimator:

$$P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{\omega_t = \omega_i\}\, K(h_t, h_i), \qquad K(h_t, h_i) = \exp\!\left(-\frac{\|h_t - h_i\|^2}{\theta}\right)$$

In the formula, $\omega_1, \ldots, \omega_{t-1}$ is the historical recognized text sequence, $\omega_i$ the $i$-th word of that sequence, and $\omega_t$ the word at the current time; $P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$ is the probability, based on the cached history, that the current word is $\omega_t$; $K$ denotes the kernel function, here Gaussian; $h_t$ is the hidden vector corresponding to $\omega_t$ and $h_i$ the one corresponding to $\omega_i$; $\theta$ is a bandwidth parameter on the Euclidean distance, $\|\cdot\|$ denotes the norm, and $\propto$ denotes proportionality.
If the cache contents are never cleared and the stored associated phrases reach the million level, the system no longer performs exact search; instead it estimates the word distribution with the following approximate k-nearest-neighbor algorithm, also called a variable kernel density estimator:

$$P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) \propto \sum_{h_i \in \mathcal{N}_k(h_t)} \mathbb{1}\{\omega_t = \omega_i\}\, K\!\left(\frac{\|h_t - h_i\|}{\theta(h_t)}\right)$$

where $\omega_t$ is the current word, $\omega_1, \ldots, \omega_{t-1}$ the historical recognized sequence, $\mathcal{N}_k(h_t)$ the set of the $k$ cached hidden vectors $h_i$ with the shortest Euclidean distance to $h_t$, $K$ a kernel function (often chosen as the Gaussian kernel), and $\theta(h_t)$ the Euclidean distance from $h_t$ to its $k$ nearest words.
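The variable kernel density estimator can be sketched as follows. For clarity an exhaustive sort stands in for a real approximate nearest-neighbor index (which the million-entry cache would require in practice), and the `or 1.0` bandwidth fallback for coincident vectors is an added guard, not part of the formula.

```python
import math
from collections import defaultdict

def knn_cache_probs(h_t, cache, k):
    """Approximate k-NN cache distribution: keep the k cached
    (hidden vector, word) pairs closest to h_t in Euclidean distance,
    weight them by a Gaussian kernel with bandwidth theta(h_t) equal to
    the k-th neighbour's distance, and normalize to a distribution."""
    dists = sorted(
        (math.dist(h_t, h_i), w_i) for h_i, w_i in cache
    )[:k]                                       # the k nearest neighbours
    theta = dists[-1][0] or 1.0                 # bandwidth: k-th neighbour distance
    scores = defaultdict(float)
    for d, w in dists:
        scores[w] += math.exp(-(d / theta) ** 2)  # Gaussian kernel
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

Words whose cached hidden states cluster near the current state $h_t$ receive most of the cache probability mass, which is exactly the locality effect the estimator is meant to capture.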
According to the above, a local language model based on 3-grams is constructed.
And S302, carrying out linear interpolation to obtain the corrected language model.
Specifically, the local word probability is modified using the following linear interpolation formula:

$$P(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) = (1 - \lambda)\, P_{model}(\omega_t \mid \omega_1, \ldots, \omega_{t-1}) + \lambda\, P_{cache}(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

in which $\lambda$ is a tuning parameter and $P_{model}$ is the probability under the existing language model that the current word is $\omega_t$ given the preceding words $\omega_1, \ldots, \omega_{t-1}$. $\lambda$ is adjusted according to experimental results, finally yielding a corrected language model that contains the historical recognition text information and is used to recognize the remaining speech.
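Over a finite vocabulary the interpolation is a one-liner per word; the sketch below applies it to two probability dicts, with words absent from the cache simply receiving zero cache mass. The dict-based representation is an illustrative simplification of a real n-gram model's storage.

```python
def interpolate(p_model, p_cache, lam):
    """Linear interpolation P = (1 - lambda) * P_model + lambda * P_cache
    over the union vocabulary of the two distributions."""
    vocab = set(p_model) | set(p_cache)
    return {w: (1 - lam) * p_model.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in vocab}
```

Because both inputs are proper distributions over the same event, the interpolated result still sums to one for any $\lambda \in [0, 1]$.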
In order to solve the problem that the single sentence task type recognition system cannot utilize the historical recognition information to a certain extent, the invention also provides a continuous voice recognition self-adaptive system by combining the method.
Fig. 2 is a schematic structural diagram of a continuous speech recognition adaptive system according to an embodiment of the present invention.
As shown in fig. 2, the continuous speech recognition adaptive system may include: a speech signal receiving module 100, a speech signal segmentation module 101, an ASR decoding module 102, an artificial correction module (optional) 103, a related word searching module 200, a probability correction judgment module, a language model correction module 300, and a storage module.
The voice signal receiving module 100 is configured to receive a continuous voice signal input by a user. Specifically, the user uploads an offline audio file through the client, or clicks a recording button, inputs voice through the recording device, and clicks recording to finish recording. In the recording process, the system can automatically perform segmentation recognition according to the recorded voice signal and return a recognition result in real time.
The voice signal segmentation module 101 segments the continuous voice signal into a plurality of short voices based on the voice activity detection VAD. Specifically, a silence model is established by using a pre-labeled corpus, a silence frame recognition is performed on each frame in the continuous voice signal, the frame reaching a preset long silence threshold value is used as a voice signal segmentation point, the continuous voice signal is segmented into a plurality of effective voice segments, and therefore a voice task group to be recognized is obtained, and voice recognition is performed in sequence.
The ASR decoding module 102 is configured with a task reading unit, an automatic speech recognition model and a recognition result output unit, wherein the automatic speech recognition model comprises a dynamic language submodel; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
At the first recognition, or while the number of associated words from the historical recognition text is not yet enough to construct a local language model, the ASR decoding module decodes with the general language model or a predetermined domain language model; after the local language model is constructed, the ASR decoding module decodes with the interpolation-corrected language model.
A probability correction judging module: the short speech recognition method comprises the steps of judging whether low-frequency words or strong-field characteristic words exist in a recognition text output by the ASR decoding unit, if so, sending a starting signal to the related word searching module, and if not, sending a signal to the ASR decoding unit to start recognition of the next short speech.
And an (optional) manual modification module 103, configured to manually correct the recognized text output by the ASR decoding unit. Specifically, mis-recognized low-frequency words are corrected; manual correction prevents the system from introducing wrong associated words and further improves recognition of the continuous voice signal. Because it introduces extra interactive operations, this module is better suited to offline transcription of voice signals, trading some real-time performance for recognition accuracy.
The related word searching module 200 is configured to perform keyword search according to a related word corpus preset in the storage module, obtain a related word group of the history recognition word, and input the related word group into the cache model to obtain local observation probability information of the related word group. The association relation in the preset association word list is pronunciation approximation, semantic field approximation and logic correlation, the word groups with the relation form the association word list, and an inverted index is established on the association word list to carry out rapid keyword retrieval.
And the language model modification module 300 is configured to update the parameters of the dynamic language submodel in the ASR decoding unit according to the real-time and historical associated phrases obtained by the related word search module. Specifically, an extended cache model based on a recurrent network (RNN) processes the associated phrases of the historical recognition text; once the number of associated words reaches a threshold, the local observation probability is calculated, a local language model is built on it, and the local language model and the general language model are interpolated in the language model correction unit, finally yielding a corrected language model containing the local information of the continuous voice signal for the subsequent speech recognition process.
Wherein, the language model modification module comprises:
storage area: the related word group is used for storing the related word group output by the related word searching module;
a local probability calculation unit: used for calculating the local vocabulary probability distribution corresponding to the most recent associated phrases in the storage area.
A language model modification unit: the automatic speech recognition model is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and a dynamic language sub-model in the ASR decoding unit, and updating the automatic speech recognition model.
And sending a signal to the ASR decoding unit after updating, and starting the recognition of the next short voice.
The units/modules employed in the system of the present invention may all employ the methods described in the above embodiments.
The continuous speech recognition adaptive system of this embodiment of the invention achieves adaptation of the language model to the continuous speech content by processing the recognition text of historical speech signals, and is particularly effective when the contextual relevance, domain relevance, and topical consistency of the speech content are strong. In addition, the user is allowed to actively correct intermediate recognition results during recognition, further improving the user experience in specific task scenarios.
Examples
To demonstrate the experimental effect of the present invention, this embodiment provides a comparative experiment. The experimental method is the same as the process described above; only the specific implementation details are given here, and repeated processes are not described again.
This implementation was trained and tested using the following datasets:
LibriSpeech: a well-known open-source English dataset (LibriSpeech: an ASR corpus based on public domain audio books, Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015), comprising about 1000 hours of 16 kHz read English speech. The data are taken from audiobooks of the LibriVox project and have been carefully segmented and aligned.
News Crawl: a dataset of news articles covering 2007-2013. In this embodiment, the cache language model is tested on the News Crawl 2007-2011 sub-datasets, whose data distribution changes over time.
WikiText: an English dataset derived from Wikipedia articles, developed by Merity et al. (S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In ICLR, 2017.); the basic recognition model is trained on the original annotated-format data.
The Book Corpus: a Project Gutenberg dataset (S. Lahiri. Complexity of word collocation networks: A preliminary structural analysis. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.) containing the text corpora of 3,036 English books.
Phone-record: real telephone-scene recordings, containing about 200 hours of 16 kHz speech corpora with good-quality transcription text.
The implementation details are as follows:
In the basic speech recognition system, 39-dimensional MFCC features are extracted on LibriSpeech with a frame length of 25 milliseconds and a frame shift of 10 milliseconds; a kaldi-chain acoustic model is trained, and recognition uses the standard nnet3-latgen-faster decoding flow. In the modeling setup, the GMM stage uses 250,000 final phoneme Gaussians and 18,000 leaf nodes; the final GMM-stage alignment is used as the initial alignment for chain-model training; the chain network comprises 3 LSTM layers, each with 1024 LSTM units. The initial language model is a 3-gram model built from the LibriSpeech, WikiText, and Book Corpus text corpora;
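As a small arithmetic illustration of the 25 ms window / 10 ms shift framing above (a sketch of the framing arithmetic only, not kaldi's actual feature pipeline):

```python
def num_frames(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Count analysis frames for fixed-length framing with a fixed shift."""
    frame_len = sample_rate * frame_ms // 1000    # 400 samples at 16 kHz
    frame_shift = sample_rate * shift_ms // 1000  # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift
```

One second of 16 kHz audio therefore yields 98 frames, each producing one 39-dimensional MFCC vector.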
For the associated word list, a MySQL relational database is built over the News Crawl, WikiText, and Book Corpus texts, multidimensional tags including topic and domain information are set, and an inverted index is established. In the related-word search process, the low-frequency-word threshold is -4.75; words above this threshold are not processed and are not added to the cache model. The local language model interpolation coefficient λ is set to 0.25;
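The low-frequency-word test can be sketched as follows; treating the -4.75 threshold as a log10 unigram probability and using add-one smoothing are both assumptions made for illustration, not details given in the patent:

```python
import math

LOW_FREQ_LOG_THRESHOLD = -4.75  # threshold from the embodiment; log10 scale assumed

def is_low_frequency(word, unigram_counts, total_tokens):
    """Flag a word for cache-model processing when its estimated log10
    unigram probability falls below the threshold."""
    vocab_size = len(unigram_counts)
    # Add-one smoothing (an illustrative assumption) so unseen words get a
    # small nonzero probability instead of breaking the log.
    p = (unigram_counts.get(word, 0) + 1) / (total_tokens + vocab_size)
    return math.log10(p) < LOW_FREQ_LOG_THRESHOLD
```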
performance evaluation:
for cached language model performance, the recognition accuracy is evaluated using Word Error Rate (WER), and the calculation formula is as follows:
Figure BDA0002796208390000111
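The WER defined here can be computed with a word-level edit distance; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed via word-level Levenshtein distance
    against the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.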
in this implementation, the recognition performance of four schemes is evaluated:
table 1: caching language model comparison test results
Wherein, Base: only the basic speech recognition system is used, without modifying the initial language model;
Local cache: the baseline proposed by Grave et al. (E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. In ICLR, 2017.), a relatively common method of using historical recognition information;
Cache LM: the cache language model used by the invention, with the word error rate calculated over the continuous recognition results;
Final LM: each test dataset is re-recognized using the final language model, which is not further modified by the cache model during this recognition.
According to Table 1, the method of the invention improves performance over the control schemes on all three test sets. Where the training and test sets are similar and low-frequency words are few, the gain from the cache language model is limited; on the other two test sets, particularly the strongly domain-specific Phone-record, the effect is significant.
It is worth noting that, as an evaluation of the final language-model modification effect, Final LM achieves the best accuracy on all three test sets; that is, the modified language model describes the textual relations of the test data better than the original language model, which fully demonstrates that modifying the language model with historical recognition information is effective and has a positive influence on context-associated speech recognition tasks.
For the manual modification module, this embodiment performs a test on a single one-hour conference recording; the results are shown in Table 2:
Table 2: manual correction module test; the data are error rates of specific words
In the conference recording, two low-frequency words that do not appear, or appear only rarely, in the training data were selected. The system without manual correction (base) recognizes them poorly; after manual correction is applied (modified), the recognition accuracy of the system improves markedly, fully demonstrating the benefit of the manual correction module for long-speech transcription tasks.
While the present invention has been described with reference to the above embodiments, it is not intended to limit the invention. Various modifications and alterations may be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the protection scope of the invention is subject to the claims of the invention.

Claims (8)

1. A speech recognition adaptive method based on a cache language model is characterized by comprising the following steps:
step 1: aiming at a section of continuous long voice, firstly, a plurality of short voices are obtained through segmentation, and a task queue is formed according to a time sequence;
step 2: taking a first short voice in the task queue as the input of an automatic voice recognition system, obtaining a recognition text, and deleting the short voice from the task queue; the automatic speech recognition system comprises a dynamic language model, and a preset language model is used as an initialized dynamic language model;
step 3: establishing a cache model, and judging in real time, according to the recognition text of each short voice, whether probability correction is needed; if not, returning to step 2 until the task queue is empty, completing the recognition task; if yes, performing a keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of the cache model, calculating a local vocabulary probability distribution, and constructing a local language model;
step 4: interpolating and merging the local language model constructed in step 3 with the dynamic language model in the automatic speech recognition system to obtain an updated dynamic language model, and returning to step 2 until the task queue is empty, completing the recognition task.
2. The adaptive speech recognition method based on cached language model according to claim 1, wherein the recognized text output by the automatic speech recognition system in step 2 is modified manually.
3. The speech recognition adaptive method based on the cache language model according to claim 1, wherein the step 3 specifically comprises:
step 3.1: establishing a cache model which comprises a plurality of cache regions;
step 3.2: judging in real time, according to the recognition text of each short voice, whether low-frequency words or strong domain-feature words exist; if so, performing probability correction and entering step 3.3; if not, performing no probability correction and returning to step 2 until the task queue is empty, completing the recognition task;
step 3.3: performing keyword search according to a preset associated word list to obtain a keyword group, storing the keyword group in a cache region of a cache model, and calculating the probability distribution of local vocabularies, wherein the calculation formula is as follows:
Pcache(ωt|ω1,...,ωt-1) ∝ Σi=1..t-1 1{ωt=ωi}·K(||ht-hi||/θ)
in the formula, ω1,...,ωt-1 denotes the historically recognized text sequence, ωi is the i-th word of that sequence, and ωt is the word at the current time; Pcache(ωt|ω1,...,ωt-1) denotes the probability, based on the cached history, that the current word of the text sequence is ωt; K denotes a kernel function, ht is the hidden vector corresponding to ωt, hi is the hidden vector corresponding to ωi, θ denotes a Euclidean-distance bandwidth, ||·|| denotes the vector norm, and ∝ denotes proportionality;
step 3.4: and constructing a local language model based on the 3-gram according to the local vocabulary probability distribution.
4. The adaptive speech recognition method according to claim 3, wherein, when the associated phrases stored in the cache region of the cache model reach a million-scale threshold, the local vocabulary probability distribution formula in step 3.3 is replaced with the following formula:
Pcache(ωt|ω1,...,ωt-1) ∝ Σhi∈N(ht) 1{ωt=ωi}·K(||ht-hi||/θ(ht))
in the formula, N(ht) is the set of hidden vectors hi with the shortest Euclidean distance to ht, and θ(ht) is the Euclidean distance from ht to its neighboring words.
5. The speech recognition adaptive method based on the cache language model according to claim 4, wherein the step 4 specifically comprises:
the local language model and the dynamic language model in the automatic speech recognition system are merged by interpolation, and the probability of the local word is corrected, wherein the formula is as follows:
P(ωt|ω1,...,ωt-1) = (1-λ)·Pmodel(ωt|ω1,...,ωt-1) + λ·Pcache(ωt|ω1,...,ωt-1)
in which λ is a tuning parameter, and Pmodel(ωt|ω1,...,ωt-1) denotes the probability, under the existing language model, that the current word is ωt given the preceding words ω1,...,ωt-1;
and (5) obtaining an updated dynamic language model through the corrected word probability, returning to the step (2) until the task queue is empty, and completing the recognition task.
6. A speech recognition adaptive system based on a cache language model, for implementing the speech recognition adaptive method of any one of claims 1-5, comprising:
the voice signal receiving module: for receiving a continuous speech signal to be recognized;
the voice signal segmentation module: the voice processing device is used for cutting a section of continuous long voice acquired by the voice signal receiving module into a plurality of short voices;
a storage module: used for storing the task queue to be recognized, the recognized text results, and the related-word corpus, wherein the task queue is formed by the plurality of short voices in time order;
an ASR decoding module: the system is provided with a task reading unit, an automatic voice recognition model and a recognition result output unit, wherein the automatic voice recognition model comprises a dynamic language sub-model; the task reading unit is used for reading a first short voice in the task storage module as the input of an automatic voice recognition model, the automatic voice recognition model is used for recognizing the voice into a text, and the recognition result output unit is used for outputting the recognized text to the storage module;
a probability correction judging module: used for judging whether low-frequency words or strong domain-feature words exist in the recognition text output by the ASR decoding module; if so, sending a start signal to the related word search module; if not, sending a signal to the ASR decoding module to start recognition of the next short voice;
a related word search module: used for performing a keyword search according to the related-word corpus preset in the storage module, obtaining the related word group of a historically recognized word, and inputting the related word group into the cache model to obtain local observation probability information of the related word group;
a language model modification module: and the parameter updating module is used for updating the parameters of the dynamic language submodel in the ASR decoding module according to the real-time associated phrases and the historical associated phrases obtained by the associated word searching module.
7. The adaptive speech recognition system according to claim 6, wherein the language model modification module comprises:
storage area: the related word group is used for storing the related word group output by the related word searching module;
a local probability calculation unit: the local vocabulary probability distribution corresponding to the associated phrases at the latest moment of the storage area is calculated;
a language model modification unit: the automatic speech recognition model is used for constructing a local language model according to the local vocabulary probability distribution, interpolating and combining the local language model and a dynamic language submodel in an ASR decoding module, and updating the automatic speech recognition model;
and after updating, sending a signal to an ASR decoding module to start the recognition of the next short voice.
8. The cache language model-based speech recognition adaptation system of claim 6, further comprising:
a manual correction module: for manually modifying the recognized text output by the ASR decoding module.
CN202011332443.2A 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model Active CN112509560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011332443.2A CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011332443.2A CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Publications (2)

Publication Number Publication Date
CN112509560A true CN112509560A (en) 2021-03-16
CN112509560B CN112509560B (en) 2021-09-03

Family

ID=74958319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332443.2A Active CN112509560B (en) 2020-11-24 2020-11-24 Voice recognition self-adaption method and system based on cache language model

Country Status (1)

Country Link
CN (1) CN112509560B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767921A (en) * 2021-01-07 2021-05-07 国网浙江省电力有限公司 Voice recognition self-adaption method and system based on cache language model
CN113421553A (en) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 Audio selection method and device, electronic equipment and readable storage medium
CN113741783A (en) * 2021-07-30 2021-12-03 北京搜狗科技发展有限公司 Key identification method and device for identifying key

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030083863A1 (en) * 2000-09-08 2003-05-01 Ringger Eric K. Augmented-word language model
CN102880611A (en) * 2011-07-14 2013-01-16 腾讯科技(深圳)有限公司 Language modeling method and language modeling device
US20150332670A1 (en) * 2014-05-15 2015-11-19 Microsoft Corporation Language Modeling For Conversational Understanding Domains Using Semantic Web Resources
US20180218728A1 (en) * 2017-02-02 2018-08-02 Adobe Systems Incorporated Domain-Specific Speech Recognizers in a Digital Medium Environment
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110930993A (en) * 2018-09-20 2020-03-27 蔚来汽车有限公司 Specific field language model generation method and voice data labeling system
CN111276124A (en) * 2020-01-22 2020-06-12 苏州科达科技股份有限公司 Keyword identification method, device and equipment and readable storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767921A (en) * 2021-01-07 2021-05-07 国网浙江省电力有限公司 Voice recognition self-adaption method and system based on cache language model
CN113421553A (en) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 Audio selection method and device, electronic equipment and readable storage medium
CN113421553B (en) * 2021-06-15 2023-10-20 北京捷通数智科技有限公司 Audio selection method, device, electronic equipment and readable storage medium
CN113741783A (en) * 2021-07-30 2021-12-03 北京搜狗科技发展有限公司 Key identification method and device for identifying key

Also Published As

Publication number Publication date
CN112509560B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN111480197B (en) Speech recognition system
Asami et al. Domain adaptation of dnn acoustic models using knowledge distillation
CN106683677B (en) Voice recognition method and device
US8280733B2 (en) Automatic speech recognition learning using categorization and selective incorporation of user-initiated corrections
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
US20210312914A1 (en) Speech recognition using dialog history
Hori et al. Speech recognition algorithms using weighted finite-state transducers
EP1465154B1 (en) Method of speech recognition using variational inference with switching state space models
CN112927682B (en) Speech recognition method and system based on deep neural network acoustic model
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
Bacchiani et al. Joint lexicon, acoustic unit inventory and model design
Kala et al. Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection
Demuynck Extracting, modelling and combining information in speech recognition
WO2010100853A1 (en) Language model adaptation device, speech recognition device, language model adaptation method, and computer-readable recording medium
CN112542170A (en) Dialogue system, dialogue processing method, and electronic device
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
Becerra et al. A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish
Tabibian A survey on structured discriminative spoken keyword spotting
Norouzian et al. An approach for efficient open vocabulary spoken term detection
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
JPH09134192A (en) Statistical language model forming device and speech recognition device
Huda et al. A variable initialization approach to the EM algorithm for better estimation of the parameters of hidden markov model based acoustic modeling of speech signals
CN113593560B (en) Customizable low-delay command word recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant