CN110473531B - Voice recognition method, device, electronic equipment, system and storage medium - Google Patents

Voice recognition method, device, electronic equipment, system and storage medium

Info

Publication number
CN110473531B
CN110473531B (application CN201910837038.7A)
Authority
CN
China
Prior art keywords
decoding result
voice recognition
language model
frequency
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910837038.7A
Other languages
Chinese (zh)
Other versions
CN110473531A (en)
Inventor
曹立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co Ltd
Priority to CN201910837038.7A
Publication of CN110473531A
Application granted
Publication of CN110473531B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G10L15/18 Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice recognition method, device, electronic equipment, system and storage medium, relating to artificial intelligence technology; machine learning techniques from artificial intelligence are used to carry out the voice recognition. The method comprises the following steps: acquiring a voice signal to be recognized; dynamically decoding the acoustic features of the voice signal to be recognized with a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, wherein the dynamic decoder comprises an incremental language model loaded online, the incremental language model being updated based on the correct texts that users submit in real time for voice recognition errors; and determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result. The voice recognition method, device, electronic equipment, system and storage medium provided by the embodiments of the application can repair voice recognition errors in real time and improve the accuracy of voice recognition.

Description

Voice recognition method, device, electronic equipment, system and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, electronic device, system, and storage medium.
Background
Speech recognition is the process of converting a speech signal into text. It has a complex processing flow, mainly comprising four stages: acoustic model training, language model training, decoder construction, and decoding. Owing to the complexity of pronunciation and expression in human languages, a speech recognition system often makes recognition errors, so the system needs to be repaired to avoid repeating the same errors in subsequent recognition.
The method commonly used at present for repairing a speech recognition system is: collect the correct texts corresponding to the errors fed back by users, update an incremental language model with the collected texts, interpolate the updated incremental language model with the full-scale language model to obtain a repaired language model, then restart the decoder, load the repaired language model into it, and, once loading is complete, perform speech recognition with the repaired language model.
However, this repair method has the following drawbacks: the full-scale language model contains a large amount of data, so both the interpolation process and the decoder loading process take a long time and errors cannot be repaired in real time; moreover, no speech recognition service can be provided while the decoder is restarting, so the service is interrupted.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device, an electronic device, a voice recognition system and a storage medium, which can repair voice recognition errors in real time and improve the accuracy of voice recognition.
In one aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring a voice signal to be recognized;
dynamically decoding the acoustic characteristics of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, wherein the dynamic decoder comprises an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on correct text corresponding to a voice recognition error submitted by a user in real time;
and determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result.
In one aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the acquisition module is used for acquiring a voice signal to be recognized;
the decoding module is used for dynamically decoding the acoustic characteristics of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, the dynamic decoder comprises an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on a correct text corresponding to a voice recognition error submitted by a user in real time;
and the determining module is used for determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result.
Optionally, the decoding module is specifically configured to:
obtaining the at least one decoding result and the acoustic probability corresponding to each decoding result based on a search network;
obtaining a first language probability corresponding to each decoding result based on the full-scale language model;
obtaining a second language probability corresponding to each decoding result based on the incremental language model;
and respectively determining the score corresponding to each decoding result according to the acoustic probability, the first language probability and the second language probability corresponding to each decoding result.
Optionally, the speech recognition apparatus provided in the embodiment of the present application further includes an update module, configured to:
and if the incremental language model is updated, acquiring the updated incremental language model, and loading the updated incremental language model into the dynamic decoder on line.
Optionally, the determining module is further configured to: before determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result, respectively aiming at each decoding result in the at least one decoding result, if the decoding result contains the high-frequency words in the high-frequency word list, increasing the score corresponding to the decoding result.
Optionally, the determining module is specifically configured to:
and if the decoding result contains the high-frequency words in the high-frequency word list, adding corresponding scores to the scores of the decoding result according to the matching condition of the decoding result and the high-frequency words in the high-frequency word list.
Optionally, the determining module is specifically configured to:
if the decoding result is consistent with any high-frequency word in the high-frequency word list, the score of the decoding result is increased by a first score;
and if the decoding result is inconsistent with all the high-frequency words in the high-frequency word list but contains a high-frequency word in the high-frequency word list, increasing the score of the decoding result by a second score, wherein the first score is higher than the second score.
Optionally, the high frequency word list is updated by:
acquiring the correct high-frequency words submitted for high-frequency word recognition errors;
and if the submitted correct high-frequency word is not in the high-frequency word list, adding the submitted correct high-frequency word into the high-frequency word list.
Optionally, the high-frequency words submitted for the high-frequency word recognition error are obtained by:
identifying, from the correct texts submitted by users in real time, the correct high-frequency words submitted for high-frequency word recognition errors by using a high-frequency word recognition model, wherein the high-frequency word recognition model is a model, trained based on a neural network, for identifying whether an input text is a high-frequency word;
or, alternatively,
acquiring the correct high-frequency words submitted by users in real time through the high-frequency word error reporting entry.
In one aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In one aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, an embodiment of the present application provides a speech recognition system, including a speech acquisition device and a speech recognition device;
the voice acquisition device is used for acquiring a voice signal to be recognized;
the voice recognition device is used for determining the voice recognition result of the voice signal to be recognized by adopting any one of the methods.
The voice recognition method, device, electronic equipment, system and storage medium provided by the embodiments of the application can update the incremental language model, during the voice recognition process, based on the correct texts that users submit in real time for voice recognition errors, and then load the updated incremental language model into the dynamic decoder by hot loading; when voice signals are subsequently decoded, the interpolation result of the full-scale language model and the updated incremental language model serves as the language probability used in scoring, thereby repairing the errors. The decoder never needs to be restarted, so voice recognition errors can be repaired in real time without interrupting the voice recognition service; the voice recognition system learns from the errors corrected by users, the probability of the same errors appearing again is reduced, and the recognition accuracy of the voice recognition system is improved. In addition, because the data volume of the incremental language model is small, and only the incremental language model is updated and loaded, the repair speed can be greatly improved and second-level repair achieved, so that recognition errors discovered on site can be repaired quickly and in time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings described below are only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart illustrating updating an incremental language model according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic flow chart illustrating a process of repairing a high-frequency word recognition error according to an embodiment of the present application;
fig. 6 is a schematic flowchart of training a high-frequency word recognition model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Speech recognition, i.e. the process of converting a speech signal into text. The speech recognition has a complex processing flow, and mainly comprises four processes of acoustic model training, language model training, decoder construction and decoding.
An Acoustic Model (AM) is one of the most important parts of a speech recognition system; it classifies the acoustic features of a speech signal into phonemes. At present, mainstream systems mostly adopt hidden Markov models for this modeling.
A Language Model (LM) is a component of a speech recognition system: a statistical model of a language obtained by counting grammatical distributions in a text corpus, whose aim is to establish a distribution describing the probability that a given word sequence occurs in the language. That is, a language model describes the probability distribution of words and can reliably reflect the distribution of the words used in language recognition. Language models play an important role in natural language processing and are widely applied in speech recognition, machine translation, and other fields. For example, in speech recognition a language model can be used to pick the word sequence with the highest probability among multiple word sequences, or to predict the most likely next word given several words. Commonly used language models include the N-Gram LM (N-gram language model), Bi-Gram LM (bigram language model), and Tri-Gram LM (trigram language model).
A pronunciation dictionary is a mapping from phonemes to words, used to connect the acoustic model and the language model.
The decoder is the engine of speech recognition. It mainly comprises a search network constructed from an acoustic model, a pronunciation dictionary and a language model. Taking the acoustic features of a speech signal as input, it searches the network for the highest-scoring path as the optimal path; the word sequence corresponding to the optimal path is the recognition result of the speech signal. The score of a path comprises an acoustic score (i.e. acoustic probability) determined by the acoustic model and a language score (i.e. language probability) determined by the language model.
Compared with a traditional static decoder, a dynamic decoder compiles only the acoustic model and the pronunciation dictionary into the state network, forming a phoneme-to-word search network. The search network is therefore smaller, and the language probability given by the language model is obtained by dynamic query during decoding.
Language model interpolation fuses the statistical probabilities of two language models, so that the interpolated language model can take into account the original domains of both. For example, suppose a text corresponds to language probability P1 in language model A and language probability P2 in language model B; interpolating A and B with weight α, the language probability of the text after interpolation is α·P1 + (1 - α)·P2.
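As an illustration of the formula above, here is a minimal Python sketch; the function name and the example probabilities are illustrative assumptions, not part of the patent:

```python
def interpolate(p_a: float, p_b: float, alpha: float) -> float:
    """Fuse the statistical probabilities of two language models:
    alpha * P1 + (1 - alpha) * P2."""
    return alpha * p_a + (1 - alpha) * p_b

# A text has probability 0.02 under language model A and 0.30 under
# language model B; with weight alpha = 0.8 the interpolated probability is:
print(interpolate(0.02, 0.30, alpha=0.8))  # 0.076
```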
The full-scale language model is obtained from statistics over a huge number of corpora; for example, the corpora participating in the statistics may number in the tens of millions, hundreds of millions, or even billions. The amount of corpora is determined by the recognition-accuracy requirements of the application scenario and is not limited here.
The corpus of the incremental language model is far smaller than that of the full-scale language model, so the volume of the incremental language model is far smaller as well. To ensure that the full-scale language model and the incremental language model can smoothly perform dynamic language-model interpolation decoding, their word lists need to be kept consistent; that is, every word appearing in the incremental language model also appears in the full-scale language model.
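A minimal sketch of checking the word-list constraint just described; representing word lists as Python sets is an illustrative assumption (real n-gram models store their vocabularies in their own formats):

```python
def missing_words(incremental_vocab: set, full_vocab: set) -> set:
    """Words in the incremental model's word list that are absent from the
    full-scale model's word list; this must be empty for dynamic
    interpolation decoding to proceed smoothly."""
    return incremental_vocab - full_vocab

full = {"introduce", "next", "movie"}
incremental = {"movie", "Nezha"}
print(missing_words(incremental, full))  # {'Nezha'}: must first be added to the full model's word list
```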
A terminal device is a device capable of installing various applications and displaying the objects provided by an installed application; it may be mobile or fixed. For example, it may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, a personal digital assistant (PDA), a point-of-sale (POS) terminal, or any other electronic device capable of implementing the above functions.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In a specific practical process, the method commonly used at present for repairing a speech recognition system is: collect the correct texts corresponding to speech recognition errors fed back by users, update an incremental language model with the collected texts, interpolate the updated incremental language model with the full-scale language model to obtain a repaired language model, then restart the decoder and load the repaired language model into it, and, after loading is completed, perform speech recognition with the repaired language model. However, this repair method has the following drawbacks. The full-scale language model contains a large amount of data, so interpolating it with the incremental language model takes a long time, and loading the result into the decoder also takes a long time, which slows down the repair. In addition, no speech recognition service can be provided while the decoder is restarting, so the service is interrupted, and the only way to avoid frequent interruptions is to reduce the repair frequency. Because of these defects, the existing repair method cannot repair speech recognition errors in real time, which greatly degrades the user experience in applications with high requirements on real-time performance and recognition accuracy. For example, in simultaneous interpretation of important conferences, the tolerance for speech recognition errors is low, so errors discovered on site must be repaired quickly and in time.
Therefore, the inventors of the application provide an error reporting entry through which users can submit the correct text corresponding to a speech recognition error. Once a user finds a speech recognition error while using the speech recognition system, the user can immediately submit the correct text through the error reporting entry, and the incremental language model is updated in real time with the text submitted by the user. Because the incremental language model contains little data, the updated incremental language model can be quickly loaded online into the dynamic decoder by hot loading, overwriting the original incremental language model in the dynamic decoder; the speech recognition error reported by the user is thus repaired without restarting the dynamic decoder, and subsequent decoding performs dynamic interpolation with the full-scale language model and the updated incremental language model. Meanwhile, to ensure that the incremental language model in the decoder can be updated online, the method performs speech recognition with a dynamic decoder that comprises a pre-loaded search network constructed from the acoustic model and the pronunciation dictionary, a pre-loaded full-scale language model, and an incremental language model loaded online. Dynamic decoding searches every possible path in the search network, where one path corresponds to one decoding result, and obtains the score of each decoding result. The score of each decoding result comprises the acoustic probability obtained from the search network and the language probability determined by dynamic interpolation decoding over the full-scale language model and the incremental language model: during dynamic decoding, the first language probability of each decoding path is obtained by dynamically querying the full-scale language model, the second language probability is obtained by dynamically querying the incremental language model, and the two are interpolated to obtain the language probability of the decoding path.
Based on the dynamic decoder constructed as above, the specific speech recognition process includes: obtaining a speech signal to be recognized; dynamically decoding it with the dynamic decoder to obtain at least one decoding result and the score corresponding to each decoding result; and determining the speech recognition result of the speech signal to be recognized from the obtained decoding results according to their scores. The method can update the incremental language model in real time, during speech recognition, based on the correct texts that users submit for speech recognition errors, then load the updated incremental language model into the dynamic decoder by hot loading; when speech signals are subsequently decoded, the interpolation result of the full-scale language model and the updated incremental language model serves as the language probability used in scoring, thereby repairing the error. The decoder never needs to be restarted, so speech recognition errors can be repaired in real time without interrupting the speech recognition service; the speech recognition system learns from the errors corrected by users, the probability of the same errors recurring is reduced, and recognition accuracy is improved. In addition, because the data volume of the incremental language model is small and only the incremental language model is updated and loaded, the repair speed is greatly improved and second-level repair is achieved, so that recognition errors discovered on site can be repaired quickly and in time.
The speech recognition method provided by the embodiments of the application can recognize speech signals of any language, such as Chinese, English, Japanese or Korean. In specific implementation, for each language, the acoustic model, the full-scale language model, the incremental language model and so on are trained with corpora of that language, yielding a dynamic decoder for that language that decodes its speech signals and is repaired in real time based on the errors submitted by users. The embodiments of the application mainly take Chinese as an example to explain the method; the method is similar for other languages, so their description is omitted.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application. The application scenario includes a plurality of terminal devices 101 (terminal device 101-1, terminal device 101-2, …, terminal device 101-n), a speech recognition server 102, and an update server 103. The terminal devices 101, the speech recognition server 102 and the update server 103 are connected via a wireless or wired network. A terminal device 101 includes, but is not limited to, an electronic device such as a desktop computer, mobile phone, mobile computer, tablet computer, media player, smart wearable device, or smart television. The speech recognition server 102 and the update server 103 may each be a server, a server cluster composed of multiple servers, or a cloud computing center. Of course, the speech recognition server 102 and the update server 103 shown in fig. 1 may also be arranged in the same server or server cluster.
The terminal device 101 shown in fig. 1 sends the collected voice signal to the voice recognition server 102, and the voice recognition server 102 decodes the voice signal by using the dynamic decoder to obtain a voice recognition result, and feeds back the voice recognition result to the terminal device 101. When the user finds that the voice recognition result is wrong, the correct text corresponding to the wrong voice recognition result can be submitted through the terminal device 101, and the terminal device 101 sends the correct text to the update server 103. The update server 103 updates the incremental language model based on the correct text submitted by the user, and sends the updated incremental language model to the speech recognition server 102. The speech recognition server 102 loads the updated incremental language model into the dynamic decoder in a hot loading manner, and covers the original incremental language model in the dynamic decoder, thereby realizing the repair of the error.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings. Although the embodiments provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer steps based on conventional or non-inventive labor. In steps with no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application.
The scheme provided by the embodiment of the application relates to the voice recognition technology of artificial intelligence, and the technical scheme provided by the embodiment of the application is explained below with reference to the application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides a speech recognition method, including the following steps:
s201, obtaining a voice signal to be recognized.
When the method is specifically implemented, the terminal equipment sends the collected voice signals of the user to the voice recognition server in real time, and the voice recognition server carries out voice recognition on the voice signals.
S202, dynamically decoding the acoustic features of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, wherein the dynamic decoder comprises an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on correct text corresponding to the voice recognition error submitted by the user in real time.
In specific implementation, the dynamic decoder comprises a search network, a full-scale language model and an incremental language model. Wherein the search network is a phoneme-to-word search network constructed based on the trained acoustic model and the pronunciation dictionary. The method comprises the steps of loading a built search network and a trained full-scale language model into a dynamic decoder of a voice recognition server in advance, loading an initial incremental language model into the dynamic decoder, and starting the dynamic decoder, so that a voice signal can be decoded through the dynamic decoder.
In specific implementation, the acoustic features of the speech signal to be recognized can be extracted as follows: frame the speech signal to obtain a plurality of audio frames, then extract acoustic features from each audio frame to obtain the acoustic feature vector corresponding to each frame. Framing cuts audio of indefinite length into small segments of fixed length, generally 10-30 ms per frame; it can be realized with a moving window function, and adjacent audio frames overlap so that the window boundaries do not miss any signal. The extracted acoustic features include, but are not limited to, Fbank features, MFCC (Mel Frequency Cepstral Coefficients) features, and spectrogram features. Methods for extracting acoustic features from a speech signal are prior art and are not described in detail.
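A minimal numpy sketch of the framing step with an overlapping moving window; the 25 ms frame length and 10 ms hop are illustrative choices within the 10-30 ms range mentioned above, not values fixed by the patent:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Cut a 1-D audio signal into fixed-length, overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 25 ms -> 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # adjacent frames overlap by 15 ms
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # apply a window function per frame to soften the frame boundaries
    return frames * np.hamming(frame_len)

frames = frame_signal(np.random.randn(16000), sample_rate=16000)  # 1 s of audio
print(frames.shape)  # (98, 400); feature extraction (Fbank/MFCC) then runs per frame
```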
Inputting the extracted acoustic features of the speech signal to be recognized into the dynamic decoder for decoding yields the word sequences possibly corresponding to the speech signal, namely the decoding results. The decoding process using the dynamic decoder specifically includes: searching possible decoding paths in the search network based on the acoustic features of the speech signal to be recognized, where one decoding path corresponds to one decoding result; in the searching process, for each possible decoding path, determining the acoustic probability P_A of the decoding path based on the acoustic model, while obtaining the first language probability P_L1 of the decoding path by dynamically querying the full-scale language model and the second language probability P_L2 by dynamically querying the incremental language model; interpolating the first language probability P_L1 and the second language probability P_L2 to obtain the language probability P_L of the decoding path; and then determining the score of the decoding path from the acoustic probability P_A and the language probability P_L. Specifically, P_L = α·P_L1 + (1 - α)·P_L2, where α can be determined by those skilled in the art according to application requirements and practical experience; the embodiments of the present application impose no limit. Specifically, the score of a decoding path equals the sum of its acoustic probability P_A and its language probability P_L, i.e. Score = P_A + P_L.
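A minimal sketch of the per-path scoring just described; the language-model objects and their `prob` interface are assumptions for illustration, and the patent's formula is followed literally (practical decoders usually combine these quantities in the log domain):

```python
from dataclasses import dataclass

@dataclass
class DecodePath:
    words: list    # word sequence of this decoding path
    p_a: float     # acoustic probability P_A from the search network

def score_path(path: DecodePath, full_lm, incr_lm, alpha: float) -> float:
    """Score = P_A + P_L, with P_L = alpha * P_L1 + (1 - alpha) * P_L2."""
    p_l1 = full_lm.prob(path.words)   # dynamic query of the full-scale LM (assumed interface)
    p_l2 = incr_lm.prob(path.words)   # dynamic query of the incremental LM (assumed interface)
    p_l = alpha * p_l1 + (1 - alpha) * p_l2
    return path.p_a + p_l
```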
For the acoustic features of a speech signal to be recognized, the dynamic decoder outputs a plurality of decoding results and the score corresponding to each. In practical applications, one or more decoding results may be selected from all those output by the dynamic decoder, according to application requirements, as candidate results: for example, the N highest-scoring decoding results, or the decoding results whose scores exceed a preset threshold. N and the preset threshold may be determined according to actual application requirements; the embodiments of the present application impose no limit.
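A minimal sketch of the two candidate-selection strategies; representing decoder output as (decoding result, score) tuples is an illustrative assumption:

```python
def select_candidates(results, top_n=None, threshold=None):
    """Keep either the N highest-scoring decoding results or all results
    whose score exceeds a preset threshold."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if top_n is not None:
        return ranked[:top_n]
    if threshold is not None:
        return [r for r in ranked if r[1] > threshold]
    return ranked

results = [("introduce the movie Nezha", 9.1),
           ("introduce the movie Nazha", 8.7),
           ("in two the movie Nezha", 3.2)]
print(select_candidates(results, top_n=2))        # two highest-scoring candidates
print(select_candidates(results, threshold=8.0))  # candidates above the threshold
```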
It should be noted that once the dynamic decoder is started, updating the search network or the full-scale language model requires restarting the decoder. However, because the data size of the incremental language model is small, the dynamic decoder can load it quickly, often in a few seconds or less; the incremental language model can therefore be hot-loaded in real time without restarting the dynamic decoder.
S203, determining a voice recognition result of the voice signal to be recognized from the at least one obtained decoding result according to the score corresponding to each decoding result.
In specific implementation, if a plurality of decoding results are obtained, the voice recognition result of the voice signal to be recognized can be determined from the obtained plurality of decoding results through any one of the following strategies; if only one decoding result is obtained, the decoding result can be directly determined as the voice recognition result of the voice signal to be recognized.
As a possible implementation manner, the decoding result with the highest score in the obtained at least one decoding result may be selected as the voice recognition result of the voice signal to be recognized.
As another possible implementation manner, the recognition result corresponding to the speech signal before the speech signal to be recognized may also be used as context information, and the speech recognition result of the speech signal to be recognized may be determined by combining the context information and the scores corresponding to the candidate results.
Specifically, the degree of association between each candidate result and the context information may be calculated, the score corresponding to each candidate result adjusted according to that degree of association, and then, based on the adjusted scores, the candidate result with the highest score selected as the speech recognition result of the speech signal to be recognized. The degree of association may be determined from the domain information, the intention information, and the keywords of the candidate result and of the context information. For example, if a candidate result matches the domain information and intention information of the context information and contains the same keywords, its degree of association with the context information is high, indicating a high probability that the candidate is the correct speech recognition result, so its score can be raised. Conversely, if a candidate result differs from the domain information and intention information of the context information and contains no common keyword, its degree of association with the context information is low, indicating a low probability that the candidate is the correct speech recognition result, so its score can be reduced. In specific implementation, the correspondence between the degree of association and the magnitude of the score adjustment may be set according to the actual application; the embodiments of the present application impose no limit.
As another possible implementation manner, the word vector corresponding to the candidate result may be input into the neural network language model to obtain a probability value corresponding to each candidate result, and the voice recognition result of the voice signal to be recognized may be determined according to the probability value and the score corresponding to each candidate result.
Specifically, the score corresponding to each candidate result may be adjusted according to the probability value of each candidate result, and then, based on the adjusted score, the candidate result with the highest score is selected as the voice recognition result of the voice signal to be recognized. For example, for each candidate result, the probability value and the score may be weighted, and the weighted result may be used as the adjusted score corresponding to the candidate result, or the product of the probability value and the score may be used as the adjusted score corresponding to the candidate result.
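A minimal sketch of the two adjustment variants just described (weighted sum and product); the 0.5 weight and the example values are illustrative assumptions:

```python
def rescore(candidates, nn_lm_probs, weight=0.5, use_product=False):
    """candidates: list of (text, decoder score); nn_lm_probs: the matching
    neural-network language model probabilities. Returns the candidate with
    the highest adjusted score."""
    adjusted = []
    for (text, score), p in zip(candidates, nn_lm_probs):
        new_score = score * p if use_product else weight * score + (1 - weight) * p
        adjusted.append((text, new_score))
    return max(adjusted, key=lambda c: c[1])

best = rescore([("introduce the movie Nezha", 0.9),
                ("introduce the movie Nazha", 0.8)],
               nn_lm_probs=[0.7, 0.2])
print(best)  # ('introduce the movie Nezha', 0.8)
```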
A neural network language model is a class of language model obtained by training a neural network. Through distributed representation, it maps words into a continuous space, effectively alleviating the data sparsity problem. Training the neural network on a large number of training corpora yields a neural network language model describing the grammatical distribution of the text corpora; given the word vectors corresponding to a text as input, the model outputs the probability value of that text in the language. Neural network language models are prior art and are not elaborated here.
In specific implementation, the speech recognition server feeds the determined speech recognition result of the speech signal back to the terminal device, which displays it to the user. If the user finds the result wrong, the user can submit error feedback information through the error reporting entry provided by the terminal device; this information includes the correct text, input by the user, corresponding to the speech signal. The terminal device sends the error feedback information to the update server, which updates the incremental language model based on the correct text in the feedback. As shown in fig. 3, the specific update process includes:
s301, acquiring correct text corresponding to the voice recognition error submitted by the user in real time.
S302, performing word segmentation processing on the correct text to obtain a word sequence consisting of a plurality of word segmentation segments.
In specific implementation, the word segmentation processing can be performed on the correct text through the existing word segmentation tools (such as jieba, SnowNLP, THULAC, NLPIR, and the like) so that the correct text is divided into a plurality of word segmentation segments, and a word sequence formed by the word segmentation segments is obtained. For example, if the correct text is "introduce a movie Nezha", the word segmentation results in four word segmentation segments of "introduce", "next", "movie", "Nezha", and the resulting word sequence is { introduce, next, movie, Nezha }.
In specific implementation, the correct text can be preprocessed before word segmentation to improve segmentation accuracy. The preprocessing includes filtering out meaningless characters such as special symbols, and unifying formats such as letter case and full-width/half-width characters.
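A minimal sketch of the preprocessing and segmentation steps using the jieba tokenizer mentioned above; the specific preprocessing rules are illustrative assumptions:

```python
import re
import jieba

def preprocess(text: str) -> str:
    """Unify full-width characters to half-width, lower-case Latin letters,
    and filter out special symbols."""
    text = "".join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c
                   for c in text)
    return re.sub(r"[^\w\u4e00-\u9fff]+", "", text).lower()

words = jieba.lcut(preprocess("介绍一下电影哪吒"))
print(words)  # e.g. ['介绍', '一下', '电影', '哪吒'] -> {introduce, next, movie, Nezha}
```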
And S303, updating the incremental language model by using the word sequence.
In specific implementation, the word sequence is used to update the statistical probability of the corresponding word sequence in the incremental language model. For example, if the wrong text of the speech recognition error is "introduce movie Nazha" and the word sequence corresponding to the correct text submitted by the user is {introduce, next, movie, Nezha}, the statistical probability of the word sequence {introduce, next, movie, Nezha} in the incremental language model is increased, so that when the incremental language model is subsequently used for recognition, the probability of outputting the word sequence {introduce, next, movie, Nezha} rises and "introduce movie Nazha" is no longer output.
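As a toy illustration of raising a corrected word sequence's statistical probability, here is a count-based bigram sketch; a real incremental language model would be an n-gram model with smoothing, so this class is an assumption for illustration only:

```python
from collections import defaultdict

class ToyBigramLM:
    def __init__(self):
        self.unigram = defaultdict(int)
        self.bigram = defaultdict(int)

    def update(self, words):
        """Raise the statistical probability of a corrected word sequence."""
        for w1, w2 in zip(words, words[1:]):
            self.unigram[w1] += 1
            self.bigram[(w1, w2)] += 1
        self.unigram[words[-1]] += 1

    def prob(self, w1, w2):
        return self.bigram[(w1, w2)] / self.unigram[w1] if self.unigram[w1] else 0.0

lm = ToyBigramLM()
lm.update(["introduce", "next", "movie", "Nezha"])  # user-submitted correction
print(lm.prob("movie", "Nezha"))  # 1.0 in this toy model
```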
In specific implementation, if the incremental language model is updated, the voice recognition server acquires the updated incremental language model and loads the updated incremental language model into the dynamic decoder on line so as to cover the original incremental language model in the dynamic decoder.
In the embodiments of the present application, the loading of the incremental language model on line refers to loading the incremental language model into the dynamic decoder without restarting the dynamic decoder. In specific implementation, the latest incremental language model may be loaded into the dynamic decoder according to a set update period, or after each update of the incremental language model, the updated incremental language model may be loaded into the dynamic decoder in time.
Specifically, after each update of the incremental language model, the update server may push the updated model in real time to the dynamic decoder in the speech recognition server, where it is loaded online to complete the repair of the speech recognition error. Alternatively, the speech recognition server periodically sends a request to the update server to check whether the incremental language model has been updated; if it has, the speech recognition server obtains the updated incremental language model and loads it online into the dynamic decoder to complete the repair.
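A minimal sketch of the pull-based variant: the speech recognition server periodically asks the update server for the latest incremental language model and hot-swaps it into the running decoder; the `fetch_latest` interface and the version field are illustrative assumptions:

```python
import time

class DynamicDecoder:
    """Holds the pre-loaded search network and full-scale LM; only the
    incremental LM reference is replaced at runtime (hot loading)."""
    def __init__(self, search_net, full_lm, incr_lm):
        self.search_net, self.full_lm, self.incr_lm = search_net, full_lm, incr_lm

    def hot_load(self, new_incr_lm):
        self.incr_lm = new_incr_lm  # no restart: decoding keeps running

def poll_updates(decoder, fetch_latest, period_s=10.0):
    """fetch_latest() -> (version, model) from the update server (assumed interface)."""
    current = None
    while True:
        version, model = fetch_latest()
        if version != current:
            decoder.hot_load(model)  # overwrite the original incremental LM
            current = version
        time.sleep(period_s)

# run in a background thread, e.g.:
# threading.Thread(target=poll_updates, args=(decoder, fetch_latest), daemon=True).start()
```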
The voice recognition method provided by the embodiments of the application provides an error reporting entry for users. During voice recognition, a user can submit, through the error reporting entry, the correct text corresponding to a voice recognition error; the incremental language model is updated based on that text, and the updated model is then loaded into the dynamic decoder in real time by hot loading, so that voice recognition errors are repaired in real time during the voice recognition process. The decoder never needs to be restarted, so errors can be repaired without interrupting the voice recognition service; the voice recognition system learns from the errors corrected by users, the probability of the same errors appearing again is reduced, and the recognition accuracy of the voice recognition system is improved. In addition, because the data volume of the incremental language model is very small and only the incremental language model is updated and loaded, the repair speed can be greatly improved and second-level repair achieved, so that recognition errors temporarily discovered on site can be repaired quickly and in time.
In practical application, the time for updating the incremental language model can be selected according to the requirements of the application scene using the dynamic decoder on the speech recognition accuracy, the repair instantaneity and the like.
As a possible implementation, the incremental language model may be updated immediately after receiving the correct text uploaded by the user.
This approach suits application scenarios with high requirements on speech recognition accuracy and repair real-time performance. For example, during simultaneous interpretation, after finding a speech recognition error the user immediately reports the corresponding correct text to the update server; upon receiving it, the update server immediately updates the incremental language model based on that text and loads the updated model into the dynamic decoder as soon as the update completes, so that the reported error is repaired in time and the same error does not appear again in the rest of the interpretation session.
As another possible implementation manner, an update period may be set, and after the update period is reached, correct texts uploaded by the user in the update period are acquired, and the incremental language model is updated based on the correct texts.
The update period can be determined according to the speech recognition accuracy, repair real-time performance, and other requirements of the application scenario using the dynamic decoder. For example, if the requirements on accuracy and repair real-time performance are high, a short update period may be chosen, such as 10 seconds, 30 seconds, or 1 minute, so that speech recognition errors are repaired in time. If the requirements are low, a longer update period may be chosen, such as 10 minutes or 1 hour, avoiding the resource waste of frequent updates.
When no correct text submitted by a user is received within an update period, the incremental language model does not need to be updated.
As yet another possible implementation, the incremental language model may be updated based on the correct texts submitted by the user when the number of correct texts reaches a preset number.
The preset number can be determined according to the application scene of the used dynamic decoder, the requirements of speech recognition accuracy, repair instantaneity and the like. For example, if the requirement on the speech recognition accuracy and the repair real-time performance is high, a smaller preset number may be selected, for example, the preset number is 5, and after 5 correct texts submitted by the user are obtained, the incremental language model may be updated. If the requirements on the speech recognition accuracy and the repair real-time performance are low, a larger preset number can be selected, for example, the preset number is 100, so that frequent updating and resource waste can be avoided.
In specific implementation, the preset number can also take into account how many terminal devices use the dynamic decoder. If few terminal devices use it, a small preset number may be chosen so the incremental language model is updated as soon as possible: for example, when the dynamic decoder serves only a single user or a few users, it is used infrequently and speech recognition errors are reported infrequently, so too large a preset number would delay the repair of reported errors. If many terminal devices use the dynamic decoder, it is used frequently and errors are reported frequently, so a larger preset number may be chosen; an excessively large preset number is still not recommended, as it would likewise delay the repair of reported errors.
In practical application, for some application scenarios, there are some words with higher usage frequency, which are called high-frequency words. For example, in an application scenario of smart home, command words for controlling devices, such as "turn on air conditioner", "turn on light", "play music", and the like, which are frequently input by a user, may be set as high-frequency words. For another example, in a chat robot application scenario, keywords in some topics frequently chatted by a user may be set as high-frequency words, and also hot search words corresponding to hot topics on a current network may be set as high-frequency words.
In practical application, different language models are generally trained aiming at different application scenes, and a dynamic decoder suitable for the application scenes is obtained, so that the accuracy of speech recognition is improved. On the basis, aiming at a specific application scene, the accuracy of voice recognition can be improved by improving the probability of selecting high-frequency words. For this purpose, before step S203, the following steps are further included: and aiming at each decoding result in the at least one obtained decoding result, if the decoding result contains the high-frequency words in the high-frequency word list, increasing the score corresponding to the decoding result.
On this basis, referring to fig. 4, an embodiment of the present application provides a speech recognition method, including the following steps:
s401, obtaining a voice signal to be recognized.
Step S201 may be referred to in the specific implementation of step S401, and is not described again.
S402, dynamically decoding the acoustic features of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result.
Step S402 can refer to step S202 for details, which are not described again.
S403, for each decoding result among the at least one obtained decoding result, if the decoding result contains a high-frequency word in the high-frequency word list, increasing the score corresponding to the decoding result.
The high-frequency word list contains high-frequency words preset for the application scenario and is loaded into the dynamic decoder in advance. Specifically, for each of the obtained decoding results, if the decoding result contains a high-frequency word from the list, a corresponding score is added to the score of that decoding result according to how the decoding result matches the high-frequency words in the list.
In a specific implementation, if the decoding result exactly matches any high-frequency word in the list, its score is increased by a first score; if the decoding result does not exactly match any high-frequency word but contains one, its score is increased by a second score; and if the decoding result contains no high-frequency word from the list, its score is left unchanged. The first score is higher than the second score, and both values can be set according to actual application requirements.
For example, if the decoding result is "turn on the air conditioner" and the high-frequency word list contains the high-frequency word "turn on the air conditioner", the score of the decoding result is increased by the first score. If the decoding result is "search movie Nezha" and the list contains the hot search word "Nezha", the score is increased by the second score.
In a specific implementation, the score of a decoding result may instead be increased by multiplying it by a corresponding coefficient. As before, the coefficient is determined by how the decoding result matches the high-frequency words in the list: if the decoding result exactly matches any high-frequency word, a first coefficient is used; if it does not exactly match any high-frequency word but contains one, a second coefficient is used. The first coefficient is higher than the second, and both are greater than 1; for example, the first coefficient may be 1.5 and the second 1.2.
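Both scoring variants can be sketched as follows. The coefficient values are the illustrative ones given above (1.5 and 1.2); the additive score values are assumptions, since the embodiments leave them to the application; and the multiplicative variant assumes scores are positive with higher meaning better:

    def boost_score(decoding_result, score, high_freq_words,
                    first_score=2.0, second_score=1.0,
                    first_coeff=1.5, second_coeff=1.2, additive=True):
        """Raises the score of a decoding result that exactly matches
        (first score/coefficient) or merely contains (second
        score/coefficient) a high-frequency word."""
        exact = decoding_result in high_freq_words
        contains = any(w in decoding_result for w in high_freq_words)

        if additive:
            if exact:
                return score + first_score
            if contains:
                return score + second_score
            return score
        # Multiplicative variant: both coefficients are greater than 1.
        if exact:
            return score * first_coeff
        if contains:
            return score * second_coeff
        return score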
S404, determining the voice recognition result of the voice signal to be recognized from the at least one obtained decoding result according to the score corresponding to each decoding result.
For the specific implementation of step S404, refer to step S203; details are not repeated here.
By increasing the scores of decoding results that contain high-frequency words, the high-frequency word list raises the probability that such results are selected, thereby improving the accuracy of voice recognition.
During speech recognition, when a high-frequency word is misrecognized, the user can likewise submit the corresponding correct text (specifically, the correct high-frequency word) through the error-reporting entry, and the error is repaired by updating the incremental language model.
To improve the repair of high-frequency word misrecognition, in addition to the repair based on the incremental language model, the error can be repaired in the following manner, as shown in fig. 5:
S501, acquiring the correct high-frequency word submitted for a high-frequency word recognition error.
S502, if the submitted correct high-frequency word is not in the high-frequency word list, adding the submitted correct high-frequency word into the high-frequency word list.
In a specific implementation, if the correct high-frequency word submitted by the user is already in the high-frequency word list, it does not need to be added again.
In practical applications, once the high-frequency word list has been updated, the update server can push it in real time to the dynamic decoder in the speech recognition server; the updated list is loaded into the dynamic decoder online, completing the repair of the recognition error.
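A minimal sketch of steps S501 and S502 together with the push step follows; push_fn stands in for whatever transport (an RPC call, a message queue, and so on) the update server uses to reach the speech recognition servers, which the embodiments do not prescribe:

    def update_high_freq_word_list(word_list, submitted_word, push_fn):
        """S501/S502: add the corrected high-frequency word if it is
        missing, then push the updated list to the dynamic decoders."""
        if submitted_word in word_list:
            return False              # already present, nothing to do
        word_list.add(submitted_word)
        push_fn(sorted(word_list))    # broadcast the updated list online
        return True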
In a specific implementation, the correct high-frequency word submitted for a high-frequency word recognition error can be acquired in either of the following two ways.
In the first mode, the correct high-frequency word submitted in real time by the user through the high-frequency word error-reporting entry is obtained.
In a specific implementation, two error-reporting entries are provided to the user: a common entry and a high-frequency word entry. When the user finds that a high-frequency word has been misrecognized, the correct high-frequency word is submitted through the high-frequency word entry; when the user finds a recognition error that is not a high-frequency word error, the correct text is submitted through the common entry. The high-frequency word list is updated from the corrections submitted through the high-frequency word entry, while the incremental language model is updated from both the corrections submitted through the high-frequency word entry and the correct texts submitted through the common entry, as sketched below.
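This routing can be sketched as follows; the entry identifiers and container types are illustrative only:

    def handle_error_report(entry, correct_text, word_list, lm_corrections):
        """Routes a user report. The high-frequency word entry feeds both
        the word list and the incremental language model corpus; the
        common entry feeds only the incremental language model corpus."""
        if entry == "high_freq":
            word_list.add(correct_text)         # no-op if already present
            lm_corrections.append(correct_text)
        elif entry == "common":
            lm_corrections.append(correct_text)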
In a specific implementation, the user can also submit new high-frequency words to the update server through the high-frequency word error-reporting entry in order to expand the high-frequency word list.
In the second mode, a high-frequency word recognition model is used to identify, from the correct texts submitted by users in real time, the correct high-frequency words submitted for high-frequency word recognition errors.
When the correct high-frequency words are acquired in the second mode, only one error-reporting entry needs to be provided to the user: both high-frequency word errors and common errors are submitted through that single entry.
The high-frequency word recognition model is a model, obtained by neural network training, that determines whether an input text is a high-frequency word. It is essentially a binary classification model and, referring to fig. 6, can be trained in the following manner:
S601, obtaining a corpus sample set, where each corpus sample comprises a corpus annotated with a category label indicating whether the corpus is a high-frequency word.
The corpus sample set includes both corpora that are high-frequency words and corpora that are not; for example, the former may be labeled "1" and the latter "0".
S602, performing word segmentation on each corpus sample to obtain the word segments corresponding to that sample.
S603, inputting, for each corpus sample, the word sequence formed by its word segments into the neural network to obtain a prediction of whether the corpus is a high-frequency word.
In this step, each corpus sample is segmented into word segments, and the word sequence of the sample is formed according to the position at which each segment appears in the sample. For example, for the corpus sample "introduce blue and white porcelain", segmentation yields the segments "introduce", "next", and "blue and white porcelain", and the word sequence formed according to the position of each segment in the sample is {introduce, next, blue and white porcelain}.
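For illustration only, the word sequence could be produced with an off-the-shelf Chinese segmenter such as jieba; the embodiments do not name a particular segmentation tool, and the exact segmentation may vary with the dictionary version:

    import jieba  # a common Chinese word segmenter, used here as an example

    corpus_sample = "介绍一下青花瓷"  # "introduce blue and white porcelain"
    word_sequence = jieba.lcut(corpus_sample)
    # Typical result: ['介绍', '一下', '青花瓷'], i.e. the word sequence
    # {introduce, next, blue and white porcelain} from the example above.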
In a specific implementation, an output of "1" indicates the prediction that the corpus is a high-frequency word, and an output of "0" indicates the prediction that the corpus is not a high-frequency word.
S604, adjusting the parameters of the neural network according to the category label of each corpus sample and the corresponding prediction.
In a specific implementation, a loss function measuring the deviation between a corpus sample's original category label and the prediction output by the neural network is computed, and the network parameters are then adjusted with a gradient descent algorithm to reduce the loss. Training stops when the model converges, and the network with the adjusted parameters serves as the high-frequency word recognition model. Convergence may be judged, for example, by checking whether the difference between the loss values of the last two training passes is smaller than a threshold.
The neural networks usable for training in the embodiments of the present application include, but are not limited to, the following types: Residual Networks (ResNet), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Neural Networks (DNN), Deep Belief Networks (DBN), and the like.
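A minimal sketch of steps S601 to S604 using a small PyTorch classifier follows; the architecture (embedding plus mean pooling), the hyperparameters, and the batch interface are assumptions made for exposition, not the claimed training procedure:

    import torch
    import torch.nn as nn

    class HighFreqWordClassifier(nn.Module):
        """Binary classifier: is the input word sequence a high-frequency word?"""

        def __init__(self, vocab_size, embed_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.fc = nn.Linear(embed_dim, 1)

        def forward(self, token_ids):
            # S603: embed the word segments and score the pooled sequence.
            pooled = self.embed(token_ids).mean(dim=1)
            return self.fc(pooled).squeeze(-1)  # raw logit

    def train(model, batches, max_epochs=50, conv_threshold=1e-4):
        loss_fn = nn.BCEWithLogitsLoss()  # category labels: 1 / 0 (S601)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # S604
        prev_loss = None
        for _ in range(max_epochs):
            total = 0.0
            for token_ids, labels in batches:
                optimizer.zero_grad()
                loss = loss_fn(model(token_ids), labels.float())
                loss.backward()   # gradient descent reduces the loss
                optimizer.step()
                total += loss.item()
            # Convergence test from the embodiments: stop when the loss
            # difference between two consecutive passes is below a threshold.
            if prev_loss is not None and abs(prev_loss - total) < conv_threshold:
                break
            prev_loss = total
        return model  # the adjusted network is the recognition model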
As shown in fig. 7, based on the same inventive concept as the voice recognition method, the embodiment of the present application further provides a voice recognition apparatus 70, which includes an obtaining module 701, a decoding module 702, and a determining module 703.
An obtaining module 701, configured to obtain a speech signal to be recognized.
A decoding module 702, configured to perform dynamic decoding on the acoustic features of the speech signal to be recognized by using a dynamic decoder, to obtain at least one decoding result and a score corresponding to each decoding result, where the dynamic decoder includes an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on a correct text corresponding to a speech recognition error submitted by a user in real time.
The determining module 703 is configured to determine, according to the score corresponding to each decoding result, a speech recognition result of the speech signal to be recognized from the at least one decoding result.
Optionally, the decoding module 702 is specifically configured to:
obtaining at least one decoding result and acoustic probability corresponding to each decoding result based on a search network;
obtaining a first language probability corresponding to each decoding result based on the full-scale language model;
obtaining a second language probability corresponding to each decoding result based on the incremental language model;
and respectively determining the score corresponding to each decoding result according to the acoustic probability, the first language probability and the second language probability corresponding to each decoding result.
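The embodiments do not fix a particular fusion formula for these three quantities. One common choice, given here purely as an illustrative sketch, is log-domain interpolation; the weights are assumptions:

    import math

    def combined_score(acoustic_prob, full_lm_prob, incr_lm_prob,
                       full_weight=0.7, incr_weight=0.3):
        """Fuses the acoustic probability with the two language model
        probabilities in the log domain. All inputs must be positive."""
        lm_term = (full_weight * math.log(full_lm_prob)
                   + incr_weight * math.log(incr_lm_prob))
        return math.log(acoustic_prob) + lm_term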
Optionally, the speech recognition apparatus provided in this embodiment of the present application further includes an update module, configured to acquire the updated incremental language model when the incremental language model is updated and to load the updated model into the dynamic decoder online.
Optionally, the determining module 703 is further configured to: before determining the voice recognition result from the at least one decoding result according to the scores, increase, for each decoding result that contains a high-frequency word from the high-frequency word list, the score corresponding to that decoding result.
Optionally, the determining module 703 is specifically configured to: if the decoding result contains a high-frequency word from the high-frequency word list, add a corresponding score to the score of the decoding result according to how the decoding result matches the high-frequency words in the list.
Optionally, the determining module 703 is specifically configured to: increase the score of the decoding result by a first score if the decoding result exactly matches any high-frequency word in the list; and increase it by a second score if the decoding result does not exactly match any high-frequency word in the list but contains one, where the first score is higher than the second score.
Optionally, the high-frequency word list is updated by: acquiring the correct high-frequency word submitted for a high-frequency word recognition error; and, if the submitted correct high-frequency word is not in the high-frequency word list, adding it to the list.
Optionally, the correct high-frequency word submitted for a high-frequency word recognition error is obtained by: using a high-frequency word recognition model, trained on a neural network to determine whether an input text is a high-frequency word, to identify the corrections corresponding to high-frequency word errors from the correct texts submitted by users in real time; or acquiring the correct high-frequency word submitted by the user in real time through the high-frequency word error-reporting entry.
The voice recognition apparatus and the voice recognition method provided in the embodiments of the present application are based on the same inventive concept and achieve the same beneficial effects, which are not repeated here.
Based on the same inventive concept as the voice recognition method, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 8, the electronic device 80 may include a processor 801 and a memory 802.
The processor 801 may be a general-purpose processor, such as a Central Processing Unit (CPU); a Digital Signal Processor (DSP); an Application-Specific Integrated Circuit (ASIC); a Field-Programmable Gate Array (FPGA) or other programmable logic device; a discrete gate or transistor logic device; or discrete hardware components. It may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware processor or by a combination of hardware and software modules within a processor.
The memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disc. More generally, the memory may be any medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory 802 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Referring to fig. 9, based on the same inventive concept as the above-mentioned speech recognition method, an embodiment of the present application provides a speech recognition system 90, which includes a speech acquisition device 901 and a speech recognition device 902.
And the voice acquisition device 901 is used for acquiring a voice signal to be recognized.
The speech recognition device 902 is configured to determine a speech recognition result of the speech signal to be recognized, which is collected by the speech collection device 901, by using the speech recognition method in any of the embodiments.
The voice collecting device 901 and the voice recognizing device 902 may be integrated in the same terminal device, for example, the voice collecting device 901 may be a microphone or a microphone array built in the terminal device, and the voice recognizing device 902 may be a processor inside the terminal device. Alternatively, the voice collecting device 901 is disposed in the terminal device, the voice recognition device 902 is disposed in the voice recognition server, and the terminal device and the voice recognition server communicate with each other through the communication network.
Referring to fig. 9, in particular implementation, the speech recognition system 90 further includes: and a display device 903 for displaying the voice recognition result. The display device 903 may be a display screen on the terminal device.
When the voice recognition system 90 is used, the terminal device collects the voice signal to be recognized through the voice collection device 901 and sends it to the voice recognition server; the voice recognition server determines the voice recognition result using the voice recognition method of any of the above embodiments and returns it to the terminal device; and the terminal device displays the result through the display device 903.
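As an illustrative sketch of the terminal side of this exchange (the HTTP transport, the endpoint URL, the response schema, and the display object are all assumptions; the embodiments only require some communication network):

    import requests

    RECOGNITION_URL = "http://speech-server.example/api/recognize"  # placeholder

    def recognize_and_display(audio_bytes, display):
        """Terminal side: send the captured signal, show the returned text."""
        resp = requests.post(
            RECOGNITION_URL,
            data=audio_bytes,
            headers={"Content-Type": "application/octet-stream"},
        )
        resp.raise_for_status()
        text = resp.json()["text"]  # assumed response schema
        display.show(text)          # hypothetical display wrapper
        return text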
In specific implementation, the voice collecting device 901 and the display device 903 may also be independent devices, for example, the voice collecting device 901 may be an external microphone, the display device 903 may be an external display screen, the voice recognition device 902 may be a voice recognition server, and the external microphone and the external display screen are both in communication with the voice recognition server through a communication network.
The voice recognition system can be applied to various voice recognition products. For example, it may serve as an intelligent voice transcription system that provides a real-time transcription service, quickly converting speech into text for output. In a court trial, for instance, the system can transcribe the speech of the trial participants into text in real time and display it through the display device, while the court clerk corrects and edits the transcription in real time, quickly completing the electronic record of the hearing.
On the basis of any of the above embodiments, the speech recognition system 90 further includes a control device 904, configured to determine response data corresponding to the speech recognition result, and control the controlled device to execute the response data.
The controlled device in the embodiment of the present application includes, but is not limited to, a smart appliance (such as a smart air conditioner, a smart television, a smart lamp, etc.), a smart robot, and other devices.
The response data in the embodiments of the present application include, but are not limited to, text data, audio data, image data, video data, voice broadcasts, and control instructions. The control instructions include, but are not limited to: instructions for adjusting parameters of the controlled device (such as adjusting the temperature of a smart air conditioner or turning off a smart lamp), instructions for controlling the controlled device to display expressions, and instructions for controlling the motion of the controlled device's actuated parts (such as leading the way, navigating, photographing, or dancing).
The voice recognition system including the control device 904 can be applied to various voice recognition products, for example scenarios in which a smart device (i.e., the controlled device), such as a smart home appliance or a smart robot, is controlled through voice instructions. Specifically, the voice collecting device 901 may be a microphone in the terminal device or the controlled device, the voice recognition device 902 may reside in the voice recognition server, and the control device 904 may be the instruction parsing server; the terminal device, the controlled device, the voice recognition server, and the instruction parsing server communicate through a communication network. The terminal device or the controlled device collects, through a built-in microphone, the voice signal corresponding to a voice instruction input by the user and sends it to the voice recognition server; the voice recognition server determines the voice recognition result using the method of any of the above embodiments and sends it to the instruction parsing server; the instruction parsing server determines the response data corresponding to the recognition result and sends it to the controlled device; and the controlled device executes the response data, so that the smart device is controlled through voice instructions.
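At a much-simplified level, the instruction parsing server's job can be illustrated as a lookup from recognized command words to device actions; the table and tuple format below are hypothetical:

    # Hypothetical command table for the smart-home scenario.
    COMMANDS = {
        "turn on air conditioner": ("air_conditioner", "power_on"),
        "turn on light": ("light", "power_on"),
        "play music": ("speaker", "play"),
    }

    def parse_instruction(recognition_result):
        """Maps a recognized command word to (device, action) response data,
        or returns None when no command matches."""
        return COMMANDS.get(recognition_result)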
The voice recognition system including the control device 904 can also be used in various voice interaction scenarios, such as intelligent robots and intelligent customer service. Specifically, the voice collecting device 901 may be a microphone in the terminal device, the voice recognition device 902 may reside in the voice recognition server, and the control device 904 may be the semantic parsing server; the terminal device, the voice recognition server, and the semantic parsing server communicate through a communication network. The terminal device collects, through a built-in microphone, the voice signal corresponding to the voice input by the user and sends it to the voice recognition server; the voice recognition server determines the voice recognition result using the method of any of the above embodiments and sends it to the semantic parsing server; the semantic parsing server performs semantic analysis on the recognition result through methods such as natural language understanding, determines the corresponding response data according to the analysis result, and sends the data to the terminal device; and the terminal device outputs the response data, thereby achieving human-machine interaction that helps the user or chats with the user.
An embodiment of the present application further provides a computer-readable storage medium for storing the computer program instructions used by the above electronic device, including a program for executing the above voice recognition method.
The computer storage medium may be any available medium or data storage device accessible to a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MO)), optical memory (e.g., CD, DVD, BD, HVD), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state disks (SSD)).
The above embodiments are intended only to describe the technical solutions of the present application in detail. They are provided to help understand the method of the embodiments and should not be construed as limiting those embodiments. Modifications and substitutions readily apparent to those skilled in the art are intended to fall within the scope of the embodiments of the present application.

Claims (12)

1. A speech recognition method, comprising:
acquiring a voice signal to be recognized;
dynamically decoding the acoustic characteristics of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, wherein the dynamic decoder comprises an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on correct text corresponding to a voice recognition error submitted by a user in real time;
respectively aiming at each decoding result in the at least one decoding result, if the decoding result contains the high-frequency words in the high-frequency word list, increasing the score corresponding to the decoding result;
and determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result.
2. The method according to claim 1, wherein the dynamically decoding the acoustic features of the speech signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result includes:
obtaining the at least one decoding result and the acoustic probability corresponding to each decoding result based on a search network;
obtaining a first language probability corresponding to each decoding result based on the full-scale language model;
obtaining a second language probability corresponding to each decoding result based on the incremental language model;
and respectively determining the score corresponding to each decoding result according to the acoustic probability, the first language probability and the second language probability corresponding to each decoding result.
3. The method of claim 1, further comprising:
and if the incremental language model is updated, acquiring the updated incremental language model, and loading the updated incremental language model into the dynamic decoder on line.
4. The method according to claim 1, wherein if the decoding result includes a high-frequency word in the high-frequency word list, increasing a score corresponding to the decoding result, specifically comprising:
and if the decoding result contains the high-frequency words in the high-frequency word list, adding corresponding scores to the scores of the decoding result according to the matching condition of the decoding result and the high-frequency words in the high-frequency word list.
5. The method according to claim 4, wherein the increasing the score of the decoding result by a corresponding score according to the matching condition of the decoding result and the high-frequency word in the high-frequency word list specifically comprises:
if the decoding result is consistent with any high-frequency word in the high-frequency word list, the score of the decoding result is increased by a first score;
and if the decoding result is inconsistent with all the high-frequency words in the high-frequency word list and the decoding result contains the high-frequency words in the high-frequency word list, increasing a second score by the score of the decoding result, wherein the first score is higher than the second score.
6. The method of claim 1, wherein the list of high frequency words is updated by:
acquiring correct high-frequency words submitted aiming at high-frequency word recognition errors;
and if the submitted correct high-frequency word is not in the high-frequency word list, adding the submitted correct high-frequency word into the high-frequency word list.
7. The method of claim 6, wherein the high-frequency words submitted for the high-frequency word recognition error are obtained by:
identifying correct high-frequency words submitted by a high-frequency word identification error from correct texts submitted by a user in real time by using a high-frequency word identification model, wherein the high-frequency word identification model is a model for identifying whether the input texts are high-frequency words or not based on neural network training;
or,
and acquiring the correct high-frequency words submitted by the user in real time through the high-frequency word error reporting entry.
8. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring a voice signal to be recognized;
the decoding module is used for dynamically decoding the acoustic characteristics of the voice signal to be recognized by using a dynamic decoder to obtain at least one decoding result and a score corresponding to each decoding result, the dynamic decoder comprises an online loaded incremental language model, and the incremental language model is obtained by updating the incremental language model based on a correct text corresponding to a voice recognition error submitted by a user in real time;
the determining module is used for respectively aiming at each decoding result in the at least one decoding result, and increasing the score corresponding to the decoding result if the decoding result contains the high-frequency words in the high-frequency word list; and determining the voice recognition result of the voice signal to be recognized from the at least one decoding result according to the score corresponding to each decoding result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
11. A voice recognition system is characterized by comprising a voice acquisition device and a voice recognition device;
the voice acquisition device is used for acquiring a voice signal to be recognized;
the voice recognition device is used for determining a voice recognition result of the voice signal to be recognized by adopting the method of any one of claims 1 to 7.
12. The system of claim 11, further comprising:
display means for displaying a voice recognition result; and/or
And the control device is used for determining response data corresponding to the voice recognition result and controlling the controlled equipment to execute the response data.
CN201910837038.7A 2019-09-05 2019-09-05 Voice recognition method, device, electronic equipment, system and storage medium Active CN110473531B (en)
