WO2020156342A1

WO2020156342A1 - Voice recognition method and device, electronic device and storage medium

Info

Publication number: WO2020156342A1
Application number: PCT/CN2020/073328
Authority: WO
Inventors: 王杰; 钟贵平; 李宝祥; 吴本谷; 陈江
Original assignee: 北京猎户星空科技有限公司
Priority date: 2019-01-30
Filing date: 2020-01-20
Publication date: 2020-08-06
Also published as: TWI752406B; CN111508497A; TW202032534A; CN111508497B

Abstract

Provided are a voice recognition method and device, an electronic device and a storage medium. The method comprises: acquiring an input voice and a user ID corresponding to the input voice (S201); searching, according to the user ID, the optimal path corresponding to the input voice in a decoding network, wherein each path between word nodes in the decoding network is marked with the user ID (S202); and determining text information corresponding to the input voice according to the optimal path (S203). The voice recognition method is based on a set of decoding networks, can provide personalized voice recognition services for users, and can greatly save hardware resources.

Description

Speech recognition method, device, electronic equipment and storage medium

Cross references to related applications

This application claims priority to Chinese patent application No. 201910094102.7, entitled "Voice recognition method, device, electronic equipment and storage medium" and filed with the Chinese Patent Office on January 30, 2019, the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of speech recognition technology, and in particular to a speech recognition method, device, electronic equipment, and storage medium.

Background technique

A speech recognition system mainly comprises an acoustic model, a language model and a decoder. The accuracy of speech recognition depends mainly on the language model. As users' demands for personalization grow, different language models need to be trained for different users to provide dedicated speech recognition services. At present, a personalized language model is trained by using the user's own corpus to adapt a general language model into a user-specific language model; a dedicated speech recognition service is then deployed for each user, and the language model is updated periodically to meet the user's individual needs.

Summary of the invention

The embodiments of the present application provide a voice recognition method, device, electronic equipment, and storage medium to solve the problem in the prior art that, in order to meet users' personalization needs, a dedicated voice recognition service must be deployed for each user, which seriously wastes resources.

In the first aspect, an embodiment of the present application provides a voice recognition method, including:

Obtain the input voice and the user ID corresponding to the input voice;

According to the user ID, search the optimal path corresponding to the input voice in the decoding network, and the path between each word node in the decoding network is marked with the user ID;

Determine the text information corresponding to the input voice according to the optimal path.

Optionally, the searching for the optimal path corresponding to the input voice in the decoding network according to the user ID includes:

Determine the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.

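Optionally, the searching for the optimal path corresponding to the input voice in the decoding network according to the user ID includes:

Obtaining the language model corresponding to the user ID according to the user ID;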

According to the language model corresponding to the user ID, search for the optimal path corresponding to the input voice in the decoding network.

Optionally, the decoding network is constructed based on a full dictionary.

Optionally, update the language model corresponding to the user ID in the following manner:

Determine that the language model corresponding to the user ID needs to be updated;

Update the language model according to the corpus data in the corpus corresponding to the user ID, and determine the latest probability score corresponding to the path between the word nodes in the decoding network;

According to the latest probability score, update the probability score, corresponding to the user ID, that is marked on the path between the corresponding word nodes in the decoding network.

Optionally, the determining that the language model corresponding to the user ID needs to be updated includes:

Detecting whether the corpus corresponding to the user ID has been updated;

If the corpus corresponding to the user ID is updated, it is determined that the language model corresponding to the user ID needs to be updated.

Optionally, the detecting whether the corpus corresponding to the user ID has been updated includes:

Calculating the first summary value of all corpora in the corpus corresponding to the user ID;

Compare the first summary value with a second summary value; if they are not the same, confirm that the corpus corresponding to the user ID has been updated. The second summary value is the summary value of all corpora in the corpus corresponding to the user ID after the most recent update.
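The comparison of summary values described above can be sketched as follows. This is a minimal illustration only: the use of SHA-256, the sorting of corpus entries and the function names `corpus_digest` and `corpus_updated` are assumptions, since the application does not name a specific digest algorithm.

```python
import hashlib


def corpus_digest(sentences):
    """Return one digest over all corpus entries of a user."""
    h = hashlib.sha256()  # assumed algorithm; any stable digest would do
    # Sort so that the digest does not depend on storage order (an assumption).
    for sentence in sorted(sentences):
        h.update(sentence.encode("utf-8"))
        h.update(b"\n")  # separator so that "ab","c" differs from "a","bc"
    return h.hexdigest()


def corpus_updated(sentences, stored_digest):
    """True if the current digest differs from the one stored at the last update."""
    return corpus_digest(sentences) != stored_digest


# Second summary value: digest stored after the most recent update.
second_value = corpus_digest(["play some music", "turn on the light"])

print(corpus_updated(["turn on the light", "play some music"], second_value))
print(corpus_updated(["play some music", "turn off the light"], second_value))
```

Comparing one digest per user avoids comparing the corpora entry by entry each time the update check runs.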

Optionally, after determining that the language model corresponding to the user ID needs to be updated, the method further includes:

According to the frequency of each word node in the decoding network in the corpus corresponding to the user ID, obtain the appearance frequency score of each word node corresponding to the user ID;

For each phoneme node in the decoding network, select the maximum value among the appearance frequency scores, corresponding to the user ID, of the target word nodes corresponding to the phoneme node, and determine it as the latest look-ahead probability, corresponding to the user ID, of the path from the phoneme node to the target word nodes;

According to the latest look-ahead probability, the look-ahead probability corresponding to the user ID of the path from the phoneme node to the target word node in the decoding network is updated.

Optionally, according to the frequency of each word node in the decoding network in the corpus corresponding to the user ID, obtaining the appearance frequency score corresponding to each word node includes:

Determine the frequency with which each word node in the decoding network that corresponds to a word in the corpus of the user ID occurs in that corpus;

For each such word node, normalize its frequency to obtain the appearance frequency score corresponding to the word node.
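As a rough sketch of the two steps above, the following counts how often each word node occurs in a user's corpus and normalizes the counts. The application does not state which normalization is used; dividing by the total count (relative frequency) is an assumption, as are the function name `appearance_scores` and the sample data.

```python
from collections import Counter


def appearance_scores(corpus_words, word_nodes):
    """Relative frequency of each word node that occurs in the user's corpus."""
    # Only words that exist as word nodes in the decoding network are counted.
    counts = Counter(w for w in corpus_words if w in word_nodes)
    total = sum(counts.values())
    # Assumed normalization: divide each count by the total number of counted words.
    return {w: n / total for w, n in counts.items()} if total else {}


word_nodes = {"in", "Beijing", "Suzhou", "Jiangsu"}
corpus_words = ["in", "Beijing", "in", "Suzhou", "hello"]

scores = appearance_scores(corpus_words, word_nodes)
print(scores)
```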

In the second aspect, an embodiment of the present application provides a voice recognition device, including:

The obtaining module is used to obtain the input voice and the user ID corresponding to the input voice;

The decoding module is used to search for the optimal path corresponding to the input voice in the decoding network according to the user ID, and the path between the word nodes in the decoding network is marked with the user ID;

The determining module is used to determine the text information corresponding to the input voice according to the optimal path.

Optionally, the decoding module is specifically configured to: determine the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.

Optionally, the decoding module is specifically configured to:

Optionally, the decoding network is constructed based on a full dictionary.

Optionally, it also includes a model update module for:

Optionally, the model update module is specifically configured to:

Detecting whether the corpus corresponding to the user ID has been updated;

Optionally, the model update module is specifically configured to:

Optionally, the model update module is also used to:

Optionally, the model update module is specifically configured to:

In the third aspect, an embodiment of the present application provides an electronic device, including a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the transceiver is used to receive and send data under the control of the processor, and the processor implements the steps of any of the above methods when executing the program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having computer program instructions stored thereon, and when the program instructions are executed by a processor, the steps of any of the above methods are implemented.

In a fifth aspect, the present application also provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions that, when executed by a processor, implement the steps of any of the above voice recognition methods.

The technical solution provided by the embodiments of the present application marks user IDs on the paths between the word nodes in the constructed decoding network. When the decoding network is used to recognize speech, only the paths marked with the given user ID are searched, the optimal path is selected from the paths found, and the text information corresponding to the input voice is determined according to the optimal path, so that different users can obtain different recognition results based on the same decoding network. Therefore, only one decoding network needs to be deployed on the server side. The decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings that need to be used in the embodiments of the present application. Obviously, the drawings described below are only some embodiments of the present application. A person of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic diagram of an application scenario of a speech recognition method provided by an embodiment of the application;

FIG. 2 is a schematic flowchart of a voice recognition method provided by an embodiment of this application;

FIG. 3 is an example of a local network in a decoding network provided by an embodiment of this application;

FIG. 4 is an example of a path between word nodes in a decoding network provided by an embodiment of the application;

FIG. 5 is another example of a local network in a decoding network provided by an embodiment of this application;

FIG. 6 is an example of a local network in a decoding network constructed based on language models of multiple users according to an embodiment of the application;

FIG. 7 is a schematic flowchart of a method for updating a language model corresponding to a user ID according to an embodiment of the application;

FIG. 8 is a schematic structural diagram of a speech recognition device provided by an embodiment of this application;

FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.

Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the drawings in the embodiments of the present application.

To facilitate understanding, the terms involved in the embodiments of this application are explained below:

The purpose of a language model (LM) is to establish a distribution that describes the probability of occurrence of a given word sequence in a language. In other words, a language model describes the probability distribution of words and can reliably reflect the probability distribution of the words used in speech recognition. Language models occupy an important position in natural language processing and are widely used in speech recognition, machine translation and other fields. For example, in speech recognition a language model can be used to pick the most likely word sequence among multiple candidate word sequences, or, given several words, to predict the most likely next word. Commonly used language models include the N-Gram LM (N-gram language model), the Bi-Gram LM (bigram language model) and the Tri-Gram LM (trigram language model).

An acoustic model (AM) is one of the most important parts of a speech recognition system. It is a model that classifies the acoustic features of speech into phonemes. Current mainstream systems mostly use hidden Markov models for this modeling.

A dictionary is a collection of phonemes corresponding to words and describes the mapping relationship between words and phonemes.

A phoneme (phone) is the smallest unit of speech. It is analyzed based on the pronunciation actions in a syllable: one action constitutes one phoneme. Phonemes in Chinese are divided into two categories, initials and finals. For example, initials include b, p, m, f, d, t, etc., and finals include a, o, e, i, u, ü, ai, ei, ao, an, ian, ong, iong, etc. Phonemes in English are divided into vowels and consonants. For example, vowels include a, e, ai, etc., and consonants include p, t, h, etc.

Look-ahead probability: to avoid pruning paths with low acoustic scores midway through decoding, a technique known as language model look-ahead is generally adopted, in which the occurrence probability scores obtained from the language model, which represent how frequently each word occurs, are distributed back onto the branches of the network. That is, the occurrence probability scores of the word nodes are introduced on the paths from a phoneme node to the word nodes in the decoding network, and the maximum of these occurrence probability scores is used as the look-ahead probability on the paths from that phoneme node to all the word nodes it can reach. When the score of the path from a phoneme node to a word node is calculated, the look-ahead probability is added to the path score. This can significantly raise the score of paths that have a low acoustic score but a high probability score, so that such paths are not cut during pruning.
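The look-ahead rule above (take the maximum occurrence probability score over all word nodes reachable from a phoneme node) can be sketched as follows. The node names, the scores and the dictionary-based graph layout are hypothetical and chosen only for illustration; word nodes are treated as leaves in this sketch.

```python
def look_ahead(children, word_score, node):
    """Maximum occurrence score of any word node reachable from `node`."""
    if node in word_score:  # word nodes are leaves in this sketch
        return word_score[node]
    return max(look_ahead(children, word_score, c) for c in children[node])


# Hypothetical fragment: phoneme nodes "k", "k-a", "k-ai" and two word nodes.
children = {
    "k": ["k-a", "k-ai"],
    "k-a": ["ka"],
    "k-ai": ["kai"],
}
word_score = {"ka": 0.2, "kai": 0.7}

for phoneme_node in children:
    print(phoneme_node, look_ahead(children, word_score, phoneme_node))
```

Here the shared node "k" receives the score of its best reachable word, so a hypothesis passing through "k" is not pruned merely because the word identity is not yet known.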

Any number of elements in the drawings is for example rather than limitation, and any naming is only used for distinction and does not have any limiting meaning.

In the specific practice process, the training method of personalized language model is to use the user's own corpus to train the general language model to generate a user-specific language model, and deploy a set of special speech recognition services for each user. The language model is periodically updated to meet the individual needs of users. However, deploying a set of specialized speech recognition services for each user will cause a serious waste of resources and generate huge expenses.

To this end, the inventors of the present application propose marking user IDs on the paths between the word nodes in the constructed decoding network. When the decoding network is used to recognize speech, only the paths marked with the given user ID are searched, the optimal path is selected from the paths found, and the text information corresponding to the input voice is determined according to the optimal path, so that different users can obtain different recognition results based on the same decoding network. Therefore, only one decoding network needs to be deployed on the server side. The decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.

In addition, the decoding network is constructed from a full dictionary, so that it can be applied to multiple users. When a new user is added, there is no need to rebuild the decoding network or restart the decoder, so new users can be added online, users can obtain voice recognition services without interruption, and the user experience is improved. A decoding network constructed from the full dictionary also allows the language model of each user to be updated online. When a user's language model needs to be updated, it is only necessary to recalculate, according to the user's updated language model, the probability scores of the paths between the word nodes in the decoding network, and to update that user's probability scores in the decoding network based on the user ID. The changes brought by the updated language model are thereby introduced into the decoding network, and the decoder performs the path search on the decoding network with the updated probability scores to obtain a recognition result that meets the user's personalized needs. Therefore, only one decoder needs to be deployed on the server side to train a dedicated language model for each user, provide personalized speech recognition services, and update each user's language model online and in time, while ensuring that users can obtain voice recognition services without interruption.

After introducing the basic principles of this application, various non-limiting implementations of this application are specifically introduced below.

First, refer to FIG. 1, which is a schematic diagram of an application scenario of the voice recognition method provided by an embodiment of the application. Multiple users 10 jointly use the voice recognition service provided by the decoder in the same server 12. During the interaction between a user 10 and a smart device 11, the smart device 11 sends the voice signal input by the user 10 to the server 12. The server 12 decodes the voice signal through the decoding network in the decoder to obtain the text information corresponding to the voice signal, and feeds the decoded text information back to the smart device 11 to complete the voice recognition service.

In this application scenario, the smart device 11 and the server 12 communicate through a network, which may be a local area network, a wide area network, or the like. The smart device 11 can be a smart speaker, a robot, etc., a portable device (for example, a mobile phone, a tablet, a laptop, etc.), or a personal computer (PC), and the server 12 can be any server device capable of providing a voice recognition service.

The technical solution provided by the embodiment of the present application will be described below in conjunction with the application scenario shown in FIG. 1.

Referring to FIG. 2, an embodiment of the present application provides a voice recognition method, including the following steps:

S201: Acquire an input voice and a user ID corresponding to the input voice.

In specific implementation, the smart terminal can send the collected input voice and user ID to the server, and the server performs voice recognition on the input voice according to the user ID. In this embodiment, one user ID corresponds to one language model, and the corpus corresponding to each user ID is used to train each user-specific language model.

The user ID in this embodiment may be enterprise-level: the user ID identifies an enterprise, each enterprise corresponds to one language model, and the smart devices of that enterprise all use that language model. The user ID may also be device-level: the user ID identifies a type of device or an individual device, and each type of device or each device corresponds to one language model. For example, a smart speaker corresponds to a language model about music and a chat robot corresponds to a language model about chat, so that different devices can use the same decoding network. The user ID may also be service-level: each service corresponds to one language model, and the smart devices under that service use that language model. Other granularities are possible. The embodiments of the present application do not limit the specific implementation of the user ID, which can be configured according to the actual application scenario or requirements.

S202: According to the user ID, search the decoding network for the optimal path corresponding to the input voice, where the paths between the word nodes in the decoding network are marked with user IDs.

In this embodiment, multiple user IDs share a decoding network. The decoding network is a network diagram representing the relationship between phonemes and words and between words and words.

In order to realize that multiple users share a decoding network, the decoding network can be constructed based on the acoustic model and the corpus and language model corresponding to the multiple users. The specific construction method is as follows:

The first step is to obtain, from the corpus corresponding to each user ID, a dictionary containing all the words in the corpora, and to convert the words in the dictionary into phoneme strings. For example, the phoneme string of "开" ("open") is "k-ai" and the phoneme string of "Beijing" is "b-ei-j-ing". The phoneme string of a word followed by the word itself forms a path. For example, the path corresponding to "开" is "k-ai-开" and the path corresponding to "Beijing" is "b-ei-j-ing-Beijing".

The second step is to merge the nodes in the paths corresponding to all the words in the dictionary, that is, to merge the same phonemes in the paths into one node, forming a network of the phoneme strings corresponding to all the words. Each phoneme serves as one phoneme node in the network.

FIG. 3 shows an example of a partial network in the decoding network, in which the "k" at the start of the phoneme strings of words such as "ka", "kai" and "ke" is merged into one node of the network. The last node of each path in the network is the word corresponding to the phoneme string formed by the phonemes on that path. As shown in FIG. 3, the path "k-a-ka" corresponds to the word "ka", and the path "k-a-ch-e-truck" corresponds to the word "truck".

For the convenience of description, in this embodiment, the node corresponding to the phoneme in the decoding network is called the phoneme node, and the node corresponding to the vocabulary is called the word node.

Since a large number of the same nodes are merged together, the size of the search space can be significantly reduced, and the amount of computation in the decoding process can be reduced. The method for generating a decoding network based on a dictionary is an existing technology and will not be described in detail.
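The first two construction steps can be illustrated with a small prefix tree. This is only a sketch of the general idea of merging shared phoneme prefixes, not the construction used in any particular decoder; the nested-dictionary layout, the function names and the four-word lexicon (romanized as in the FIG. 3 description) are assumptions.

```python
def build_network(lexicon):
    """Merge shared phoneme prefixes; each path ends in a word node."""
    root = {"phonemes": {}, "words": []}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            # Reuse the phoneme node if this prefix already exists.
            node = node["phonemes"].setdefault(p, {"phonemes": {}, "words": []})
        node["words"].append(word)  # word node at the end of the path
    return root


def count_phoneme_nodes(node):
    """Number of phoneme nodes below `node`."""
    return sum(1 + count_phoneme_nodes(child) for child in node["phonemes"].values())


lexicon = {
    "ka": ["k", "a"],
    "kai": ["k", "ai"],
    "ke": ["k", "e"],
    "truck": ["k", "a", "ch", "e"],
}
network = build_network(lexicon)
print(count_phoneme_nodes(network))
```

Without merging, the four phoneme strings would need ten phoneme nodes in total; with shared prefixes merged, the sketch builds six.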

The third step is to determine the acoustic score between the connected phoneme nodes in the decoding network constructed in the second step according to the acoustic model.

In this embodiment, multiple users can share one acoustic model.

The fourth step is, for each user ID, to determine the connection relationships and probability scores between the words in the dictionary according to the language model of that user ID, to establish the connections between words in the decoding network constructed in the second step according to those connection relationships, and to mark the user ID and that user's probability score on the paths between word nodes.

In specific implementation, the conditional probability p(W₂|W₁) that a word W₂ appears after a word W₁ can be determined from the language model, and p(W₂|W₁) is taken as the probability score of the path from word W₁ to word W₂.
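As a minimal sketch of how such a conditional probability could be obtained, the following estimates p(W₂|W₁) from word-pair counts in a tiny corpus. The unsmoothed count ratio, the function name `bigram_scores` and the two sample sentences are assumptions for illustration; a real language model would normally apply smoothing.

```python
from collections import Counter


def bigram_scores(sentences):
    """Estimate p(w2 | w1) = count(w1, w2) / count(w1 as a predecessor)."""
    pair_counts = Counter()
    first_counts = Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical corpus of one user, already segmented into words.
corpus = [
    ["I", "home", "in", "Beijing"],
    ["I", "home", "in", "Suzhou"],
]
scores = bigram_scores(corpus)
print(scores[("in", "Beijing")])
```

Each resulting pair would correspond to one path between two word nodes, on which the user ID and this score are marked.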

For example, if the corpus used to train the language model includes the sentence "My home is in Beijing" (我家在北京), whose words are "I" (我), "home" (家), "in" (在) and "Beijing" (北京), then in the decoding network a connection is established between the word nodes "I" and "home", between "home" and "in", and between "in" and "Beijing", and the probability scores between "I" and "home", "home" and "in", and "in" and "Beijing" are then determined according to the language model. FIG. 4 is an example of the paths between word nodes in the decoding network; it omits the network relationship between phoneme nodes and word nodes. It should be noted that the actual connection between word nodes in the decoding network is as shown in FIG. 5: the word node "I" is connected to the first phoneme node of "home". SA₁, SA₂ and SA₃ represent acoustic scores, SL₁ represents the probability score of the path from the word node "I" to "home" corresponding to user ID₁, and SL₂ represents the probability score of the path from the word node "I" to "home" corresponding to user ID₂.

Through the fourth step, the probability score of each user ID is marked on the corresponding paths in the decoding network, so that during decoding the paths available to a user can be selected according to the user ID, and the optimal path for the input voice can be determined based on the probability scores on those paths.

Through the above four steps, a decoding network that can be used by multiple users can be obtained. Pre-loading the constructed decoding network into the decoder of the server can provide voice recognition services for these multiple users.

S203: Determine text information corresponding to the input voice according to the optimal path.

Based on any of the above embodiments, the process of speech recognition includes: preprocessing the speech signal, extracting the acoustic feature vector of the speech signal, and inputting the acoustic feature vector into the acoustic model to obtain the phoneme sequence; then, based on the phoneme sequence and the user ID corresponding to the speech signal, searching the decoding network for the path with the highest score as the optimal path, and determining the text sequence corresponding to the optimal path as the recognition result of the voice signal. The optimal path is determined according to the total score of each path, and the total score of a path is determined according to the acoustic scores on the path and the probability scores corresponding to the user ID. Specifically, the decoding score of a path can be calculated by the following formula:

$$\mathrm{Score}(L) = \sum_{i} \log SA_i + \sum_{j} \log SL_{j,x}$$

where L is a decoding path, SA_i is the i-th acoustic score on the path L, and SL_{j,x} is the j-th probability score on the path L corresponding to the user whose user ID is x. Taking FIG. 5 as an example, the score of the decoding result "my home" corresponding to user ID₁ is (log SA₁ + log SA₂ + log SA₃ + log SL₁).
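Following the FIG. 5 example, in which the score of a path is the sum of the logarithms of its acoustic scores and of the probability scores of the given user ID, the calculation can be sketched as below. The numeric values, the user ID strings and the function name `path_score` are hypothetical.

```python
import math


def path_score(acoustic_scores, lm_scores_by_user, user_id):
    """Sum of log acoustic scores plus log probability scores of one user."""
    total = sum(math.log(sa) for sa in acoustic_scores)
    # Only the probability scores marked with this user ID are used.
    total += sum(math.log(scores[user_id]) for scores in lm_scores_by_user)
    return total


# Hypothetical values for the FIG. 5 fragment: three acoustic scores and
# one word-to-word path carrying a score for each of two user IDs.
acoustic = [0.9, 0.8, 0.7]
lm = [{"ID1": 0.5, "ID2": 0.1}]

print(path_score(acoustic, lm, "ID1") > path_score(acoustic, lm, "ID2"))
```

The same acoustic evidence therefore yields different total scores for different user IDs, which is what allows one network to return user-specific results.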

In the method of the embodiment of the present application, user IDs are marked on the paths between the word nodes in the decoding network. When paths are searched, the paths available to a user are selected according to the user IDs on the paths, so that different users can obtain different recognition results based on the same decoding network. FIG. 6 is a partial example of a decoding network generated from the language models of multiple users. Due to space limitations, some phoneme nodes are not shown in FIG. 6. Taking FIG. 6 as an example, when the voice signal of user ID₁ is recognized, the path between the word nodes "in" and "Beijing" is marked with "ID₁", so the selected path is "in-Beijing" and the other two paths in FIG. 6 are not selected; when the voice signal of user ID₂ is recognized, the candidate paths are the two paths marked with ID₂, "in-Suzhou" and "in-Jiangsu".

Therefore, the speech recognition method of the embodiment of the present application only needs to deploy one decoding network on the server side. The decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.

As a possible implementation, step S202 specifically includes: determining the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.

Specifically, the language models of different users yield different probability scores, and for the same path different probability scores can lead to completely different recognition results. Therefore, in the embodiment of the present application, user IDs are used in the decoding network to distinguish the probability scores of different users, so that multiple users can share one decoding network. During decoding, according to the user ID of the user currently using the decoding network, the probability scores marked with that user ID on the paths of the decoding network are taken to calculate the total score of each path, the path with the highest total score is selected as the optimal path, and the speech recognition result is obtained from the words corresponding to the word nodes on the optimal path. Referring to FIG. 6, "ID₁" and "SL₁" are marked between "in" and "Beijing", indicating that only user ID₁ can use this path during decoding and that the corresponding probability score is SL₁; "ID₂" and "SL₂" are marked between "in" and "Suzhou", indicating that only user ID₂ can use this path during decoding and that the corresponding probability score is SL₂; "ID₂", "SL₂", "ID₃" and "SL₃" are marked between "in" and "Jiangsu", indicating that both user ID₂ and user ID₃ can use this path during decoding, with probability score SL₂ for user ID₂ and SL₃ for user ID₃.
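The FIG. 6 labelling can be pictured as a table that maps each word-to-word path to the user IDs marked on it and their probability scores. The sketch below filters that table by user ID; the numeric scores and the function names are hypothetical, and only the language-model part of the score is considered here (acoustic scores are left out for brevity).

```python
# Each word-to-word path carries {user_id: probability score}.
# The scores are hypothetical; the labels follow the FIG. 6 description.
edges = {
    ("in", "Beijing"): {"ID1": 0.6},
    ("in", "Suzhou"): {"ID2": 0.3},
    ("in", "Jiangsu"): {"ID2": 0.5, "ID3": 0.4},
}


def usable_paths(edges, user_id):
    """Paths marked with `user_id`, with that user's probability score."""
    return {path: marks[user_id] for path, marks in edges.items() if user_id in marks}


def best_next_word(edges, user_id, word):
    """Highest-scoring successor of `word` for this user (language score only)."""
    candidates = {p: s for p, s in usable_paths(edges, user_id).items() if p[0] == word}
    return max(candidates, key=candidates.get)[1] if candidates else None


print(usable_paths(edges, "ID1"))
print(best_next_word(edges, "ID2", "in"))
```

Storing the per-user scores on the shared path, rather than duplicating the path per user, is what keeps a single network sufficient for all users.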

As a possible implementation, step S202 specifically includes: obtaining the language model corresponding to the user ID according to the user ID; and searching the decoding network for the optimal path corresponding to the input voice according to the language model corresponding to the user ID.

In specific implementation, each user ID corresponds to one language model, which is trained on the corpus data in the corpus corresponding to that user ID. The language model corresponding to the user ID is obtained based on the user ID of the input voice, and that language model is used to search the decoding network for the optimal path corresponding to the input voice, providing personalized voice recognition services for different users. When the speech recognition service is performed, the language model specific to the user ID is loaded into the decoder in advance, while the language models of other user IDs are not loaded, so that multiple users share one general decoding network while each keeps its own distinctive language model.

On the basis of any of the above embodiments, in order to make the constructed decoding network applicable to more users, the embodiment of the present application uses a full dictionary to construct a decoding network shared by multiple users.

The full dictionary in the embodiment of this application is a dictionary containing a large number of commonly used words. In specific implementation, the full dictionary contains more than 100,000 entries and can cover different topics in multiple fields. The entries in the full dictionary include both single characters and words. The full dictionary can cover all the words contained in the corpus corresponding to each user ID.

The method of constructing a decoding network shared by multiple users based on a full dictionary is similar to the method of constructing a decoding network based on a corpus corresponding to multiple users, and will not be repeated here.

When a new user needs to use the decoding network, it is only necessary to train the general language model on the corpus corresponding to that user to obtain the user-specific language model, to determine, according to the user's language model, the probability scores corresponding to the paths between the word nodes in the decoding network, and then to mark the user's ID and the corresponding probability scores on the paths between the word nodes in the decoding network.

In the method of the embodiment of this application, a full dictionary is used to construct the decoding network, so that the constructed decoding network can be applied to more users. In addition, when a new user is added, the nodes in the decoding network (including word nodes and phoneme nodes) do not need to be rebuilt. This means that there is no need to rebuild the decoding network or to restart the decoder, so that new users can be added online, ensuring that users can obtain speech recognition services without interruption and improving user experience.
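Adding a user online, as described above, only touches the marks on existing paths. The sketch below assumes a hypothetical `edges` dictionary keyed by (previous word, next word) and a hypothetical `bigram_probs` mapping produced by the new user's language model; neither name comes from the application.

```python
def add_user(edges, user_id, bigram_probs):
    """Mark existing paths with a new user's ID and probability scores.

    edges:        {(prev_word, next_word): {user_id: probability}}
    bigram_probs: {(prev_word, next_word): probability} for the new user
    The set of paths (and therefore the set of nodes) is never changed.
    """
    for edge, prob in bigram_probs.items():
        if edge in edges:  # assumption: pairs absent from the network are skipped
            edges[edge][user_id] = prob
    return edges
```

Because only the per-path marks change, a decoder that already holds `edges` in memory would see the new user's scores without being restarted.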

Based on any of the foregoing embodiments, as shown in FIG. 7, for a decoding network constructed from a full dictionary, the embodiment of the present application can update the language model corresponding to each user ID through the following steps:

S701: Determine that the language model corresponding to the user ID needs to be updated.

Further, it can be determined that the language model corresponding to the user ID needs to be updated through the following steps: detecting whether the corpus corresponding to the user ID is updated; if the corpus corresponding to the user ID is updated, determining that the language model corresponding to the user ID needs to be updated.

In specific implementation, the corpora corresponding to each user ID are collected and stored in the corpus corresponding to that user ID. For example, for smart speakers, music-related corpora can be collected; for an individual user, the corpora that the user inputs when using a smart device can be collected and stored in the user's corpus, so that the user's language model is continuously updated and the accuracy of speech recognition is improved. Whether the corpus corresponding to each user ID has been updated can be detected at scheduled times or periodically. If it is detected that the corpus corresponding to a certain user ID has been updated, the corpora in that corpus are used to train the language model corresponding to the user ID, so as to update that language model. The detection time or detection period can be set according to actual conditions, which is not limited in this embodiment. By setting scheduled or periodic detection tasks, corpus updates can be detected and the language model updated in time, making the model update process more automated and saving manpower.

As a possible implementation, the following steps can be used to detect whether the corpus has been updated: calculate the first summary value of all corpora in the corpus corresponding to the user ID; compare the first summary value with the second summary value; if the first summary value is different from the second summary value, confirm that the corpus corresponding to the user ID has been updated; if the first summary value is the same as the second summary value, confirm that the corpus corresponding to the user ID has not been updated, in which case there is no need to update the language model corresponding to the user ID. The second summary value is the summary value of all the corpora in the corpus corresponding to the user ID after the most recent update.

In specific implementation, the MD5 Message-Digest Algorithm can be used to generate the summary (digest) value of all corpora in the corpus. Each time the language model corresponding to a user ID is updated, the first summary value of the corpus corresponding to the user ID can be stored as the second summary value to be used the next time the corpus is checked for updates.
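The summary-value comparison can be sketched with Python's standard `hashlib` module. The function names are hypothetical, and hashing the entries in sorted order with a newline separator is an assumption made here so that the digest does not depend on storage order; the application does not specify how the corpora are serialized.

```python
import hashlib

def corpus_digest(sentences):
    """MD5 digest over all corpus entries, taken in sorted order (an assumption)."""
    md5 = hashlib.md5()
    for sentence in sorted(sentences):
        md5.update(sentence.encode("utf-8"))
        md5.update(b"\n")  # separator so that entry boundaries affect the digest
    return md5.hexdigest()

def needs_update(sentences, second_summary):
    """Compare the first summary value (current corpus) with the stored second one.

    Returns (changed, first_summary); the caller would store first_summary as
    the new second summary value once the language model has been updated.
    """
    first_summary = corpus_digest(sentences)
    return first_summary != second_summary, first_summary
```

A periodic task would call `needs_update` for each user ID and retrain only the language models whose corpus digest has changed.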

S702: Update the language model according to the corpus in the corpus corresponding to the user ID, and determine the latest probability score corresponding to the path between each word node in the decoding network.

S703: According to the latest probability score, update the probability score corresponding to the user ID marked by the path between the corresponding word nodes in the decoding network.

In specific implementation, the language model is updated according to the corpora in the corpus corresponding to the user ID, and the conditional probabilities between the words appearing in that corpus are re-determined according to the updated language model and taken as the latest probability scores corresponding to the paths between the corresponding word nodes. According to the latest probability scores, the probability scores corresponding to the user ID that are marked on the paths between the corresponding word nodes in the decoding network are updated. When the language model corresponding to the user ID is updated, if a newly usable path is added, the user's ID and the probability score corresponding to the path can be added on the corresponding path of the decoding network. Taking Figure 6 as an example, if the language model of user ID₁ is updated and the path from "in" to "Suzhou" is added, user ID₁ and the corresponding probability score are marked on the path from "in" to "Suzhou".
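Steps S702 and S703 can be illustrated with a deliberately simple language model. The sketch below assumes an unsmoothed bigram model estimated by relative frequency and a whitespace-tokenized corpus; both are stand-ins chosen for brevity (Chinese text would need a word segmenter), and the function names are hypothetical.

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate P(next | prev) by relative frequency, without smoothing."""
    pair_counts, prev_counts = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}

def update_user_scores(edges, user_id, corpus):
    """Overwrite or add this user's score on each existing path (S703)."""
    probs = bigram_probs(corpus)
    for edge, marks in edges.items():
        if edge in probs:
            marks[user_id] = probs[edge]  # updates an old mark or adds a new one
    return edges
```

In this sketch a path that becomes usable for the user after the update simply gains a new mark, which corresponds to the "in" to "Suzhou" example for user ID₁; marks of other users are left untouched.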

Based on any of the above embodiments, the process of performing voice recognition based on the updated language model corresponding to the user ID is roughly as follows: the voice signal corresponding to the user ID is preprocessed and its acoustic feature vector is extracted; the acoustic feature vector is then input into the acoustic model to obtain the phoneme sequence; based on the phoneme sequence and according to the user ID, the path with the highest score is searched for in the decoding network as the optimal path, and the text sequence corresponding to the optimal path is determined as the recognition result of the voice signal.

The score of a path is determined according to the acoustic scores on the path and the probability scores corresponding to the user ID. Specifically, the decoding score on a path can be calculated by the following formula:

$$\mathrm{Score}(L) = \sum_{i} \log SA_i + \sum_{j} \log SL_{j,x}$$

Among them, L is a decoding path, SA_i is the i-th acoustic score on the path L, and SL_{j,x} is the j-th probability score of the user ID x on the path L. Taking Fig. 5 as an example, the score of the decoding result "My Home" for the user whose user ID is ID₁ is (log SA₁ + log SA₂ + log SA₃ + log SL₁). In this embodiment, since all user IDs use the same acoustic model, all user IDs use the same acoustic scores.
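The "My Home" example, with three acoustic scores and one probability score summed in the log domain, can be checked numerically. The values below are invented for illustration and `path_score` is a hypothetical helper.

```python
import math

def path_score(acoustic_scores, lm_scores):
    """Sum of log acoustic scores plus log probability scores on one path."""
    return (sum(math.log(a) for a in acoustic_scores)
            + sum(math.log(p) for p in lm_scores))

# Same acoustic scores for every user; only the probability score differs.
acoustic = [0.9, 0.8, 0.7]
score_id1 = path_score(acoustic, [0.5])  # made-up SL for user ID1
score_id2 = path_score(acoustic, [0.1])  # made-up SL for another user
```

Since the acoustic scores are shared, any difference between `score_id1` and `score_id2` comes solely from the per-user probability score, which is what allows one network to give different results to different users.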

Since the decoding network has been pre-loaded into the decoder, once it is detected that the language model corresponding to a certain user ID needs to be updated, it is only necessary to recalculate, according to the updated language model corresponding to the user ID, the probability scores of the paths between the word nodes in the decoding network; the changes brought by the updated language model are thereby introduced into the decoding network. The decoder then performs the path search on the decoding network with the updated probability scores and can decode the correct result.

In the method of the embodiment of the present application, user IDs are marked on the paths of the constructed decoding network. When the language model of a user needs to be updated, it is only necessary to recalculate, according to the updated language model corresponding to the user ID, the probability scores of the paths between the word nodes in the decoding network, and to update that user's probability scores in the decoding network based on the user ID marked there. The changes brought by the updated language model are thereby introduced into the decoding network, and the decoder performs the path search on the decoding network with the updated probability scores to decode a result that meets the user's personalized needs. Therefore, only one set of decoders needs to be deployed on the server side in order to train a dedicated language model for each user and provide users with personalized speech recognition services, while greatly saving hardware resources.

The method of the embodiment of the present application uses a full dictionary to construct the decoding network, so that the constructed decoding network can be applied to multiple users. In addition, when a language model is updated, the nodes in the decoding network (including word nodes and phoneme nodes) do not need to be rebuilt. This means that there is no need to rebuild the decoding network or to restart the decoder, thus realizing online update of the language model, ensuring that users can obtain speech recognition services without interruption and improving user experience.

Based on any of the foregoing embodiments, the path from each phoneme node in the decoding network to each word node that the phoneme node can reach is also marked with the look-ahead probability corresponding to each user ID. Referring to Figure 6, the path between the phoneme node "b" and the word node "Beijing" is marked with "ID₁" and "LA₁", indicating that on this path the look-ahead probability corresponding to user ID₁ is LA₁; "ID₂" and "LA₂" are marked between "s" and "Suzhou", indicating that on this path the look-ahead probability corresponding to user ID₂ is LA₂; and "ID₂", "LA₂", "ID₃" and "LA₃" are marked between "j" and "Jiangsu", indicating that on this path the look-ahead probability corresponding to user ID₂ is LA₂ and the look-ahead probability corresponding to user ID₃ is LA₃.

Based on the look-ahead probability corresponding to the user ID, in the process of searching for the corresponding word sequence according to the phoneme sequence, the look-ahead probabilities on a path need to be added to the score of the path. That is, during the path search, the intermediate score of the path L is:

$$\mathrm{Score}_{\mathrm{mid}}(L) = \sum_{i} \log SA_i + \sum_{j} \log SL_{j,x} + \sum_{n} \log LA_{n,x}$$

Among them, SA_i is the i-th acoustic score on path L, SL_{j,x} is the j-th probability score corresponding to the user with user ID x on path L, and LA_{n,x} is the n-th look-ahead probability corresponding to the user with user ID x on path L. Introducing the look-ahead probability raises the scores of some paths during pruning and so prevents them from being pruned. Then, after all possible paths have been searched, the look-ahead probabilities on each path are subtracted to obtain the final score of the path:

$$\mathrm{Score}(L) = \mathrm{Score}_{\mathrm{mid}}(L) - \sum_{n} \log LA_{n,x} = \sum_{i} \log SA_i + \sum_{j} \log SL_{j,x}$$

Finally, the path with the highest Score is selected as the decoding result.
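The effect of the look-ahead probability on pruning can be sketched as below. Treating the look-ahead term as a sum of logarithms is an assumption made here for consistency with the log-domain acoustic and probability scores; the beam-style `prune` function, the tuple layout and all numbers are hypothetical.

```python
import math

def intermediate_score(acoustic, lm, lookahead):
    """Score used during the search: look-ahead terms are included."""
    return (sum(map(math.log, acoustic)) + sum(map(math.log, lm))
            + sum(map(math.log, lookahead)))

def final_score(acoustic, lm, lookahead):
    """Score used to pick the result: look-ahead terms are removed again."""
    return (intermediate_score(acoustic, lm, lookahead)
            - sum(map(math.log, lookahead)))

def prune(partial_paths, beam_width):
    """Keep the beam_width partial paths with the highest intermediate score."""
    ranked = sorted(partial_paths,
                    key=lambda p: intermediate_score(p[1], p[2], p[3]),
                    reverse=True)
    return ranked[:beam_width]

# Path "A" has the lower acoustic score but leads to words this user says often.
paths = [("A", [0.5], [], [0.9]), ("B", [0.6], [], [0.2])]
kept = prune(paths, beam_width=1)
```

On acoustic evidence alone path "B" would outrank path "A", but the look-ahead term keeps "A" in the beam; once the search is complete, `final_score` removes the look-ahead contribution so that it does not bias the chosen result.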

When constructing the decoding network, the look-ahead probability, for each user ID, of the paths from each phoneme node to all word nodes that the phoneme node can reach in the decoding network is determined according to the language model corresponding to the user ID. Specifically, the look-ahead probability corresponding to each user ID can be calculated by the following formula:

$$LA(s) = \max_{w \in W(s)} p(w \mid h)$$

Among them, W(s) refers to the set of words corresponding to the word nodes that can be reached from a phoneme node s in the decoding network, h is the corpus used to train the language model corresponding to the user ID, and p(w|h) is the appearance frequency score corresponding to the word w in the set W(s); the appearance frequency score characterizes how frequently the word w appears in the corpus corresponding to the user ID.

In this embodiment, the word nodes in the decoding network that correspond to the words in W(s) are called the target word nodes corresponding to the phoneme node s. As a possible implementation, the appearance frequency score corresponding to each word node is determined in the following way: determine the frequency with which each word node in the decoding network that corresponds to a word in the corpus of the user ID appears in that corpus; and normalize the frequency of each such word node to obtain the appearance frequency score corresponding to the word node.

In this embodiment, the value of the appearance frequency score corresponding to each word node is in the range of [0,1].

For example, taking the node "k" in Figure 3, for each user ID, the set of words corresponding to the target word nodes reachable with node "k" as the starting point of a path is {card, truck, open, open door, triumph, section, lesson}. Based on the corpus corresponding to the user ID, the frequency with which each word in this set appears in the corpus is counted, and the frequencies are normalized to obtain the appearance frequency scores p(card|h), p(truck|h), p(open|h), p(open door|h), p(triumph|h), p(section|h) and p(lesson|h). The largest of these appearance frequency scores is taken as the look-ahead probability corresponding to the user ID on the paths from node "k" to each word node in the set. In other words, the maximum of the appearance frequency scores, determined by the language model corresponding to the user ID, of all target word nodes corresponding to node "k" is used as the look-ahead probability of all paths from node "k" to those target word nodes, so that paths from node "k" with lower acoustic scores are not pruned during decoding with the decoding network.
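The calculation for node "k" can be sketched as follows. The application does not state how the frequencies are normalized; dividing each count by the total count of dictionary words found in the user's corpus is an assumption made here, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def appearance_scores(corpus_words, vocabulary):
    """Normalized frequency of each dictionary word in one user's corpus.

    Assumption: counts are divided by the total count of dictionary words
    in the corpus, so every score lies in [0, 1].
    """
    counts = Counter(w for w in corpus_words if w in vocabulary)
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}

def lookahead(reachable_words, scores):
    """Look-ahead probability of a phoneme node: max score over reachable words."""
    return max((scores.get(w, 0.0) for w in reachable_words), default=0.0)

w_k = {"card", "truck", "open", "open door", "triumph", "section", "lesson"}
user_corpus = ["truck", "card", "truck", "open"]  # made-up, already segmented
scores = appearance_scores(user_corpus, w_k)
la_k = lookahead(w_k, scores)
```

For this toy corpus "truck" is the most frequent reachable word, so its score becomes the look-ahead probability on every path leaving node "k" for this user.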

Correspondingly, after determining that the language model needs to be updated, the model update method of the embodiment of the present application further includes the following steps: obtaining the appearance frequency score of each word node for the user ID according to the frequency with which each word node in the decoding network appears in the corpus corresponding to the user ID; for each phoneme node in the decoding network, selecting the maximum of the appearance frequency scores, for the user ID, of the target word nodes corresponding to the phoneme node, and determining it as the latest look-ahead probability, for the user ID, of the paths from the phoneme node to each target word node; and, according to the latest look-ahead probability, updating the look-ahead probability corresponding to the user ID on the paths from the phoneme node to the target word nodes in the decoding network.

Further, obtaining the appearance frequency score corresponding to each word node according to the frequency with which each word node in the decoding network appears in the corpus includes: determining the frequency with which each word node in the decoding network that corresponds to a word in the corpus of the user ID appears in that corpus; and normalizing the frequency of each such word node to obtain the appearance frequency score corresponding to the word node.

Similarly, when updating the look-ahead probability corresponding to each user ID in the decoding network, there is no need to modify the nodes (including word nodes and phoneme nodes) in the decoding network. Once it is detected that the language model corresponding to a certain user ID needs to be updated, it is only necessary to recalculate, according to the updated language model, the look-ahead probability of the paths from each phoneme node to its target word nodes in the decoding network; the changes brought by the updated language model are thereby introduced into the decoding network, preventing paths with lower acoustic scores from being pruned during path pruning. The decoder performs the path search on the decoding network with the updated look-ahead probabilities and can decode the correct result.
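The update of the look-ahead marks can be sketched in the same spirit as the probability score update. The data layout (`phoneme_edges`, `reachable`) and the decision to skip phoneme nodes whose reachable words never occur in the user's corpus are assumptions of this sketch, not details given in the application.

```python
def update_lookahead(phoneme_edges, reachable, user_id, scores):
    """Refresh one user's look-ahead marks on phoneme-to-word paths.

    phoneme_edges: {(phoneme, word): {user_id: look-ahead probability}}
    reachable:     {phoneme: [target words]}
    scores:        {word: appearance frequency score for this user}
    Only the marks on existing paths change; no node or path is added.
    """
    for phoneme, words in reachable.items():
        la = max((scores.get(w, 0.0) for w in words), default=0.0)
        if la == 0.0:
            continue  # assumption: leave paths the user never uses unmarked
        for w in words:
            phoneme_edges[(phoneme, w)][user_id] = la
    return phoneme_edges
```

All paths leaving the same phoneme node receive the same value for a given user, which matches the description that the maximum appearance frequency score is shared by every path from that node to its target word nodes.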

The voice recognition method of the embodiment of the present application can be used to recognize any language, such as Chinese, English, Japanese, German, etc. In the embodiments of the present application, the description is mainly based on the speech recognition of Chinese as an example, and the voice recognition methods of other languages are similar to this, and the embodiments of the present application will not be illustrated one by one.

As shown in FIG. 8, based on the same inventive concept as the above-mentioned speech recognition method, an embodiment of the present application further provides a speech recognition device 80, which includes an obtaining module 801, a decoding module 802, and a determining module 803.

The obtaining module 801 is used to obtain the input voice and the user ID corresponding to the input voice.

The decoding module 802 is configured to search for the optimal path corresponding to the input voice in the decoding network according to the user ID, and the path between the word nodes in the decoding network is marked with the user ID.

The determining module 803 is configured to determine text information corresponding to the input voice according to the optimal path.

Further, the decoding module 802 is specifically configured to determine the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.

Further, the decoding module 802 is specifically configured to: obtain the language model corresponding to the user ID according to the user ID; according to the language model corresponding to the user ID, search for the optimal path corresponding to the input voice in the decoding network.

Based on any of the above embodiments, the decoding network is constructed based on a full dictionary.

Further, the speech recognition device 80 of the embodiment of the present application further includes a model update module, which is used to: determine that the language model corresponding to the user ID needs to be updated; update the language model according to the corpus in the corpus corresponding to the user ID, and determine the decoding network The latest probability score corresponding to the path between each word node; according to the latest probability score, the probability score corresponding to the user ID marked by the path between the corresponding word nodes in the decoding network is updated.

Further, the model update module is specifically used to detect whether the corpus corresponding to the user ID is updated; if the corpus corresponding to the user ID is updated, it is determined that the language model corresponding to the user ID needs to be updated.

Further, the model update module is specifically configured to: calculate the first summary value of all corpora in the corpus corresponding to the user ID; and compare the first summary value with the second summary value and, if they are not the same, confirm that the corpus corresponding to the user ID has been updated, wherein the second summary value is the summary value of all corpora in the corpus corresponding to the user ID after the most recent update.

Based on any of the above embodiments, the model update module is also used to: obtain the appearance frequency score of each word node for the user ID according to the frequency with which each word node in the decoding network appears in the corpus corresponding to the user ID; for each phoneme node in the decoding network, select the maximum of the appearance frequency scores, for the user ID, of the target word nodes corresponding to the phoneme node, and determine it as the latest look-ahead probability, for the user ID, of the paths from the phoneme node to each target word node; and, according to the latest look-ahead probability, update the look-ahead probability corresponding to the user ID on the paths from the phoneme node to the target word nodes in the decoding network.

Further, the model update module is specifically used to: determine the frequency with which each word node in the decoding network that corresponds to a word in the corpus of the user ID appears in that corpus; and normalize the frequency of each such word node to obtain the appearance frequency score corresponding to the word node.

The voice recognition device provided in the embodiment of the present application adopts the same inventive concept as the above-mentioned voice recognition method, and can achieve the same beneficial effects, which will not be repeated here.

Based on the same inventive concept as the above-mentioned voice recognition method, embodiments of the present application also provide an electronic device. The electronic device can be a controller of a smart device (such as a robot or a smart speaker), a desktop computer, a portable computer, a smart phone, a tablet computer, a personal digital assistant (PDA), a server, etc. As shown in FIG. 9, the electronic device 90 may include a processor 901, a memory 902, and a transceiver 903. The transceiver 903 is used to receive and send data under the control of the processor 901.

The memory 902 may include a read only memory (ROM) and a random access memory (RAM), and provides the processor with program instructions and data stored in the memory. In the embodiment of the present application, the memory may be used to store the program of the voice recognition method.

The processor 901 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or a CPLD (Complex Programmable Logic Device). The processor implements the voice recognition method in any of the foregoing embodiments according to the obtained program instructions by calling the program instructions stored in the memory.

The embodiment of the present application provides a computer-readable storage medium for storing computer program instructions used for the above-mentioned electronic device, which includes a program for executing the above-mentioned voice recognition method.

The above-mentioned computer storage medium may be any available medium or data storage device that the computer can access, including but not limited to magnetic storage (such as floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical storage (such as CD, DVD, BD, HVD, etc.), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state drive (SSD)), etc.

As mentioned above, the above embodiments are only used to introduce the technical solutions of the present application in detail, but the descriptions of the above embodiments are only used to help understand the methods of the embodiments of the present application, and should not be construed as limiting the embodiments of the present application. Any changes or substitutions that can be easily conceived by those skilled in the art should be covered by the protection scope of the embodiments of the present application.

Claims

1. A speech recognition method, characterized in that it comprises:

acquiring an input voice and a user ID corresponding to the input voice;

searching for the optimal path corresponding to the input voice in a decoding network according to the user ID, wherein the paths between the word nodes in the decoding network are marked with user IDs; and

determining text information corresponding to the input voice according to the optimal path.
2. The method according to claim 1, wherein the searching for the optimal path corresponding to the input voice in a decoding network according to the user ID comprises:

determining the optimal path corresponding to the input voice according to the probability scores corresponding to the user ID that are marked on the paths between the word nodes in the decoding network.
3. The method according to claim 1, wherein the searching for the optimal path corresponding to the input voice in a decoding network according to the user ID comprises:

obtaining the language model corresponding to the user ID according to the user ID; and

searching for the optimal path corresponding to the input voice in the decoding network according to the language model corresponding to the user ID.
4. The method according to any one of claims 1 to 3, wherein the decoding network is constructed based on a full dictionary.
5. The method according to claim 4, wherein the language model corresponding to the user ID is updated in the following manner:

determining that the language model corresponding to the user ID needs to be updated;

updating the language model according to the corpora in the corpus corresponding to the user ID, and determining the latest probability scores corresponding to the paths between the word nodes in the decoding network; and

updating, according to the latest probability scores, the probability scores corresponding to the user ID that are marked on the paths between the corresponding word nodes in the decoding network.
6. The method according to claim 5, wherein the determining that the language model corresponding to the user ID needs to be updated comprises:

detecting whether the corpus corresponding to the user ID has been updated; and

if the corpus corresponding to the user ID has been updated, determining that the language model corresponding to the user ID needs to be updated.
7. The method according to claim 6, wherein the detecting whether the corpus corresponding to the user ID has been updated comprises:

calculating the first summary value of all corpora in the corpus corresponding to the user ID; and

comparing the first summary value with the second summary value and, if they are not the same, confirming that the corpus corresponding to the user ID has been updated, wherein the second summary value is the summary value of all corpora in the corpus corresponding to the user ID after the most recent update.
8. The method according to any one of claims 5 to 7, wherein after determining that the language model corresponding to the user ID needs to be updated, the method further comprises:

obtaining the appearance frequency score of each word node for the user ID according to the frequency with which each word node in the decoding network appears in the corpus corresponding to the user ID;

for each phoneme node in the decoding network, selecting the maximum of the appearance frequency scores, for the user ID, of the target word nodes corresponding to the phoneme node, and determining it as the latest look-ahead probability, for the user ID, of the paths from the phoneme node to each target word node; and

updating, according to the latest look-ahead probability, the look-ahead probability corresponding to the user ID on the paths from the phoneme node to the target word nodes in the decoding network.
9. The method according to claim 8, characterized in that obtaining the appearance frequency score corresponding to each word node according to the frequency with which each word node in the decoding network appears in the corpus corresponding to the user ID comprises:

determining the frequency with which each word node in the decoding network that corresponds to a word in the corpus of the user ID appears in that corpus; and

normalizing the frequency of each such word node to obtain the appearance frequency score corresponding to the word node.
10. A speech recognition device, characterized in that it comprises:

an obtaining module, configured to obtain an input voice and a user ID corresponding to the input voice;

a decoding module, configured to search for the optimal path corresponding to the input voice in a decoding network according to the user ID, wherein the paths between the word nodes in the decoding network are marked with user IDs; and

a determining module, configured to determine the text information corresponding to the input voice according to the optimal path.
11. An electronic device, including a transceiver, a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the transceiver is used for receiving and sending data under the control of the processor, and the steps of the method according to any one of claims 1 to 9 are implemented when the processor executes the program.
12. A computer-readable storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.