CN114528812A - Voice recognition method, system, computing device and storage medium

Voice recognition method, system, computing device and storage medium

Info

Publication number
CN114528812A
CN114528812A
Authority
CN
China
Prior art keywords
phoneme, phonemes, phoneme sequence, sequence, similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011219334.XA
Other languages
Chinese (zh)
Inventor
王凯
李标
刘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202011219334.XA
Publication of CN114528812A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a speech recognition method and a speech recognition system. The method includes: receiving an uploaded custom hotword and a similarity configuration; converting the custom hotword into a first phoneme sequence; expanding the first phoneme sequence according to an expansion rule obtained from a clustering algorithm model to obtain an expanded first phoneme sequence; converting received audio data into a second phoneme sequence; calculating the similarity between the expanded first phoneme sequence and the second phoneme sequence; and determining a speech recognition result of the audio data according to the similarity and the similarity configuration. Because the method expands the custom hotword based on the clustering algorithm model, the uploaded custom hotword is used more effectively and the accuracy of the speech recognition result is improved.

Description

Voice recognition method, system, computing device and storage medium
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech recognition method and system.
Background
Speech recognition is a technology that enables machines to convert speech signals into corresponding text or commands through recognition and understanding. It mainly involves three aspects: feature extraction, pattern matching criteria, and model training.
Hotwords are now widely used in speech recognition. In a Chinese-language environment in particular, homophones written with different characters are common and local language habits are complex. Introducing hotwords alleviates this problem to a certain extent: by uploading custom hotwords, a user can increase the probability that those words are recognized, thereby improving recognition accuracy.
For custom hotwords uploaded by users, how to match their phonemes more accurately is a problem of major concern to the industry.
Disclosure of Invention
The present application provides a speech recognition method and a speech recognition system that improve the recognition accuracy of hotwords.
A first aspect of the present application provides a speech recognition method applied to a speech recognition system. The method includes: after receiving an uploaded custom hotword and a similarity configuration, the speech recognition system converts the custom hotword into a custom hotword phoneme sequence, which serves as a first phoneme sequence. The first phoneme sequence is expanded according to an expansion rule obtained from a clustering algorithm model, yielding an expanded first phoneme sequence. After receiving uploaded audio data, the speech recognition system converts the audio data into an audio data phoneme sequence, which serves as a second phoneme sequence. The similarity between the expanded first phoneme sequence and the second phoneme sequence is then calculated, and the speech recognition result of the audio data is determined from the relationship between the similarity and the similarity configuration.
Because the method expands the custom hotword based on a clustering algorithm model whose construction takes language features such as language and dialect into account, the matching phoneme sequences of words such as person names, place names, and proper nouns are expanded, which effectively improves the accuracy of the speech recognition result.
The speech recognition system may receive the custom hotword, the similarity configuration, and the audio data through the same instruction, or through different instructions.
In some possible designs, the clustering algorithm model includes a binary decision tree model.
In some possible designs, the method further includes: inputting a phoneme set into the binary decision tree model, the phoneme set including a plurality of phonemes. The phoneme set may be a set of phonemes established by linguists based on language, dialect, and the like. Optionally, the phoneme set may also be the set of phonemes corresponding to the audio data in the corpus to be customized. After each phoneme in the phoneme set has been assigned to a leaf node of the binary decision tree model, the distance between every two phonemes in the phoneme set is determined. The distance between two phonemes indicates the distance, within the binary decision tree, between the leaf nodes to which the two phonemes are assigned. The expansion rule is then obtained from the distances between the phonemes in the phoneme set; it records the correspondence between each phoneme in the set and its expansion phonemes.
In some possible designs, the expansion rule may be obtained before the custom hotword and the similarity configuration are received, or after the corpus to be customized is received.
In some possible designs, the distance between two phonemes is calculated as the number of branch nodes included in the path from the leaf node holding one phoneme to the leaf node holding the other. The distance is compared with a distance threshold; if the distance does not exceed the distance threshold, the second phoneme is written into the expansion rule as an expansion phoneme of the first phoneme.
In some possible designs, the phonemes in the expanded first phoneme sequence are each fitted to a model, and the phonemes in the second phoneme sequence are each fitted to a model. The model distance between a model fitted to a phoneme in the expanded first phoneme sequence and a model fitted to a phoneme in the second phoneme sequence is then calculated, and the similarity between the two phonemes is obtained from that model distance.
By fitting each phoneme to a model and calculating the model distances precisely, the similarity between the two phoneme sequences is obtained from the model distances, which further improves the accuracy of the speech recognition result.
A second aspect of the present application provides a speech recognition system for performing the method of the first aspect. Specifically, the system includes an interaction module, a processing module, and a storage module. The interaction module is configured to receive the uploaded custom hotword, similarity configuration, and audio data. The storage module is configured to store the custom hotword, the similarity configuration, and the audio data. The processing module is configured to expand the first phoneme sequence according to an expansion rule obtained from a clustering algorithm model to obtain an expanded first phoneme sequence; convert the audio data into an audio data phoneme sequence, which serves as the second phoneme sequence; calculate the similarity between the expanded first phoneme sequence and the second phoneme sequence; and determine the speech recognition result of the audio data according to the similarity configuration and the similarity. The interaction module is further configured to return the speech recognition result of the audio data.
The speech recognition system may be deployed on one or more servers, or partly on a server and partly on a terminal device.
In some possible designs, the clustering algorithm model includes a binary decision tree.
In some possible designs, the processing module is further configured to: input the phoneme set into the binary decision tree, the phoneme set including a plurality of phonemes; assign each phoneme in the phoneme set to a leaf node of the binary decision tree; determine the distances between the phonemes in the phoneme set, where the distance between two phonemes indicates the distance, within the binary decision tree, between the leaf nodes to which the two phonemes are assigned; and obtain the expansion rule from the distances between the phonemes in the phoneme set, the expansion rule including the correspondence between each phoneme in the set and its expansion phonemes.
In some possible designs, the processing module is further configured to obtain the distance between a first phoneme and a second phoneme from the number of branch nodes included in the path from the leaf node holding the first phoneme to the leaf node holding the second phoneme, and, if that distance does not exceed a distance threshold, to write the second phoneme into the expansion rule as an expansion phoneme of the first phoneme.
A third aspect of the present application provides a cluster of computing devices comprising at least one computing device, each computing device comprising a processor and a memory. The processor of the at least one computing device is configured to execute the instructions stored in the memory to cause the cluster of computing devices to perform the method as described in the first aspect or any implementation of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium comprising computer program instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of the first aspect or any implementation of the first aspect.
A fifth aspect of the present application provides a computer program product comprising instructions that, when run on a cluster of computing devices, cause the cluster of computing devices to perform the method of the first aspect or any implementation of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below.
Fig. 1 is an architecture diagram of a speech recognition system according to an embodiment of the present application;
fig. 2A is a schematic view of an application scenario of a speech recognition system according to an embodiment of the present application;
fig. 2B is a schematic view of an application scenario of another speech recognition system according to an embodiment of the present application;
fig. 3 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an interactive interface of a client according to an embodiment of the present application;
fig. 5 is a schematic diagram of an interactive interface of a client according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a binary decision tree model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an extended rule provided in an embodiment of the present application;
FIG. 8 is an architecture diagram of yet another speech recognition system provided by an embodiment of the present application;
FIG. 9 is an architecture diagram of yet another speech recognition system provided by an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Some technical terms referred to in the embodiments of the present application will be first described.
A phoneme (phone) is the smallest phonetic unit, divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, with one action constituting one phoneme. For example, English as described by the International Phonetic Alphabet has 48 phonemes, of which 20 are vowel phonemes and 28 are consonant phonemes, and standard Chinese has 130 phonemes.
A hotword (hot word) is a word commonly used by users in linguistic expression. In the financial field, for example, hotwords may be terms such as interest rate, offshore finance, and position. For an individual user, hotwords may be words such as the names of family members or friends and frequently visited place names. A hotword list may contain one or more collections of hotwords, such as one or more of person names, place names, or proper nouns.
With the development of mobile communication and artificial intelligence, voice interaction technology has gradually entered people's lives, and speech recognition, as a basic capability, has become an important part of ensuring reliable service and improving user experience. Although recognition systems built on deep-learning acoustic models have outperformed humans on some public data sets, general-domain acoustic models still perform poorly when recognizing domain-specific speech. Training an acoustic model on corpora from the specific field is one solution, but it has obvious drawbacks: bringing a domain model online takes a long time, and collecting and labeling domain corpus data consumes a large amount of manpower and material resources. Custom hotword recognition has therefore become another research focus: it supports hotwords uploaded by the user and modifies the recognition result, during or after recognition, according to the user's word information, so as to improve recognition accuracy.
More specifically, on the one hand, phoneme conversion and labeling of the custom hotword is usually implemented by querying a pronunciation dictionary or using a G2P (grapheme-to-phoneme) model. On the other hand, after the speech input of the terminal device has been encoded and decoded, a phoneme sequence set containing N-best paths, such as a decoding word lattice, is generated. The phoneme sequence corresponding to the custom hotword is matched against the phoneme sequence converted from the audio recognition result of the terminal device, and the decoding word lattice, or the recognition result directly, is modified. The matching modes commonly used at present are exact matching, rule matching, and edit-distance matching. Exact matching searches the phoneme sequence converted from the recognition result for the custom hotword phoneme sequence character by character; the two match only if the hotword phoneme sequence is found in full, otherwise they do not match. Rule matching expands some of the initial and final phonemes of the converted custom hotword through regular expressions or logical rules so as to match more similar sounds, for example by enumerating regular pronunciation deviations caused by dialects. Edit-distance matching measures the degree of similarity more precisely: the converted hotword phoneme sequence to be matched and the phoneme sequence corresponding to the recognized word are both treated as text sequences, the difference between them is calculated by counting the numbers of deleted, replaced, and added characters, and whether the two match is decided through a configuration. It is worth noting that the phoneme sequence itself is a text signal, not the sound signal it represents. For example, the phoneme sequences of s i1 (thought), c i1 (this), and p i1 (cape) each differ from one another by a single character, yet s i1 and c i1 are similar in pronunciation while s i1 and p i1 are different in pronunciation, a distinction that pure edit distance cannot capture. In addition, the weight of a custom hotword in the decoding word lattice is fixed, so the overall scheme is inflexible and prone to false triggering. For custom hotwords uploaded by users, how to match their phonemes more accurately is therefore a problem of major concern to the industry.
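The following minimal sketch, given purely as an illustration with hypothetical phoneme tokens, shows the behaviour described above: a plain edit distance over phoneme tokens assigns the same score to a phonetically close pair and a phonetically distant pair.

```python
# Illustrative sketch (hypothetical phoneme labels): a plain edit distance over
# phoneme tokens cannot distinguish a phonetically close pair from a distant one.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # delete x
                                     dp[j - 1] + 1,     # insert y
                                     prev + (x != y))   # substitute x by y
    return dp[-1]

print(edit_distance(["s", "i1"], ["c", "i1"]))  # 1 -- similar pronunciation
print(edit_distance(["s", "i1"], ["p", "i1"]))  # 1 -- different pronunciation, same score
```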
In view of this, the present application provides a speech recognition method capable of more accurate phoneme matching. The method may be performed by a speech recognition system. Specifically, the custom hotwords uploaded by the user are expanded based on a phoneme-clustering binary decision tree. The similarity between the phoneme sequence converted from the expanded custom hotword and the phoneme sequence obtained by speech recognition decoding is then measured, a decoding path with a dynamic weight is added according to the relationship between the similarity and the configuration, and the speech recognition result is returned to the terminal device.
On the one hand, the method can autonomously discover phonemes whose pronunciation is similar to those in the custom hotword phoneme sequence, which improves the hit rate of similar phonemes and mitigates phoneme inference errors caused by dialectal pronunciation deviations or poor audio quality. On the other hand, the method can calculate the similarity between the two groups of phoneme sequences precisely and use the configuration to control the matching result so as to dynamically modify the cost of the decoding path, further improving the accuracy of speech recognition.
In order to make the technical solution of the present application clearer and easier to understand, a system architecture of a speech recognition method provided in the embodiments of the present application is described below with reference to the drawings.
FIG. 1 illustrates an exemplary architecture of a speech recognition system. As shown in FIG. 1, communication paths are established between the speech recognition system 1000 and both the service device cluster 1100 and the terminal device cluster 1200. The speech recognition system 1000 is configured to perform speech recognition on audio data input by a terminal device, according to data such as the custom hotwords uploaded by the service device or the terminal device.
The service device cluster 1100 may contain one or more service devices. The service devices may be different parts of the same class of service device, for example different interfaces of the same application software, or they may be different classes of service devices, for example different application software. Each service device may upload its own custom hotword list to the speech recognition system 1000; because the service devices may belong to different domains, the hotword lists they upload may differ. The terminal device cluster 1200 may likewise contain one or more terminal devices. A terminal device may be a multimedia device, such as a mobile phone, a computer, or a smart speaker.
As shown in FIG. 1, the speech recognition system 1000 is connected to the terminal device cluster 1200, which includes one or more terminal devices that may be divided into M classes. In some implementations, a service device corresponds to one or more classes of terminal devices. For example, in FIG. 1, the N terminal devices 1200_1_1 to 1200_1_N correspond to service device 1100_1, the N terminal devices 1200_2_1 to 1200_2_N correspond to service device 1100_2, and so on. The number of terminal devices corresponding to each service device, or to each class of service device, may be the same or different. Each terminal device may correspond to one or more service devices, so terminal devices of different classes may correspond to the same service device.
The speech recognition system 1000 may accept different custom hotword lists uploaded by multiple service devices, and the custom hotword list uploaded by each service device is independent; that is, the hotword list uploaded by a service device is used only for the decoding and phoneme conversion of audio data uploaded by the terminal devices corresponding to that service device. Specifically, the speech recognition system 1000 decodes the encoded speech sequence uploaded by a terminal device using a generic speech recognition model, calculates the similarity between the custom hotwords and the phoneme sequence corresponding to the speech recognition result, modifies the decoding path accordingly, and outputs the final recognition result. For example, the custom hotword list of service device 1100_1 may serve the N terminal devices 1200_1_1 to 1200_1_N, the custom hotword list of service device 1100_2 may serve the N terminal devices 1200_2_1 to 1200_2_N, and so on.
In some implementations, the service devices in the service device cluster 1100 share one custom hotword list, which affects the decoding and phoneme conversion of the audio data uploaded by every terminal device in the terminal device cluster 1200. Further, under this framework, the corresponding speech data features can be extracted from the domain or dialect training corpus, provided by the service device, that needs to be customized, so that the speech recognition system 1000 itself can be customized. Specifically, customizing the speech recognition system 1000 covers the phoneme expansion model and the feature distribution model. The phoneme expansion model clusters phonemes using a binary decision tree whose splitting factors, i.e., the speech data features, are extracted from the corpus provided by the service device. The feature distribution model fits the feature distribution of each phoneme using a Gaussian mixture model (GMM) or a deep neural network model, and its parameters can be updated from the training corpus provided by the service device.
In some implementations, the service device and the terminal device may be the same device: such a device both sends custom hotwords to the speech recognition system 1000 and sends audio data to the system 1000 and receives the speech recognition results.
Next, an application scenario of the speech recognition system 1000 in the present application will be described.
The modules in the speech recognition system 1000 may be deployed on the same device or on different devices.
In some implementations, all the modules of the speech recognition system 1000 are deployed on servers. As shown in FIG. 2A, the speech recognition system 1000 may be deployed on one or more local servers, or on one or more cloud servers.
In some implementations, the speech recognition system 1000 includes a sub speech recognition system 1000_A running on a server and a sub speech recognition system 1000_B running on a terminal device. As shown in FIG. 2B, the sub speech recognition system 1000_A is deployed on the server side and the sub speech recognition system 1000_B on the terminal device side. The server may be one or more cloud servers, or one or more local servers.
Next, from the perspective of the speech recognition system 1000, the speech recognition method provided in the embodiment of the present application will be described in detail.
Referring to fig. 3, a flow chart of a speech recognition method is shown, the method comprising:
s2002: the speech recognition system 1000 receives information uploaded by the device.
In some implementations, as in the application scenario shown in FIG. 2A, the speech recognition system 1000 (specifically, the interaction module 1002) may collect information through a client on the service device side. The information uploaded by the service device side includes the similarity configuration and the custom hotword list; optionally, it may further include the training corpus to be customized, which consists of audio data and the corresponding labeled text information.
Next, how the service device uploads information to the speech recognition system 1000 is described. FIG. 4 provides a schematic diagram of the interactive interface of a client running on the service device side. As shown in FIG. 4, the interactive interface 3000 contains controls for setting and uploading. Specifically, it includes one or more of the following controls: a file selection control 3002, an upload control 3004, a similarity setting control 3008, and a determination control 3010. Through the file selection control 3002, the user on the service device side can provide a custom hotword file for uploading. The custom hotword file must meet the corresponding quantity and format requirements when uploaded, including the text layout format and the file encoding, to facilitate the subsequent steps. When needed, the user can also upload the domain or dialect corpus to be customized through the file selection control 3002. After a file to be uploaded is selected, its path or file name is displayed in the blank box above the file selection control 3002. After the user on the service device side confirms that the selected file is correct, triggering the upload control 3004 completes the initial upload of the file.
The user on the service device side can set the similarity configuration through the similarity setting control 3008. The similarity configuration may include several similarity levels and/or a custom similarity parameter; different similarity levels correspond to different similarity parameters, and the user may also enter a customized similarity parameter. Generally speaking, the higher the similarity parameter, the stricter the matching accuracy the user requires. For example, the similarity setting control 3008 may provide three similarity levels (high, medium, and low) whose corresponding similarity parameters decrease in turn. By exposing the similarity configuration through the similarity setting control 3008, the user can control how strongly hotwords are weighted and excited; setting a reasonable similarity configuration for the domain and usage scenario avoids hotword matching that is too loose or too strict, which further improves the accuracy of speech recognition. The user on the service device side can then determine the transmission path of the file or data and trigger the upload of the information by clicking the determination control 3010.
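As an illustration of such a configuration, the sketch below maps three similarity levels to decreasing similarity parameters and allows a user-supplied custom value; the numeric values are hypothetical assumptions, not values specified in this application.

```python
# Hypothetical sketch of the similarity configuration exposed by control 3008:
# three named levels mapped to decreasing similarity parameters, with an
# optional user-supplied custom value. The numbers are assumptions.
SIMILARITY_LEVELS = {"high": 0.9, "medium": 0.8, "low": 0.7}

def resolve_similarity(level="medium", custom=None):
    """Return the similarity parameter from a level name or a custom value."""
    return float(custom) if custom is not None else SIMILARITY_LEVELS[level]

print(resolve_similarity("low"))        # 0.7
print(resolve_similarity(custom=0.85))  # user-defined parameter
```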
The file selection, upload, and similarity configuration operations in the client described above are handled by the speech recognition system 1000 in FIG. 2A, and the files and similarity configurations uploaded by the user on the service device side are stored in the speech recognition system 1000.
It should be noted that, on the service device side, the uploads of the three types of information, namely the similarity configuration, the custom hotword list, and the training corpus to be customized, may be controlled by the same instruction or by different instructions.
In some implementations, as in the application scenario shown in FIG. 2B, the speech recognition system 1000 (specifically, the interaction module 1002) may collect information through clients on the service device side and on the terminal device side, respectively. The information uploaded by the service device side includes the similarity configuration and may optionally further include one or both of the custom hotword list and the corpus to be customized; the upload operations on the service device side are similar to those in the application scenario of FIG. 2A. The information uploaded by the terminal device side includes a private custom hotword list, i.e., a custom hotword list that the user chooses, for reasons of privacy protection or personal preference, to upload only to the sub speech recognition system 1000_B on the terminal device side and not to any other device. Specifically, after signing the relevant agreement with the service provider on the service device side or on the speech recognition system side, the user on the terminal device side may choose not to grant other devices read access to information such as the frequently used addresses in the user's address book or map application. Information such as person names in the address book and place names among the frequently used addresses can then serve as the private custom hotword list. It should be noted that the other devices include servers other than the terminal device as well as other terminal devices; therefore, the private custom hotword list of any user is not shared with other terminal devices in the terminal device cluster 1200.
The sub speech recognition system 1000_B on the terminal device side may read or call the private custom hotword list uploaded by the terminal-side user. Because the private custom hotword list cannot be uploaded to other devices, some modules of the speech recognition system 1000 need to be deployed on the terminal device side. It should be noted that storing the custom hotword list on the terminal device and performing the related computation there occupies storage and operating memory on the terminal device and may therefore affect its operating speed.
Next, how the terminal device uploads information to the speech recognition system 1000 is described. FIG. 5 provides a schematic diagram of the interactive interface of a client running on the terminal device side. As shown in FIG. 5, the interactive interface 4000 contains controls for setting and uploading. Specifically, it includes at least a permission opening control 4004 and a determination control 4006.
The user can select the range of files whose access permission is opened through the permission opening control 4004. The files covered by the opened permission may include files inside application software. For example, after the user clicks the address book option in the permission opening control 4004 and clicks the determination control 4006 to complete the upload, the sub speech recognition system 1000_B can read or call the private custom hotword list, i.e., the list of names in the address book.
The permission opening and uploading operations described above are handled by the sub speech recognition system 1000_B in FIG. 2B, and the private custom hotword list provided by the user is stored in the sub speech recognition system 1000_B.
S2004: the speech recognition system 1000 builds expansion rules from the clustering algorithm model. And inputting the phoneme set into the clustering algorithm model to obtain the trained clustering algorithm model. The phone set may be a set of phones built by a linguist based on language, dialect, and the like. Optionally, the phoneme set may also be a set of phonemes corresponding to the audio data in the corpus to be customized. In the embodiment of the present application, the clustering algorithm model may be a binary decision tree or a mean shift clustering model.
Next, an example of using a clustering algorithm model as a binary decision tree model will be described to introduce a method for establishing an extended rule.
FIG. 6 illustrates a binary decision tree model. The nodes in the binary decision tree model are divided into two classes, one class is leaf nodes with zero out-degree, such as nodes 5010 and 5012. One class is branch nodes with out degrees different from zero, such as nodes 5000 and 5002. Where node 5000 is a particular one of the branch nodes, as there is no parent node, also referred to as a root node. And inputting the phoneme set into a node 5000 in the binary decision tree model, and dividing each phoneme in the phoneme set into a leaf node according to the binary decision tree model. Wherein each branch node is provided with a division condition. And each phoneme entering the father node enters the corresponding child node according to the division result. After the phoneme set is clustered by using the binary decision tree model, the binary decision tree model with a certain number of leaf nodes and a certain depth can be obtained. In the embodiment of the present application, each leaf node contains at least one phoneme.
After a phoneme set is input into a binary decision tree model and a binary decision tree model containing one or more phonemes in leaf nodes is obtained, the distance between every two phonemes in the binary decision tree model is determined. The distance may be the distance within the binary decision tree of the leaf node into which the two phonemes are divided. Specifically, the number of branch nodes included in the shortest path from the leaf node where one phoneme is located to the leaf node where the other phoneme is located. For example, the distance between the phonemes on the two leaf nodes (5010 and 5012) respectively under the same parent node (5006) is 1. Where distance 1 indicates that a branch node is included in the shortest path between one phoneme at node 5010 and another phoneme at node 5012 (5006). In the embodiment of the present application, according to a preset distance threshold, if the distance between two phonemes does not exceed the distance threshold, it is considered that the two phonemes can be mutually expanded. For each phoneme, traversing the phonemes on all the leaf nodes within a distance threshold range preset by a user, and pairing each phoneme with the phonemes within the distance threshold range to obtain an expansion rule. The expansion rule includes a set of other phonemes to which each phoneme corresponds and which can be expanded. The distance within the binary decision tree for each phoneme and its extended phoneme does not exceed a distance threshold. In some implementations, in the binary decision tree model, the partition condition on the branch node may be a partition condition summarized by a linguist, or may be a voice feature such as mel-scale frequency cepstral coefficients (MFCCs) as the partition condition.
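As a concrete illustration of how such an expansion rule can be derived from the tree, the sketch below counts the branch nodes on the shortest path between two leaf nodes and pairs every two phonemes whose distance does not exceed the threshold. The node ids and phoneme labels loosely follow FIG. 6 and FIG. 7 but are otherwise hypothetical; the code is a simplified illustration, not a normative implementation.

```python
# Illustrative sketch: derive an expansion rule from a phoneme-clustering binary
# decision tree given as child -> parent links. The distance between two phonemes
# is the number of branch nodes on the shortest path between their leaf nodes.
from itertools import combinations

parent = {5002: 5000, 5004: 5000, 5006: 5002, 5008: 5002,
          5010: 5006, 5012: 5006, 5014: 5008, 5016: 5008}
leaf_phonemes = {5010: ["A", "B"], 5012: ["C"], 5014: ["D", "E"], 5016: ["F", "G"]}

def path_to_root(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def leaf_distance(u, v):
    """Number of branch nodes on the shortest path between leaf nodes u and v."""
    if u == v:
        return 0
    pu, pv = path_to_root(u), path_to_root(v)
    lca = next(n for n in pu if n in set(pv))          # lowest common ancestor
    return pu.index(lca) + pv.index(lca) - 1

def build_expansion_rule(leaf_phonemes, threshold):
    rule = {ph: set() for phs in leaf_phonemes.values() for ph in phs}
    for (u, pu), (v, pv) in combinations(leaf_phonemes.items(), 2):
        if leaf_distance(u, v) <= threshold:            # mutually expandable leaves
            for a in pu:
                for b in pv:
                    rule[a].add(b)
                    rule[b].add(a)
    for phs in leaf_phonemes.values():                  # phonemes sharing a leaf node
        for a in phs:
            rule[a].update(p for p in phs if p != a)
    return {k: sorted(v) for k, v in rule.items()}

print(build_expansion_rule(leaf_phonemes, threshold=3)["A"])
# ['B', 'C', 'D', 'E', 'F', 'G'], matching the FIG. 7 example
```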
It should be noted that, in this kind of implementation, the construction of the clustering algorithm model and of the expansion rule may occur before the information is uploaded in step S2002. Here, the information includes the similarity configuration and the custom hotword list and may optionally further include the training corpus to be customized, which consists of audio data and the corresponding labeled text information. In other words, in this kind of implementation, step S2004 does not necessarily have to be performed after step S2002.
In some implementations, the service device may upload the corpus that needs to be customized, and a customized clustering algorithm model is obtained from this training corpus using an artificial intelligence model. The artificial intelligence model may be a model built on artificial intelligence algorithms such as neural networks or machine learning. Taking the binary decision tree model as an example, the audio data in the training corpus is used as the input and the corresponding labeled text information as the output, which yields a binary decision tree model customized for the training corpus; the splitting condition of each branch node is generated by the artificial intelligence model.
In this kind of implementation, the construction and training of the customized binary decision tree model described above is performed by the speech recognition system 1000 in FIG. 2A or by the sub speech recognition system 1000_A in FIG. 2B, and the construction of the clustering algorithm model and of the expansion rule occurs after the user information has been uploaded; that is, step S2004 needs to be executed after step S2002.
Once the binary decision tree model has been constructed in either of the two implementations above, the expansion rule can be obtained from the configured distance threshold. The distance threshold can be set by the user on the service device side; for example, as shown in FIG. 4, the user can set it through the distance threshold control 3006. Optionally, the user on the service device side may take no action and use the default distance threshold, i.e., the value initialized in the text box of the distance threshold control 3006. The expansion rule is determined by comparing the distance between two phonemes in the phoneme set with the distance threshold: when the distance between the two phonemes does not exceed the threshold, the two phonemes are considered mutually expandable.
FIG. 7 gives an example of an expansion rule. As shown in FIG. 7, a phoneme set Z containing a plurality of phonemes, such as phonemes A, B, C, D, and E, is input into a binary decision tree model. Taking phoneme A as an example, its expansion phonemes can be obtained from the binary decision tree partitioned in FIG. 6, in which leaf node 5010 contains phonemes A and B, node 5012 contains phoneme C, node 5014 contains phonemes D and E, and node 5016 contains phonemes F and G. In the embodiment of the present application, the distance threshold is set to 3, so the expansion phonemes of phoneme A are phonemes B, C, D, E, F, and G. Phoneme A and any of its expansion phonemes can be expanded into each other; for example, the expansion phonemes of phoneme B include phoneme A, the expansion phonemes of phoneme C include phoneme A, and so on. The correspondence between phoneme A and its expansion phonemes B, C, D, E, F, and G is an expansion rule and is also part of the expansion rule of the phoneme set Z.
The expansion rule is stored in the speech recognition system 1000 in FIG. 2A or in the sub speech recognition system 1000_A in FIG. 2B.
S2006: the speech recognition system 1000 expands the custom hotword phoneme sequence according to the expansion rule. Specifically, after the custom hotword has been converted into a phoneme sequence by querying a pronunciation dictionary or using the G2P model, the custom hotword phoneme sequence can be expanded according to the expansion rule determined in S2004.
First, the custom hotword list is converted into custom hotword phoneme sequences by consulting a pronunciation dictionary or using the G2P model. Because liaison between preceding and following phonemes affects adjacent phonemes, the position information of each phoneme also needs to be labeled. Specifically, the custom hotword is converted by querying the pronunciation dictionary; if no corresponding phonemes can be found, the conversion is done with the G2P model, and the labeling of the phoneme position information is completed. Then, according to the expansion rule obtained in S2004, the phoneme sequence converted from the custom hotword is expanded phoneme by phoneme to obtain the expanded custom hotword phoneme sequence. Specifically, the expansion phonemes corresponding to a phoneme are looked up in the expansion rule obtained in step S2004, and new phoneme sequences are produced by substituting those expansion phonemes, thereby expanding the custom hotword phonemes. A phoneme sequence may contain one or more phonemes. Taking the expansion rule shown in FIG. 7 as an example, the phoneme sequence "xxxAxx" can be expanded into "xxxBxx", "xxxCxx", and so on, so the expanded phoneme sequence will contain at least 7 different phoneme sequences such as "xxxAxx", "xxxBxx", and "xxxCxx".
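A minimal sketch of this single-phoneme substitution step is given below; the expansion rule and the phoneme labels are hypothetical and mirror the "xxxAxx" example above.

```python
# Illustrative sketch of S2006: expand a custom hotword phoneme sequence by
# substituting, one position at a time, each phoneme with its expansion phonemes.
# The expansion rule and phoneme labels are hypothetical examples.

expansion_rule = {"A": ["B", "C", "D", "E", "F", "G"]}  # e.g. derived as in FIG. 7

def expand_sequence(seq, rule):
    """Return the original sequence plus all single-substitution variants."""
    expanded = {tuple(seq)}
    for i, ph in enumerate(seq):
        for alt in rule.get(ph, []):
            expanded.add(tuple(seq[:i] + [alt] + seq[i + 1:]))
    return [list(s) for s in expanded]

hotword_phonemes = ["x", "x", "x", "A", "x", "x"]       # "xxxAxx" in the text above
variants = expand_sequence(hotword_phonemes, expansion_rule)
print(len(variants))  # 7: the original plus the substitutions xxxBxx ... xxxGxx
```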
In some implementations, as shown in FIG. 2A, the custom hotwords uploaded by the service device are stored in the speech recognition system 1000. The conversion from custom hotword to phoneme sequence and the expansion of the custom hotword phoneme sequence described above are both performed by the speech recognition system 1000, and the resulting expanded custom hotword phoneme sequence is stored in the speech recognition system 1000.
In some implementations, as shown in FIG. 2B, the private custom hotwords uploaded by the terminal device to the sub speech recognition system 1000_B are stored in the sub speech recognition system 1000_B. The conversion of the custom hotword into a phoneme sequence and the expansion of the custom hotword phoneme sequence are both performed by the sub speech recognition system 1000_B, and the resulting expanded private custom hotword phoneme sequence is stored in the sub speech recognition system 1000_B. In this kind of implementation, the service device may also upload custom hotwords, which are stored in the sub speech recognition system 1000_A; the conversion and expansion of those hotwords are performed by the processing module of the sub speech recognition system 1000_A, and the resulting expanded custom hotword phoneme sequence is stored in the sub speech recognition system 1000_A.
S2008: the voice recognition system 1000 receives audio data uploaded by the terminal device.
As shown in fig. 5, there are controls for uploading in the interactive interface 4000. Specifically, interactive interface 4000 includes at least audio input control 4002 and determination control 4006.
The terminal device side user can input audio data through the audio input control 4002. Specifically, the user can click the start recording button and the end recording button in the audio input control 4002 to complete the input of a piece of audio data. By clicking on the determination control 4006, the uploading of audio data can be completed. The encoded audio data is then decoded by calling the identification interface. The obtained word lattice containing the decoding result contains N-best paths, and phoneme sequences corresponding to the decoding paths are obtained and recorded.
In some implementations, as shown in FIG. 2A, the audio data uploaded by the client on the terminal device side is converted into a phoneme sequence after entering the speech recognition system 1000, and the phoneme sequence is stored in the speech recognition system 1000.
In some implementations, as shown in FIG. 2B, the audio data uploaded by the client on the terminal device side is converted into a phoneme sequence after entering the sub speech recognition system 1000_B, and the phoneme sequence is stored in the sub speech recognition system 1000_B.
It should be noted that step S2006 does not necessarily have to be executed before step S2008. Specifically, the expansion of the custom hotword phoneme sequence according to the expansion rule may be performed before the conversion of the audio data into a phoneme sequence; optionally, the two operations may also be performed at the same time.
S2010: the speech recognition system 1000 calculates the similarity between phonemes. Specifically, the similarity between the phoneme sequence corresponding to the audio data and the phoneme sequence corresponding to the expanded self-defined hotword is calculated.
In some implementations, the phoneme sequence corresponding to the audio data and the phoneme sequence corresponding to the expanded custom hotword may be regarded as text sequences, and the difference between the two sequences may be calculated. Specifically, comparing the phoneme sequence corresponding to the audio data with the phoneme sequence corresponding to the expanded self-defined hotword, and counting the number of the deleted, replaced and added characters. According to the number of the character changes, based on a certain calculation rule, the similarity between the two phoneme sequences can be obtained.
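For illustration only, the snippet below treats both phoneme sequences as token sequences and derives a similarity score from their overlap. The standard-library difflib matcher is used here merely as a stand-in for the character-counting calculation rule described above, and the sequences are hypothetical.

```python
# Illustrative sketch of the text-sequence variant of S2010 with hypothetical
# phoneme sequences; difflib stands in for the edit-based calculation rule.
from difflib import SequenceMatcher

hotword = ["zh", "ang1", "s", "an1"]    # expanded custom-hotword phonemes (hypothetical)
window  = ["zh", "ang1", "sh", "an1"]   # phonemes decoded from the audio (hypothetical)

similarity = SequenceMatcher(None, hotword, window).ratio()  # 0.75 for these sequences
print(similarity >= 0.7)  # compare against the configured similarity parameter
```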
In some implementations, models can be built in units of phonemes, with each phoneme fitted to one model. Specifically, the model may be a GMM feature distribution model or a deep neural network model. The similarity between the models fitted to the phoneme sequence corresponding to the audio data and the models fitted to the expanded custom hotword phoneme sequence is then calculated. First, a model is built for each phoneme of the phoneme sequence corresponding to the audio data obtained in step S2008 and of the expanded custom hotword phoneme sequence obtained in step S2006. Specifically, with a GMM feature distribution model, each phoneme is mapped to a feature distribution curve; with a deep neural network model, each phoneme is mapped to a matrix. A deep neural network is a technique in the field of machine learning (ML) in which a model is trained by setting network parameters and input and output quantities; in the embodiment of the present application, a deep neural network model can be trained that takes phonemes as input and matrices as output.
Second, when the GMM feature distribution model is used, after the feature distribution models for the phoneme sequence corresponding to the audio data and for the custom hotword phoneme sequence have been fitted, the expanded custom hotword phonemes and the phoneme sequence converted from the speech recognition result are paired and compared in a sliding-matching manner. The length of the expanded custom hotword phoneme sequence is taken as the length of the sliding window, and the window slides one phoneme at a time from the first phoneme to the last phoneme of the phoneme sequence corresponding to the audio data. The calculation within each sliding window is the relative entropy (KL divergence) between corresponding phonemes, an asymmetric measure of the difference between two probability distributions. Here, "corresponding" means that the first phoneme of the expanded custom hotword corresponds to the first phoneme in the sliding window, the second to the second, and so on. After the KL divergences between the corresponding feature distribution functions (GMMs) have been obtained, the average KL divergence over all phoneme pairs in the current window is computed and taken as the similarity, within the current window, between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data.
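A simplified sketch of this sliding-window calculation is shown below. It assumes each phoneme is represented by a single univariate Gaussian rather than a full GMM, which keeps the KL divergence in closed form; all distribution parameters and phoneme labels are hypothetical, and a smaller average divergence indicates a closer match.

```python
# Illustrative sketch of the GMM-based variant of S2010 under simplifying
# assumptions: one univariate Gaussian per phoneme, hotword length as window
# length, a step of one phoneme, and the average pairwise KL divergence per window.
import math

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# hypothetical per-phoneme (mean, std) parameters
phoneme_model = {"s": (0.0, 1.0), "c": (0.2, 1.1), "p": (2.5, 0.9), "i1": (1.0, 1.0)}

def window_scores(hotword, audio, model):
    """Average pairwise KL divergence for every sliding-window position."""
    w = len(hotword)
    scores = []
    for start in range(len(audio) - w + 1):
        pairs = zip(hotword, audio[start:start + w])
        kls = [kl_gauss(*model[a], *model[b]) for a, b in pairs]
        scores.append(sum(kls) / w)
    return scores

print(window_scores(["s", "i1"], ["p", "i1", "s", "i1"], phoneme_model))
# the window starting at index 2 gives the smallest average divergence
```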
With a deep neural network model, the similarity between two phonemes can be measured by calculating the hyperspace distance between the matrices corresponding to the two phonemes. The calculation likewise proceeds by sliding matching, with the length of the expanded custom hotword phoneme sequence as the sliding window length and one phoneme as the sliding step. The calculation within each sliding window, from the first phoneme to the last phoneme of the phoneme sequence corresponding to the audio data, is the hyperspace distance between corresponding phonemes; the average of the hyperspace distances of all phoneme pairs in the current window is computed and taken as the similarity, within the current window, between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data.
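Under the same sliding-window procedure, the per-pair distance for the neural-network variant could look like the following sketch, where each phoneme is assumed to map to a fixed-size embedding and the "hyperspace distance" is taken to be Euclidean; the embeddings are hypothetical.

```python
# Illustrative sketch: Euclidean ("hyperspace") distance between hypothetical
# phoneme embeddings, used as the per-pair score in the sliding-window matching.
import math

def embedding_distance(a, b):
    """Euclidean distance between two phoneme embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

emb = {"s": [0.1, 0.9, 0.3], "c": [0.2, 0.8, 0.3], "p": [0.9, 0.1, 0.7]}  # hypothetical
print(embedding_distance(emb["s"], emb["c"]))  # small: similar pronunciation
print(embedding_distance(emb["s"], emb["p"]))  # large: different pronunciation
```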
Next, the entities that perform the above two implementations are described.
In some implementations, as shown in FIG. 2A, the speech recognition system 1000 performs the operations described above of converting the phoneme sequences into text sequences or fitting each phoneme to a model, as well as the calculation of the similarity between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data. The text sequences or phoneme models obtained from the phoneme sequences are stored in the speech recognition system 1000, as is the similarity calculation result.
In some implementations, as shown in FIG. 2B, the sub speech recognition system 1000_B performs the operations described above of converting the phoneme sequences into text sequences or fitting each phoneme to a model, as well as the calculation of the similarity between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data. The text sequences or phoneme models obtained from the phoneme sequences are stored in the sub speech recognition system 1000_B, as is the similarity calculation result. The expanded custom hotword phoneme sequence here includes the expanded private custom hotword phoneme sequence stored in the sub speech recognition system 1000_B and, optionally, the expanded custom hotword phoneme sequence stored in the sub speech recognition system 1000_A.
S2012: the speech recognition system 1000 matches the custom hotword according to the similarity, determines the speech recognition result, and returns it.
The decoding word lattice is updated by comparing the similarity parameter corresponding to the similarity configuration set by the user on the service device side in step S2002 with the similarity calculated in step S2010.
Specifically, if the similarity meets the requirement of the similarity configuration, the corresponding custom hotword is added to the decoding word lattice as a new path; if the similarity does not meet the requirement of the similarity configuration, the existing decoding word lattice is not modified.
Further, the modification consists of adding the custom hotword as a new path whose language model value is the original language model value multiplied by a weight, where the weight equals the similarity. Finally, one or more decoding paths are determined according to the speech recognition result required by the speech recognition system 1000, the decoding word lattice on those paths is decoded, and the speech recognition result is returned to the terminal device through the interaction module 1002 in FIG. 2A or the interaction module 1002_B in FIG. 2B. When the speech recognition result is returned, it may be returned only in encoded form, so that it can conveniently take part in the next step of data processing, or it may be presented in a displayed manner, for example in the display control 4008 of FIG. 5.
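The following sketch illustrates this update under strong simplifying assumptions: the word lattice is reduced to a flat list of scored paths, and the hotword, scores, and threshold are all hypothetical values chosen only to show the weighting step.

```python
# Illustrative sketch of S2012 under simplifying assumptions: the decoding word
# lattice is reduced to a list of candidate paths with language-model scores.
# A hotword that meets the configured similarity is added as a new path whose
# language-model value is the original value scaled by the similarity (weight).

def update_lattice(lattice, hotword, similarity, threshold, base_lm_score):
    if similarity >= threshold:  # similarity meets the similarity configuration
        lattice.append({"text": hotword,
                        "lm_score": base_lm_score * similarity})  # weight == similarity
    return lattice

lattice = [{"text": "si ji", "lm_score": 0.42}]          # hypothetical decoded path
update_lattice(lattice, "Shi Ji", similarity=0.83, threshold=0.7, base_lm_score=0.42)
print(lattice)  # now contains a second, similarity-weighted path for the hotword
```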
Next, two architectures of the speech recognition system 1000 in the present application will be described.
Fig. 8 shows an architecture of a speech recognition system 1000. Wherein, the modules in the speech recognition system 1000 are all deployed on a server. In particular, the speech recognition system 1000 includes an interaction module 1002, a storage module 1004, and a processing module 1006.
The interaction module 1002 is configured to receive the similarity configuration and the custom hotword list uploaded by the user in S2002. Optionally, the interaction module 1002 is further configured to receive the training corpus to be customized uploaded in S2002. The service device cluster 1100 may establish communication with the interaction module 1002 through an Application Programming Interface (API) or a client as shown in FIG. 4.
The interaction module 1002 is further configured to receive the audio data uploaded in S2008, and to return the speech recognition result in S2012 in one or more of the following ways: returning the speech recognition result in encoded form, or displaying it.
The storage module 1004 is configured to store the information received by the interaction module 1002, including the similarity configuration, the custom hotword list, and the audio data, and optionally the training corpus to be customized. The storage module 1004 is also used to store the expansion rules obtained in S2004 and the expanded custom hotword phoneme sequences obtained in S2006. In S2008, the phoneme sequence corresponding to the audio data is stored in the storage module 1004. In S2010, the text sequence or phoneme model obtained from the phoneme sequence is stored in the storage module 1004.
In some implementations, the storage module 1004 may contain multiple sub-storage modules. Specifically, when one service device corresponds to one or more types of terminal devices, the service device and its corresponding terminal device cluster share one sub-storage module. For example, in the example of FIG. 1, the service device 1 and its corresponding terminal devices 1_1 to 1_N may share one sub-storage module, and so on. Under this architecture, each service device can correspond to multiple types of terminal devices, and the sub-storage modules can be reused across different types of terminal devices.
The processing module 1006 is used to build and train the clustering algorithm model in S2004 and to obtain the expansion rules. The processing module 1006 is further configured to perform the phoneme expansion of the custom hotword according to the expansion rules in S2006, and to convert the audio data into a phoneme sequence in S2008. The processing module 1006 also performs the operation of converting the phoneme sequence into a text sequence or fitting each phoneme to a model described in S2010, as well as the calculation of the similarity between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data. The processing module 1006 is further configured to update the decoded word lattice according to the similarity configuration and determine the speech recognition result in S2012.
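As a concrete illustration of how the expansion rules obtained in S2004 can be read off a trained binary decision tree, the following sketch assumes the leaf-distance criterion described in the embodiments: each phoneme is assigned to a leaf node, the distance between two phonemes is the number of branch nodes on the path between their leaf nodes, and a phoneme whose distance to another phoneme does not exceed a threshold is written into the expansion rule as an expansion phoneme. The leaf addressing scheme, the example phonemes, and the threshold value below are hypothetical.

from itertools import combinations

def tree_distance(path_a, path_b):
    # Leaves are addressed by their root-to-leaf branch decisions, e.g. (0, 1, 1).
    # The distance is taken as the number of branch nodes on the path connecting
    # the two leaves; sibling leaves share one branch node, so their distance is 1.
    if path_a == path_b:
        return 0
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common - 1

def build_expansion_rule(leaf_of, distance_threshold=2):
    # leaf_of maps each phoneme to the address of the leaf node it was divided into
    # by the trained binary decision tree. Phonemes whose leaf distance does not
    # exceed the threshold are recorded as expansion phonemes of each other.
    rule = {p: [] for p in leaf_of}
    for p1, p2 in combinations(leaf_of, 2):
        if tree_distance(leaf_of[p1], leaf_of[p2]) <= distance_threshold:
            rule[p1].append(p2)
            rule[p2].append(p1)
    return rule

# Hypothetical leaf assignments for a handful of Mandarin initials/finals:
leaves = {"zh": (0, 0, 0), "ch": (0, 0, 1), "sh": (0, 1, 0), "ang": (1, 0, 0)}
print(build_expansion_rule(leaves, distance_threshold=2))
# "ch" becomes an expansion phoneme of "zh" because their leaves are siblings.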
Fig. 9 shows yet another architecture of the speech recognition system 1000, in which the modules are partially deployed on a server and partially deployed on a terminal device. Specifically, the speech recognition system 1000 includes a sub-speech recognition system 1000_A and a sub-speech recognition system 1000_B. The sub-speech recognition system 1000_A includes an interaction module 1002_A, a storage module 1004_A, and a processing module 1006_A. The sub-speech recognition system 1000_B includes an interaction module 1002_B, a storage module 1004_B, and a processing module 1006_B.
The interaction module 1002 in FIG. 8 may include the interaction module 1002_A and the interaction module 1002_B. The storage module 1004 may include the storage module 1004_A and the storage module 1004_B. The processing module 1006 may include the processing module 1006_A and the processing module 1006_B.
The interaction module 1002_A is configured to receive the similarity configuration uploaded by the user in S2002. Optionally, the interaction module 1002_A is further configured to receive the training corpus to be customized and the custom hotword list uploaded by the user in S2002.
The interaction module 1002_B is configured to receive the private custom hotword list uploaded by the user in S2002 and the audio data uploaded in S2008. The interaction module 1002_B is further configured to return the speech recognition result in S2012 in one or more of the following ways: returning the speech recognition result in encoded form, or displaying it.
The storage module 1004_A is used to store the similarity configuration received by the interaction module 1002_A, and optionally also the training corpus to be customized and the custom hotword list. The storage module 1004_A is further used to store the expansion rules obtained in S2004 and the expanded custom hotword phoneme sequences obtained in S2006. In S2010, the text sequence or phoneme model obtained from the expanded custom hotword phoneme sequence is stored in the storage module 1004_A.
The storage module 1004_B is configured to store the private custom hotword list and the audio data received by the interaction module 1002_B, as well as the private custom hotword phoneme sequence and the audio data phoneme sequence. It is also used to store the expanded private custom hotword phoneme sequence obtained in S2006. In S2008, the phoneme sequence corresponding to the audio data is stored in the storage module 1004_B. In S2010, the text sequences or phoneme models obtained from the private custom hotword phoneme sequence and the audio data phoneme sequence are stored in the storage module 1004_B.
The processing module 1006_A is used to build and train the clustering algorithm model in S2004 and to obtain the expansion rules. The processing module 1006_A is further configured to perform the phoneme expansion of the custom hotword according to the expansion rules in S2006, and to perform the operation of converting the expanded custom hotword phoneme sequence into a text sequence or fitting each phoneme to a model described in S2010.
The processing module 1006_B is configured to convert the audio data into a phoneme sequence in S2008. The processing module 1006_B is further configured to perform the operation of converting the private custom hotword phoneme sequence and the audio data phoneme sequence into text sequences or fitting each phoneme to a model described in S2010, as well as the calculation of the similarity between the expanded custom hotword phoneme sequence and the phoneme sequence corresponding to the audio data, where the custom hotword phoneme sequence includes the custom hotword phoneme sequence stored in the storage module 1004_A and the private custom hotword phoneme sequence stored in the storage module 1004_B. The processing module 1006_B is further configured to update the decoded word lattice according to the similarity configuration and determine the speech recognition result in S2012.
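For orientation, the following toy sketch shows one possible division of labor between the server-side sub-system 1000_A and the device-side sub-system 1000_B: the server derives the expansion rule and expands the shared custom hotwords (S2004 and S2006), while the device converts audio into phonemes, scores both the shared and the private hotwords against them, and applies the similarity threshold (S2008 to S2012). The grapheme-to-phoneme step, the expansion rule, and the similarity measure here are deliberately simplistic placeholders rather than the algorithms of the embodiments.

class ServerSide:                               # sub-speech recognition system 1000_A
    def __init__(self, expansion_rule, hotwords, g2p):
        self.expansion_rule = expansion_rule    # obtained in S2004
        # S2006: convert each shared hotword to phonemes and expand every phoneme
        self.expanded_hotwords = {
            w: [expansion_rule.get(p, [p]) for p in g2p(w)] for w in hotwords
        }

class DeviceSide:                               # sub-speech recognition system 1000_B
    def __init__(self, server, private_hotwords, g2p, threshold):
        self.server = server
        self.threshold = threshold              # from the similarity configuration
        self.private_hotwords = {w: [[p] for p in g2p(w)] for w in private_hotwords}

    def recognize(self, audio_phonemes):
        # S2010: compare shared and private hotword phonemes with the audio phonemes
        candidates = {**self.server.expanded_hotwords, **self.private_hotwords}
        best, best_sim = None, 0.0
        for word, slots in candidates.items():
            matched = sum(1 for slot, p in zip(slots, audio_phonemes) if p in slot)
            sim = matched / max(len(slots), len(audio_phonemes), 1)
            if sim > best_sim:
                best, best_sim = word, sim
        # S2012: accept the hotword only if the similarity meets the configuration
        return (best, best_sim) if best_sim >= self.threshold else (None, best_sim)

g2p = lambda w: list(w)                          # placeholder grapheme-to-phoneme
server = ServerSide({"s": ["s", "sh"]}, ["shell"], g2p)
device = DeviceSide(server, ["sheep"], g2p, threshold=0.6)
print(device.recognize(list("shell")))           # -> ('shell', 1.0)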
Fig. 10 provides a schematic diagram of a computing device 6000. As shown in fig. 10, computing device 6000 includes: a bus 6002, a processor 6004, and a memory 6006. The processor 6004 and memory 6006 communicate over a bus 6002. Computing device 6000 may be a server or a terminal device.
The bus 6002 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The processor 6004 may be any one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Microprocessor (MP), a Digital Signal Processor (DSP), and the like.
The memory 6006 may include a volatile memory, such as a Random Access Memory (RAM). The memory 6006 may also include a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD). The memory 6006 stores executable program code that the processor 6004 executes to implement the functionality of the speech recognition system 1000 described above, or to perform the speech recognition method described in the foregoing embodiments.
The embodiment of the application also provides a computing device cluster. The computing device cluster includes at least one computing device 6000. The computing devices 6000 included in the computing device cluster may all be servers, or some of them may be servers and others may be terminal devices.
Where the computing devices 6000 included in the computing device cluster are servers, each computing device 6000 runs one or more modules of the speech recognition system 1000 (as shown in FIG. 8). Where some of the computing devices 6000 in the cluster are servers and others are terminal devices, the computing devices 6000 that are servers run one or more modules of the sub-speech recognition system 1000_A, and the computing devices 6000 that are terminal devices run one or more modules of the sub-speech recognition system 1000_B (as shown in FIG. 9).
The embodiment of the application also provides a computer-readable storage medium. The computer-readable storage medium may be any available medium accessible to a computing device, or a data storage device, such as a data center, containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others. The computer-readable storage medium contains instructions that instruct a computing device to perform the speech recognition method described above as applied to the speech recognition system 1000.
The embodiment of the application also provides a computer program product containing instructions. The computer program product may be a software or program product that contains instructions, can run on a computing device, or is stored in any available medium. When the computer program product runs on at least one computing device, it causes the at least one computing device to perform the speech recognition method as applied to the speech recognition system 1000 described above.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A speech recognition method, wherein the method is applied to a speech recognition system, and wherein the method comprises:
receiving an uploaded custom hotword and similarity configuration;
converting the custom hotword into a custom hotword phoneme sequence, wherein the custom hotword phoneme sequence is a first phoneme sequence;
expanding the first phoneme sequence according to an expansion rule obtained based on a clustering algorithm model to obtain an expanded first phoneme sequence;
receiving the uploaded audio data;
converting the audio data into an audio data phoneme sequence, wherein the audio data phoneme sequence is a second phoneme sequence;
calculating the similarity between the expanded first phoneme sequence and the second phoneme sequence;
and determining a speech recognition result of the audio data according to the similarity configuration and the similarity.
2. The method of claim 1, wherein the clustering algorithm model comprises a binary decision tree.
3. The method of claim 2, wherein the method further comprises:
inputting a phoneme set into the binary decision tree, wherein the phoneme set comprises a plurality of phonemes;
assigning each phoneme in the phoneme set to a leaf node of the binary decision tree;
determining distances between the phonemes in the phoneme set, wherein the distance between two phonemes in the phoneme set indicates the distance, within the binary decision tree, between the leaf nodes to which the two phonemes are assigned;
and obtaining the expansion rule according to the distances between the phonemes in the phoneme set, wherein the expansion rule comprises the correspondence between the phonemes in the phoneme set and the expansion phonemes of those phonemes.
4. The method of claim 3, wherein obtaining the expansion rule based on the distances between the phonemes in the phoneme set comprises:
obtaining the distance between a first phoneme and a second phoneme according to the number of branch nodes contained in the path from the leaf node where the first phoneme is located to the leaf node where the second phoneme is located;
and if the distance between the first phoneme and the second phoneme does not exceed a distance threshold, writing the second phoneme into the expansion rule as an expansion phoneme of the first phoneme.
5. The method of any of claims 1 through 4, wherein calculating the similarity between the expanded first phoneme sequence and the second phoneme sequence comprises:
fitting the phonemes in the expanded first phoneme sequence into a model;
fitting phonemes in the second phoneme sequence to a model;
calculating a model distance between a model fitted by phonemes in the expanded first phoneme sequence and a model fitted by phonemes in the second phoneme sequence;
and according to the model distance, obtaining the similarity between the phonemes in the expanded first phoneme sequence and the phonemes in the second phoneme sequence.
6. A speech recognition system, the system comprising an interaction module, a processing module, and a storage module:
the interaction module is configured to receive an uploaded custom hotword, similarity configuration, and audio data;
the storage module is configured to store the custom hotword, the similarity configuration, and the audio data;
the processing module is configured to expand a first phoneme sequence according to an expansion rule obtained based on a clustering algorithm model to obtain an expanded first phoneme sequence; convert the audio data into an audio data phoneme sequence, wherein the audio data phoneme sequence is a second phoneme sequence; calculate the similarity between the expanded first phoneme sequence and the second phoneme sequence; and determine a speech recognition result of the audio data according to the similarity configuration and the similarity;
and the interaction module is further configured to return the speech recognition result of the audio data.
7. The system of claim 6, wherein the clustering algorithm model comprises a binary decision tree.
8. The system of claim 7, wherein the processing module is further configured to:
input a phoneme set into the binary decision tree, wherein the phoneme set comprises a plurality of phonemes;
assign each phoneme in the phoneme set to a leaf node of the binary decision tree;
determine distances between the phonemes in the phoneme set, wherein the distance between two phonemes in the phoneme set indicates the distance, within the binary decision tree, between the leaf nodes to which the two phonemes are assigned;
and obtain the expansion rule according to the distances between the phonemes in the phoneme set, wherein the expansion rule comprises the correspondence between the phonemes in the phoneme set and the expansion phonemes of those phonemes.
9. The system of claim 7 or 8, wherein the processing module is further configured to:
obtain the distance between the first phoneme and the second phoneme according to the number of branch nodes contained in the path from the leaf node where the first phoneme is located to the leaf node where the second phoneme is located;
and if the distance between the first phoneme and the second phoneme does not exceed a distance threshold, write the second phoneme into the expansion rule as an expansion phoneme of the first phoneme.
10. A cluster of computing devices comprising at least one computing device, each computing device comprising a processor and a memory;
the processor of the at least one computing device is to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any of claims 1-5.
11. A computer-readable storage medium comprising computer program instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1 to 5.
12. A computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any one of claims 1-5.
CN202011219334.XA 2020-11-04 2020-11-04 Voice recognition method, system, computing device and storage medium Pending CN114528812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011219334.XA CN114528812A (en) 2020-11-04 2020-11-04 Voice recognition method, system, computing device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011219334.XA CN114528812A (en) 2020-11-04 2020-11-04 Voice recognition method, system, computing device and storage medium

Publications (1)

Publication Number Publication Date
CN114528812A (en) 2022-05-24

Family

ID=81619422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011219334.XA Pending CN114528812A (en) 2020-11-04 2020-11-04 Voice recognition method, system, computing device and storage medium

Country Status (1)

Country Link
CN (1) CN114528812A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512697A (en) * 2022-09-30 2022-12-23 贵州小爱机器人科技有限公司 Method and device for recognizing voice sensitive words, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
US11664020B2 (en) Speech recognition method and apparatus
CN108831439B (en) Voice recognition method, device, equipment and system
CN108447486B (en) Voice translation method and device
US11289069B2 (en) Statistical parameter model establishing method, speech synthesis method, server and storage medium
JP5768093B2 (en) Speech processing system
US9558741B2 (en) Systems and methods for speech recognition
CN108305643B (en) Method and device for determining emotion information
CN111402895B (en) Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN108899013B (en) Voice search method and device and voice recognition system
CN110998716A (en) Domain adaptation in speech recognition via teacher-student learning
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
JP2015075706A (en) Error correction model learning device and program
WO2018192186A1 (en) Speech recognition method and apparatus
JP7418991B2 (en) Speech recognition method and device
KR20210001937A (en) The device for recognizing the user's speech input and the method for operating the same
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
KR20160098910A (en) Expansion method of speech recognition database and apparatus thereof
US11615787B2 (en) Dialogue system and method of controlling the same
CN114528812A (en) Voice recognition method, system, computing device and storage medium
CN112133285A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
Coto‐Solano Computational sociophonetics using automatic speech recognition
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN115424616A (en) Audio data screening method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination