CN111739514B - Voice recognition method, device, equipment and medium - Google Patents


Info

Publication number
CN111739514B
CN111739514B (application CN201910710043.1A)
Authority
CN
China
Prior art keywords
pinyin
data
sequence
standard
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910710043.1A
Other languages
Chinese (zh)
Other versions
CN111739514A (en)
Inventor
马浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910710043.1A
Publication of CN111739514A
Application granted
Publication of CN111739514B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/26 — Speech to text systems

Abstract

The embodiment of the invention discloses a voice recognition method, device, equipment and medium. The method includes: acquiring voice data to be recognized and determining the original pinyin data corresponding to the voice data to be recognized; correcting the original pinyin data to obtain pinyin data to be matched; and matching the pinyin data to be matched against a pre-constructed standard pinyin sequence, then determining the text data corresponding to the voice data to be recognized according to the matching result. In the voice recognition method provided by the embodiment of the invention, the original pinyin data is corrected and recognition is performed on the corrected pinyin data, so the accuracy of voice recognition is improved, which in turn improves the response accuracy of intelligent voice customer service.

Description

Voice recognition method, device, equipment and medium
Technical Field
Embodiments of the invention relate to the field of information processing, and in particular to a voice recognition method, apparatus, device, and medium.
Background
With the continuous development of network technology, voice recognition is being applied ever more widely. For example, in the response scenario of intelligent voice customer service, a voice robot can resolve user problems through spoken question-and-answer interaction.
Implementing a response from intelligent voice customer service involves the following steps: convert the user's spoken input into text, identify the user's intent from the speech-to-text result, retrieve a response text based on that intent, and then convert the response text into speech for playback. The prevailing approach to speech-to-text conversion is: collect voice samples, label the features in those samples, train a model with a deep-learning algorithm (such as a recurrent neural network or a convolutional neural network), and then perform real-time voice recognition with the trained model to convert speech into text.
In the process of implementing the present invention, the inventor found at least the following technical problems in the prior art. Training on a general speech corpus yields relatively fixed recognition results; however, users' accents, Chinese expression habits, background noise, and volume changes during dictation cause errors such as misrecognized near-homophones and dropped words, so the speech-to-text result is wrong. Intent recognition based on that faulty result then diverges from the user's actual intent, making the response inaccurate. Moreover, user expressions are highly varied, so training a model that suits all users is difficult.
Disclosure of Invention
Embodiments of the invention provide a voice recognition method, device, equipment and medium that improve the accuracy of voice recognition and thereby the response accuracy of intelligent voice customer service.
In a first aspect, an embodiment of the present invention provides a method for voice recognition, including:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
In a second aspect, an embodiment of the present invention further provides a voice recognition apparatus, including:
the pinyin data acquisition module is used for acquiring the voice data to be recognized and determining the original pinyin data corresponding to the voice data to be recognized;
the pinyin data calibration module is used for correcting the original pinyin data to obtain pinyin data to be matched;
and the text data determining module is used for matching the pinyin data to be matched with the pre-constructed standard pinyin sequence and determining text data corresponding to the voice data to be recognized according to the matching result.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, the apparatus including:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method as provided by any embodiment of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method as provided by any of the embodiments of the present invention.
According to the embodiments of the invention, voice data to be recognized is acquired and its corresponding original pinyin data is determined; the original pinyin data is corrected to obtain pinyin data to be matched; and the pinyin data to be matched is matched against a pre-constructed standard pinyin sequence, with the text data corresponding to the voice data to be recognized determined from the matching result. Because the original pinyin data is corrected and recognition is performed on the corrected pinyin data, the accuracy of voice recognition is improved, which in turn improves the response accuracy of intelligent voice customer service.
Drawings
FIG. 1 is a flow chart of a method for speech recognition according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a voice recognition method according to a second embodiment of the present invention;
FIG. 3a is a flowchart of a speech recognition method according to a third embodiment of the present invention;
fig. 3b is a schematic structural diagram of an intelligent customer service system according to a third embodiment of the present invention;
fig. 3c is a schematic flow chart of an intelligent customer service response method according to a third embodiment of the present invention;
FIG. 3d is a schematic diagram of an undirected search graph in a speech recognition method according to a third embodiment of the present invention;
fig. 3e is a schematic diagram of a bi-directional matching method in a voice recognition method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice recognition device according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. This embodiment is applicable to recognizing voice data, and is especially suitable for voice responses by intelligent voice customer service. The method may be performed by a speech recognition device, which may be implemented in software and/or hardware; for example, the device may be configured in a computer apparatus. As shown in Fig. 1, the method includes:
s110, acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized.
In this embodiment, the voice data to be recognized may be question information that the user inputs by voice. To make the recognition result more accurate, this embodiment corrects the initial recognition result at the pinyin level and determines the final recognition result from the corrected pinyin data.
Optionally, after the question information (the voice data to be recognized) is obtained in voice form, it can be fed into an existing speech recognition model that converts voice data into text, and the text output by that model is then converted into the original pinyin data by a pinyin conversion tool. Alternatively, a pinyin data recognition model that converts the voice data to be recognized directly into pinyin data can be trained; after the question information in voice form is obtained, the voice data to be recognized is input into the trained pinyin data recognition model to obtain the corresponding original pinyin data.
In one embodiment, when the voice data to be recognized is first converted into text and then into pinyin, the text can be generalized before the pinyin conversion: entity words in the text are replaced with generalized slot labels, and the generalized text is converted into pinyin to obtain the original pinyin data corresponding to the voice data to be recognized. For example, if the text corresponding to the voice data to be recognized is "when will the mobile phone I bought arrive", the entity word "mobile phone" is generalized into "PRODSORT", giving the generalized text "when will the PRODSORT I bought arrive", which is then converted into pinyin to obtain the original pinyin data "wo mai de PRODSORT shen me shi hou dao".
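For illustration, the generalize-then-convert step described above can be sketched as follows. The entity lexicon and the character-to-pinyin table are tiny illustrative stand-ins for the real resources; a production system would use a full pinyin conversion tool.

```python
# Illustrative sketch of generalization followed by pinyin conversion.
# ENTITY_LEXICON and CHAR_TO_PINYIN are stand-ins, not the patent's resources.

ENTITY_LEXICON = {"手机": "PRODSORT"}  # entity word -> generalized slot label

# Minimal character-to-pinyin table covering only this example.
CHAR_TO_PINYIN = {
    "我": "wo", "买": "mai", "的": "de", "什": "shen", "么": "me",
    "时": "shi", "候": "hou", "到": "dao",
}

def generalize(text: str) -> str:
    """Replace entity words with their generalized slot labels."""
    for entity, label in ENTITY_LEXICON.items():
        text = text.replace(entity, label)
    return text

def to_pinyin(text: str) -> str:
    """Convert generalized text to space-separated pinyin; slot labels pass through."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isascii():
            # Slot labels are ASCII runs; copy them through unchanged.
            j = i
            while j < len(text) and text[j].isascii():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(CHAR_TO_PINYIN[text[i]])
            i += 1
    return " ".join(tokens)

generalized = generalize("我买的手机什么时候到")
print(generalized)             # 我买的PRODSORT什么时候到
print(to_pinyin(generalized))  # wo mai de PRODSORT shen me shi hou dao
```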
In one embodiment, if the voice data to be recognized is converted into original pinyin data by a trained pinyin data recognition model, sample voice data and the original pinyin data corresponding to it may be obtained in advance, pinyin data recognition sample pairs formed from them, and the pre-constructed model trained with these sample pairs to obtain the trained pinyin data recognition model.
S120, correcting the original pinyin data to obtain pinyin data to be matched.
Considering that the same pinyin data may represent different text data, this embodiment corrects the initial recognition result at the pinyin level in order to simplify the correction process. Optionally, correcting the original pinyin data may mean correcting incorrect pinyin in the original pinyin data to standard pinyin. High-frequency, error-prone near-homophone pinyin can be compiled manually in advance, the mapping between each incorrect pinyin and its standard pinyin organized into a pinyin near-sound table, and the original pinyin data corrected against this preset table.
In one embodiment of the present invention, correcting the original pinyin data to obtain the pinyin data to be matched includes: determining, according to a preset pinyin near-sound table, the incorrect pinyin contained in the original pinyin data as the pinyin to be corrected, wherein the near-sound table stores at least one correspondence between an incorrect pinyin and a standard pinyin; and correcting each pinyin to be corrected in the original pinyin data to its corresponding standard pinyin, obtaining the pinyin data to be matched.
Optionally, the original pinyin data is traversed to find the pinyin to be corrected, i.e. syllables identical to an incorrect pinyin in the preset near-sound table; the corresponding standard pinyin is looked up in the table, and each pinyin to be corrected is replaced with its standard pinyin. For example, if the original pinyin data is "wo de PRODSORT dao la le" and the preset near-sound table shows that the incorrect pinyin "la" corresponds to the standard pinyin "na", then "la" is taken as the pinyin to be corrected and replaced with "na", giving the pinyin data to be matched "wo de PRODSORT dao na le".
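The traversal just described reduces to a per-syllable lookup; a minimal sketch follows, with an illustrative one-entry near-sound table.

```python
# Minimal sketch of correcting original pinyin data with a near-sound table.
# The table contents are illustrative stand-ins.

PINYIN_NEAR_SOUND = {"la": "na"}  # incorrect pinyin -> standard pinyin

def correct(original: str) -> str:
    """Replace every syllable listed in the near-sound table with its standard form."""
    return " ".join(PINYIN_NEAR_SOUND.get(tok, tok) for tok in original.split())

print(correct("wo de PRODSORT dao la le"))  # wo de PRODSORT dao na le
```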
S130, matching the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to the matching result.
In this embodiment, the standard pinyin sequence is used to judge whether the correction of the pinyin data was accurate. Optionally, high-frequency error-prone sentences containing the error-prone near-homophone pinyin can be compiled manually in advance; the standard description of each such sentence is generalized and converted into pinyin format to obtain its standard pinyin data, and a standard pinyin sequence composed of pinyin nodes is constructed from the standard pinyin data of all the high-frequency error-prone sentences.
In one embodiment, the pinyin data to be matched is matched against the pre-constructed standard pinyin sequences. If a target standard pinyin sequence matching the pinyin data to be matched is found, the correction of the original pinyin data was accurate, and the text data corresponding to the target standard pinyin sequence is used as the text data corresponding to the voice data to be recognized. If no matching target standard pinyin sequence is found, the correction was inaccurate, and the text data corresponding to the original pinyin data is used instead.
In this embodiment, voice data to be recognized is acquired and its corresponding original pinyin data determined; the original pinyin data is corrected to obtain pinyin data to be matched; and the pinyin data to be matched is matched against a pre-constructed standard pinyin sequence, with the text data corresponding to the voice data to be recognized determined from the matching result. Correcting the original pinyin data and recognizing on the corrected pinyin data improves the accuracy of voice recognition and thereby the response accuracy of intelligent voice customer service.
Example two
Fig. 2 is a flowchart of a voice recognition method according to a second embodiment of the present invention. The present embodiment is optimized on the basis of the above embodiment. As shown in fig. 2, the method includes:
s210, acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized.
S220, correcting the original pinyin data to obtain pinyin data to be matched.
S230, determining matching nodes in the pinyin data to be matched, and determining a target standard pinyin sequence matched with the pinyin data to be matched by matching the matching nodes with standard pinyin nodes in the standard pinyin sequence.
In this embodiment, a standard pinyin sequence is a sequence of standard pinyin nodes. To match the pinyin data to be matched against it, the matching nodes in the pinyin data are determined and matched in order against the standard pinyin nodes, yielding the target standard pinyin sequence. Optionally, the first standard pinyin node matching the first matching node in the pinyin data to be matched is determined; the second matching node is then matched against the standard pinyin nodes connected to the first standard pinyin node, yielding the second standard pinyin node; standard pinyin nodes matching all remaining matching nodes are obtained in the same way, and the sequence formed by these standard pinyin nodes is used as the target standard pinyin sequence matching the pinyin data to be matched.
By way of example, the target standard pinyin sequence can be obtained by perfect matching. Suppose the pinyin data to be matched is "wo mai de PRODSORT dao na le"; its matching nodes are "wo", "mai", "de", "PRODSORT", "dao", "na" and "le". Each matching node is matched in turn against the standard pinyin nodes, requiring that the matched standard pinyin node be connected to the one matched by the previous matching node: the standard pinyin node "wo" matches "wo"; "mai", connected to "wo", matches "mai"; "de", connected to "mai", matches "de"; "PRODSORT", connected to "de", matches "PRODSORT"; "dao", connected to "PRODSORT", matches "dao"; "na", connected to "dao", matches "na"; and "le", connected to "na", matches "le". The sequence "wo mai de PRODSORT dao na le" formed by these standard pinyin nodes is taken as the target standard pinyin sequence.
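The perfect-match walk above can be sketched as follows. Storing the standard pinyin sequences as a trie of pinyin nodes is one simple way to realize the node graph; the sequences below are illustrative.

```python
# Sketch of perfect matching against standard pinyin sequences stored as a
# trie of pinyin nodes. The sequences are illustrative examples.

def build_graph(sequences):
    """Build a trie of pinyin nodes; "<end>" marks the end of a standard sequence."""
    root = {}
    for seq in sequences:
        node = root
        for syllable in seq.split():
            node = node.setdefault(syllable, {})
        node["<end>"] = True
    return root

def exact_match(graph, pinyin):
    """Return True iff every matching node follows the previous one in the graph."""
    node = graph
    for syllable in pinyin.split():
        if syllable not in node:
            return False
        node = node[syllable]
    return "<end>" in node

g = build_graph(["wo mai de PRODSORT dao na le", "wo de PRODSORT dao na le"])
print(exact_match(g, "wo mai de PRODSORT dao na le"))  # True
print(exact_match(g, "wo mai de PRODSORT dao la le"))  # False
```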
Considering that users' expression habits, or words dropped while the voice is recorded, may cause the pinyin data to be matched to lose syllables relative to the standard pinyin sequence, this embodiment may use a bidirectional matching algorithm and determine the target standard pinyin sequence by supplementary matching. In one embodiment of the present invention, determining the matching nodes in the pinyin data to be matched and determining the target standard pinyin sequence by matching them against the standard pinyin nodes includes: taking each pinyin in the pinyin data to be matched as a matching node; and matching the matching nodes against the standard pinyin nodes in the standard pinyin sequence with a bidirectional matching algorithm, then obtaining the target standard pinyin sequence from the matching result.
Optionally, the bidirectional matching algorithm may match nodes forward and/or backward; typically, the standard pinyin nodes matching the matching nodes in the pinyin data to be matched are first determined by forward matching. Suppose the matching nodes are a first, a second and a third matching node. Forward matching yields a first standard pinyin node matching the first matching node, and then a second standard pinyin node that matches the second matching node and is connected to the first. If no standard pinyin node matching the third matching node can be reached by continuing backward, all first candidate pinyin nodes connected to the second standard pinyin node in the standard pinyin sequence are collected, and a third standard pinyin node matching the third matching node is searched for. If a fourth standard pinyin node connected to the third standard pinyin node exists among the first candidate pinyin nodes, the sequence formed by the first, second, fourth and third standard pinyin nodes in order is used as the target standard pinyin sequence matching the pinyin data to be matched. Otherwise, the second candidate pinyin nodes connected to the third standard pinyin node in the standard pinyin sequence are obtained, and it is checked whether any first candidate pinyin node is connected to any second candidate pinyin node; if a fifth standard pinyin node among the first candidates is connected to a sixth standard pinyin node among the second candidates, the sequence formed by the first, second, fifth, sixth and third standard pinyin nodes in order is used as the target standard pinyin sequence matching the pinyin data to be matched.
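The supplementary procedure above tracks candidate nodes in the node graph. As a rough illustration only, its core idea (tolerating a bounded number of dropped nodes) can be condensed into an in-order subsequence check; the `max_gap` bound is an assumption for this sketch, not something the patent specifies.

```python
# Simplified sketch of supplementary matching: a standard sequence still
# matches when the pinyin to be matched is an in-order subsequence of it
# with at most `max_gap` skipped nodes. This condenses the node-by-node
# procedure in the text; `max_gap` is an assumed parameter.

def supplementary_match(standard: str, pinyin: str, max_gap: int = 2) -> bool:
    std, toks = standard.split(), pinyin.split()
    i = skipped = 0
    for syllable in std:
        if i < len(toks) and syllable == toks[i]:
            i += 1          # matching node found in order
        else:
            skipped += 1    # a node the speaker dropped
            if skipped > max_gap:
                return False
    return i == len(toks)

# "shen me" was dropped by the speaker; the standard sequence still matches.
print(supplementary_match("wo mai de PRODSORT shen me shi hou dao",
                          "wo mai de PRODSORT shi hou dao"))  # True
```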
In one embodiment of the present invention, matching the matching nodes against the standard pinyin nodes with a bidirectional matching algorithm and obtaining the target standard pinyin sequence from the matching result includes: matching the matching nodes against the standard pinyin nodes with the bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence; determining a weight for each candidate standard pinyin sequence from its sequence heat value and the pinyin heat value of each pinyin in it, where the sequence heat value represents how often the standard pinyin sequence is used and the pinyin heat value represents how often the pinyin is used; and taking the candidate standard pinyin sequence with the largest weight as the target standard pinyin sequence.
In this embodiment, more than one standard pinyin sequence may match the pinyin data to be matched; by computing a weight for each, the sequence with the largest weight is taken as the target standard pinyin sequence. For example, if the matches are candidate standard pinyin sequences 1, 2 and 3 with weights 0.89, 0.65 and 0.78 respectively, candidate standard pinyin sequence 1, which has the largest weight, is taken as the target standard pinyin sequence.
Optionally, after a plurality of candidate standard pinyin sequences matching the pinyin data to be matched are obtained, the weight of each candidate is computed from its sequence heat value and the pinyin heat value of each pinyin in it. The sequence heat value of a candidate may be the number of times it has been used as the target standard pinyin sequence, and the pinyin heat value of a pinyin may be the number of times that pinyin has appeared in a target standard pinyin sequence. Because the pinyin heat value reflects how often a pinyin is used and the sequence heat value how often a standard pinyin sequence is used, weights computed from both values allow the target standard pinyin sequence matching the pinyin data to be matched to be screened out accurately.
In this embodiment, the weight of each pinyin in a candidate standard pinyin sequence may be computed first, and the weight of the candidate sequence computed from those per-pinyin weights. For example, the weight of each pinyin may be computed as F(i) = ((HW + 1) / (H(i) + 1)) × (1 + log10(H(i) + 1)), and the weight of the candidate standard pinyin sequence as W = F(1) × F(2) × … × F(n), where F(i) is the weight of the i-th pinyin in the candidate standard pinyin sequence, HW is the sequence heat value of the candidate standard pinyin sequence, H(i) is the pinyin heat value of the i-th pinyin, W is the weight of the candidate standard pinyin sequence, and n is the total number of pinyins in the sequence.
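The weight formulas can be transcribed directly; the code below is a straightforward transcription of the reconstructed formulas, not the patent's implementation.

```python
# Transcription of the weight formulas:
#   F(i) = ((HW + 1) / (H(i) + 1)) * (1 + log10(H(i) + 1))
#   W    = F(1) * F(2) * ... * F(n)
import math

def pinyin_weight(hw: int, h_i: int) -> float:
    """Weight of one pinyin, given the sequence heat HW and its own heat H(i)."""
    return ((hw + 1) / (h_i + 1)) * (1 + math.log10(h_i + 1))

def sequence_weight(hw: int, heats: list) -> float:
    """Weight of a candidate sequence: product of its per-pinyin weights."""
    w = 1.0
    for h_i in heats:
        w *= pinyin_weight(hw, h_i)
    return w
```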
In this embodiment, after determining the target standard pinyin sequence, the method further includes: and updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence. In order to make the weight calculation of the candidate standard pinyin sequence accurate, after the target standard pinyin sequence is determined, the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence need to be updated. Specifically, the pinyin heat value of each pinyin in the target standard pinyin sequence is increased by 1, and the sequence heat value of the target standard pinyin sequence is increased by 1.
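The heat-value update is a pair of counters incremented after each match; a minimal sketch, with an illustrative target sequence:

```python
# Sketch of the heat-value update: after a target standard pinyin sequence is
# chosen, its sequence heat and each syllable's pinyin heat increase by 1.
from collections import Counter

pinyin_heat = Counter()    # pinyin -> times it appeared in a target sequence
sequence_heat = Counter()  # sequence -> times it was chosen as the target

def update_heat(target_sequence: str) -> None:
    sequence_heat[target_sequence] += 1
    for syllable in target_sequence.split():
        pinyin_heat[syllable] += 1

update_heat("wo mai de PRODSORT dao na le")
print(sequence_heat["wo mai de PRODSORT dao na le"])  # 1
print(pinyin_heat["dao"])                             # 1
```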
S240, taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized.
After the target standard pinyin sequence is determined, its corresponding text data is used as the text data corresponding to the voice data to be recognized. For example, if the target standard pinyin sequence is "wo mai de PRODSORT dao na le", the corresponding text data "where has the PRODSORT I bought got to" is used as the text data corresponding to the voice data to be recognized.
In one embodiment of the present invention, the method further includes: if no target standard pinyin sequence matching the pinyin data to be matched exists among the standard pinyin sequences, using the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
If no target standard pinyin sequence matching the pinyin data to be matched can be obtained by either perfect matching or supplementary matching, i.e. no matching target exists among the standard pinyin sequences, the correction of the original pinyin data was wrong, and the text data corresponding to the original pinyin data is used as the text data corresponding to the voice data to be recognized.
In the technical scheme above, the pinyin data to be matched is matched against pre-constructed standard pinyin sequences and the text data corresponding to the voice data to be recognized is determined from the matching result: the matching nodes in the pinyin data to be matched are determined, the target standard pinyin sequence is found by matching them against the standard pinyin nodes, and the text data corresponding to the target standard pinyin sequence is used as the text data of the voice data to be recognized. This makes the matching result more accurate and further improves the voice recognition result.
On the basis of the scheme, after obtaining at least one candidate standard pinyin sequence and before determining the weight of the candidate standard pinyin sequence for each candidate standard pinyin sequence, the method further comprises the steps of:
comparing the candidate standard pinyin sequences with the pinyin data to be matched for each candidate standard pinyin sequence, and determining a difference value between the candidate standard pinyin sequences and the pinyin data to be matched; and deleting the candidate standard pinyin sequence if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is larger than a preset difference threshold value.
In some cases, a candidate standard pinyin sequence obtained by the bidirectional matching method may still differ considerably from the pinyin data to be matched, so the candidate standard pinyin sequences can be screened after they are obtained. Specifically, a difference threshold may be preset; after the candidate standard pinyin sequences are obtained, the difference value between each candidate standard pinyin sequence and the pinyin sequence to be matched is calculated, and any candidate whose difference value is greater than the difference threshold is deleted. The difference threshold may be set according to actual requirements; for example, it may be 0.5.
For example, suppose the candidates include candidate standard pinyin sequence 1, candidate standard pinyin sequence 2 and candidate standard pinyin sequence 3, the difference threshold is 0.5, and the difference values between these candidates and the pinyin sequence to be matched are 0.4, 0.5 and 0.7 respectively. Then candidate standard pinyin sequence 3, whose difference value 0.7 exceeds the threshold 0.5, is deleted.
Alternatively, the difference value between a candidate standard pinyin sequence and the pinyin sequence to be matched may be calculated as C = m/n, where C is the difference value, m is the number of pinyins that differ between the candidate standard pinyin sequence and the pinyin sequence to be matched, and n is the total number of pinyins in the pinyin sequence to be matched.
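As a sketch of this screening step, the difference value C = m/n and the threshold filter can be written as below; this is a minimal illustration, and the exact definition of the m "differing pinyins" (here taken as pinyins of the candidate absent from the sequence to be matched) is an assumption.

```python
def difference_value(candidate, to_match):
    """Difference value C = m / n: m is the number of differing pinyins
    (here: candidate pinyins absent from the sequence to be matched, an
    assumed reading), n the total number of pinyins in the sequence to
    be matched."""
    m = sum(1 for p in candidate if p not in to_match)
    n = len(to_match)
    return m / n

def screen(candidates, to_match, threshold=0.5):
    """Keep only candidates whose difference value does not exceed the
    preset difference threshold (0.5 in the embodiment)."""
    return [c for c in candidates if difference_value(c, to_match) <= threshold]

to_match = "wo de PRODSORT dao na le".split()
candidates = [
    "wo de PRODSORT dao na le".split(),      # C = 0/6
    "wo mai de PRODSORT dao na le".split(),  # C = 1/6
]
print(len(screen(candidates, to_match)))     # both survive
```

With these inputs both candidates stay below the 0.5 threshold, matching the 0/6 and 1/6 example given later in this embodiment.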
Example III
Fig. 3a is a flowchart of a voice recognition method according to a third embodiment of the present invention. On the basis of the above embodiments, this embodiment takes intelligent voice customer service as an example and provides a preferred implementation. In this embodiment, based on the speech recognition result, an undirected search graph (of standard pinyin sequences) is generated from a manually combed near-tone pinyin dictionary and error-prone sentences, so that the speech recognition result is dynamically corrected and the user's intention is correctly recognized. The voice recognition method provided by this embodiment may be executed by an intelligent customer service system. Fig. 3b is a schematic structural diagram of an intelligent customer service system according to the third embodiment of the present invention. As shown in fig. 3b, the intelligent customer service system includes a speech recognition module (Automatic Speech Recognition, ASR) 310, a recognition correction module 320, a natural language processing module (Natural Language Processing, NLP) 340, and a speech synthesis module (Text To Speech, TTS), the speech synthesis module not being shown in the figure. Intelligent voice customer service is realized mainly as follows: the user's voice is converted into text by the automatic speech recognition technology of the speech recognition module 310; the text is passed to the recognition correction module 320 to obtain the text recognition result; the text recognition result is passed to the natural language processing module to produce a response; and finally the response text is converted into voice by the speech synthesis module and broadcast as the answer.
The recognition correction module 320 includes four parts: a near-tone pinyin dictionary 331, an undirected graph matching module 332, an error-prone sentence undirected graph 333, and a hotness table 334 of sentences and words. The near-tone pinyin dictionary 331 is used for building the near-tone pinyin dictionary, the error-prone sentence undirected graph 333 is used for building the undirected retrieval graph of error-prone sentences, the hotness table 334 is used for storing the hotness values of sentences and words, and the undirected graph matching module 332 is used for matching the corrected pinyin against the error-prone sentence undirected graph to obtain a matching result.
Fig. 3c is a schematic flow chart of an intelligent customer service response method according to a third embodiment of the present invention. As shown in fig. 3c, a high-frequency error-prone sentence is configured, and after the high-frequency error-prone sentence is subjected to generalization, sentence-to-pinyin conversion, undirected graph initialization and hotlist initialization, a hotlist of sentences and words and undirected search graphs are generated. When receiving the speech information spoken by the user, converting the speech information into characters through ASR, performing generalization processing on the characters, converting sentences into correct pinyin based on a pre-constructed near-phonetic pinyin dictionary, and then searching a matching result matched with the correct pinyin in the undirected graph. Specifically, whether all pinyins contained in the correct pinyins exist in the undirected graph is searched in the undirected graph, if all pinyins contained in the correct pinyins do not exist in the undirected graph, the original sentence is returned, so that the NLU module carries out intention recognition and text response according to the original sentence, and the text response is converted into voice response through TTS and then fed back to the user. 
If all the pinyins contained in the correct pinyin exist in the undirected graph, matching is performed using a bidirectional matching algorithm to obtain all matching sequences containing the correct pinyin. The difference value between each matching sequence and the correct pinyin is calculated, matching sequences whose difference value is greater than a set threshold are deleted, and the remaining matching sequences are taken as matching results. The weight of each matching result is then calculated, and the matching result with the highest weight is output to the NLU module, so that the NLU module performs intention recognition and text response according to it; the text response is converted into a voice response through TTS and fed back to the user.
The voice recognition method provided in this embodiment will be described in detail below. As shown in fig. 3a, the voice recognition method provided in this embodiment includes:
S310, establishing a near-tone pinyin dictionary.
High-frequency error-prone near-tone pinyins are manually sorted, the mapping relations between commonly mis-recognized pinyins and the correct near-tone pinyins are combed out, and the mappings are stored in a near-tone pinyin dictionary, which is put into a database. The mapping relations contained in the near-tone pinyin dictionary are shown by way of example in Table 1. As shown in Table 1, the commonly mis-recognized pinyins include "La" and "Wang"; the correct near-tone pinyin corresponding to "La" is "Na", and the correct near-tone pinyin corresponding to "Wang" is "Huang".
TABLE 1

Pinyin for common miswords    Correct near-tone pinyin
La                            Na
Wang                          Huang
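A minimal sketch of the correction step using the Table 1 mappings; the dictionary contents beyond Table 1 and the word-by-word replacement strategy are illustrative assumptions.

```python
# Near-tone pinyin dictionary built from the Table 1 mappings:
# commonly mis-recognized pinyin -> correct near-tone pinyin.
NEAR_TONE_DICT = {"La": "Na", "Wang": "Huang"}

def to_correct_pinyin(sequence):
    """Replace each error-prone pinyin with its correct near-tone pinyin;
    pinyins without a dictionary entry are kept unchanged."""
    return [NEAR_TONE_DICT.get(p, p) for p in sequence]

print(to_correct_pinyin(["wo", "de", "PRODSORT", "dao", "La", "le"]))
# -> ['wo', 'de', 'PRODSORT', 'dao', 'Na', 'le']
```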
S320, establishing an error-prone sentence undirected retrieval graph.
The standard descriptions of the high-frequency error-prone sentences are manually combed out; after generalization, all characters are converted into pinyin and stored in an undirected graph, for example by converting the words to pinyin with the TinyPinyin tool. The correspondence between the original sentence, the generalized text and the pinyin is shown by way of example in Table 2.
After the original sentence is converted into pinyin, the pinyin of each single word is taken as a node, and the spliced pinyin phrases are used to construct an undirected retrieval graph in forward order. Fig. 3d is a schematic diagram of an undirected search graph in a speech recognition method according to the third embodiment of the present invention. As shown in Fig. 3d, adjacent pinyin nodes in the pinyin phrases are connected to form an undirected search graph containing the connection relations.
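The graph construction described above can be sketched as follows; this is a minimal illustration, and the sample sentences are taken from the examples later in this embodiment.

```python
from collections import defaultdict

def build_undirected_graph(pinyin_sentences):
    """Build the undirected retrieval graph: each single-word pinyin is a
    node, and adjacent pinyins within a sentence are joined by an edge."""
    graph = defaultdict(set)
    for sentence in pinyin_sentences:
        for a, b in zip(sentence, sentence[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

sentences = [
    "wo mai de PRODSORT dao na le".split(),
    "wo de PRODSORT shen me shi hou fa huo".split(),
]
graph = build_undirected_graph(sentences)
print(sorted(graph["PRODSORT"]))  # -> ['dao', 'de', 'shen']
```

Because the graph is undirected, the node "PRODSORT" is connected to its neighbours from both sentences, which is what later allows whole-sentence variants sharing a sub-path to be enumerated.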
TABLE 2
S330, initializing a hotlist of sentences and words.
The hotness of each error-prone sentence and of each single word in the sentences is initialized. Alternatively, the initial sentence heat value of each sentence and the initial word heat value of each word may be set to 0. Table 3 schematically shows how the heat value of a sentence and of the words in it are represented. As shown in Table 3, the example sentence about the user's purchased PRODSORT has a sentence heat value of 1, and each word in it has its own word heat value; for example, the word corresponding to "I" has a heat value of 3, the word corresponding to "bought" a heat value of 2, "PRODSORT" a heat value of 3, and the words corresponding to "when" a heat value of 2.
TABLE 3
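The hotness-table initialization, together with the update later performed in step S370 after a sentence is output, can be sketched as below; this is a minimal illustration and the data-structure choice (dicts keyed by word and by sentence tuple) is an assumption.

```python
def init_hotness(sentences):
    """Initialise the sentence and word hotness tables to 0, as suggested
    in this embodiment."""
    sentence_hot = {tuple(s): 0 for s in sentences}
    word_hot = {w: 0 for s in sentences for w in s}
    return sentence_hot, word_hot

def update_hotness(output_sentence, sentence_hot, word_hot):
    """Step S370: add one to the output sentence's hotness and to the
    hotness of every word it contains."""
    sentence_hot[tuple(output_sentence)] += 1
    for w in output_sentence:
        word_hot[w] += 1

sentence_hot, word_hot = init_hotness([["wo", "mai", "de", "PRODSORT"]])
update_hotness(["wo", "mai", "de", "PRODSORT"], sentence_hot, word_hot)
print(sentence_hot[("wo", "mai", "de", "PRODSORT")], word_hot["wo"])  # -> 1 1
```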
S340, converting the user problem into correct pinyin.
After generalization, the user's question is converted into pinyin, and the pinyin is matched against the near-tone table to obtain the correct near-tone pinyin. For example, the characters are converted into pinyin using the TinyPinyin tool and matched against the near-tone pinyin table to obtain the correct pinyin. Table 4 schematically shows the correspondence among the original sentence, the generalized sentence, the pinyin and the correct pinyin.
TABLE 4
S350, searching the undirected search graph by adopting a bidirectional matching algorithm to obtain at least one matching result.
First, it is ensured that all the pinyins exist in the undirected search graph; if any pinyin does not exist in the graph, the matching fails and the original character string is returned directly. If they all exist, the correct pinyin sequence is traversed in turn according to the bidirectional matching algorithm, all matching results are returned through complete matching and complementary matching, and the matched pinyin is replaced with the corresponding text sequence from the undirected search graph.
In one embodiment, the correct pinyin to be matched is "wo de PRODSORT dao na le". First, it is checked whether every pinyin in the sequence exists in the undirected retrieval graph; here they all do. Then, starting from the head "wo" and the tail "le", the sequence is completely matched from both ends to the undirected graph pinyin sequence "wo-de-PRODSORT-dao-na-le". Next, any combination containing the current sequence is checked, listing the other possible whole sentences, such as "wo de PRODSORT dao na le" and "wo mai de PRODSORT dao na le", and the corresponding text sequences are returned. Finally, matching results with a change of more than 50% are filtered out; here the changes of the two results (0/6 and 1/6) do not exceed 50%, so all the matching results are returned.
In one embodiment, the correct pinyin to be matched is "wo de PRODSORT shen me fa". First, it is checked whether every pinyin in the sequence exists in the undirected retrieval graph; here they all do. Then, matching proceeds from the head "wo" and the tail "fa" to find a connected sequence. Fig. 3e is a schematic diagram of the bidirectional matching method in a voice recognition method according to the third embodiment of the present invention, in which a solid unidirectional arrow indicates the forward matching flow, a dashed unidirectional arrow indicates the reverse matching flow, and a solid bidirectional arrow indicates that the forward and reverse matching attempts succeeded. As shown in Fig. 3e, the forward matching ends at wo-de-PRODSORT-shen-me and the reverse matching ends at fa. At this point, the next node in the graph is tried on the forward side, giving wo-de-PRODSORT-shen-me-shi, and the next node is tried on the reverse side, giving fa-hou; since a path exists between "shi" and "hou", the two sides join into the connected undirected graph sequence wo-de-PRODSORT-shen-me-shi-hou-fa. If no connected sequence were found, the original character string would be returned directly. It is then determined again whether any combinations contain the current sequence, yielding the other whole sentences "wo mai de PRODSORT shen me shi hou fa", "wo mai de PRODSORT shen me shi hou fa huo", "wo de PRODSORT shen me shi hou fa" and "wo de PRODSORT shen me shi hou fa huo", and the text sequences corresponding to these pinyin sequences are returned.
Finally, matching results with a change of more than 50% are filtered out. The pinyin sequence whose change is 4/6, more than 50%, is deleted, and all the other sequences are returned.
When searching for a connected sequence through forward and reverse matching, each attempted node tries one layer and then two layers in both the forward and reverse directions, and each layer may contain several nodes, so all combinations must be exhausted; all connected sequences are found through repeated matching. If no combination matches completely after all have been tried, the matching is abandoned and the original text is returned. It should be noted that a threshold is set on the number of layers tried: at most two layers are attempted, since too many layers would hurt matching performance. For example, if there were no connected path between "shi" and "hou" in the example above, the next-layer nodes of "shi" and "hou" would not be tried further.
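The search for connected sequences can be sketched as a depth-limited graph search. This is a simplification: the embodiment's bidirectional matching works from both ends simultaneously, whereas this sketch walks forward only, allowing at most two filler nodes between consecutive query pinyins to mirror the two-layer threshold.

```python
from collections import defaultdict

def build_graph(sentences):
    # Undirected graph: adjacent pinyins in a sentence share an edge.
    graph = defaultdict(set)
    for s in sentences:
        for a, b in zip(s, s[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def find_connected_sequences(graph, query, max_gap=2):
    """Find graph paths containing all query pinyins in order, with at
    most `max_gap` filler nodes between consecutive query pinyins.
    Returns [] if any query pinyin is missing from the graph, in which
    case the original string would be returned."""
    if any(q not in graph for q in query):
        return []
    results = []

    def dfs(node, qi, path, gap):
        if qi == len(query):
            results.append(path)
            return
        for nxt in sorted(graph[node]):
            if nxt in path:
                continue  # avoid revisiting nodes
            if nxt == query[qi]:
                dfs(nxt, qi + 1, path + [nxt], 0)
            elif gap < max_gap:
                dfs(nxt, qi, path + [nxt], gap + 1)

    dfs(query[0], 1, [query[0]], 0)
    return results

g = build_graph([
    "wo mai de PRODSORT dao na le".split(),
    "wo de PRODSORT shen me shi hou fa huo".split(),
])
query = "wo de PRODSORT shen me fa".split()
for seq in find_connected_sequences(g, query):
    print("-".join(seq))
```

On the example graph this recovers both "wo-de-PRODSORT-shen-me-shi-hou-fa" and "wo-mai-de-PRODSORT-shen-me-shi-hou-fa"; the sketch stops at the last query pinyin and so omits the "fa huo" whole-sentence completions described above.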
S360, calculating the weight of the whole sentence of the matching result, and taking the matching result with the maximum weight as an output sentence.
The weight of each pinyin in a candidate standard pinyin sequence can be calculated by F(i) = ((HW+1)/(H(i)+1)) × (1+log10(H(i)+1)), and the weight of the candidate standard pinyin sequence by W = F(1) × F(2) × ... × F(n), wherein F(i) represents the weight value of the i-th pinyin in the candidate standard pinyin sequence, HW represents the sequence heat value of the candidate standard pinyin sequence, H(i) represents the pinyin heat value of the i-th pinyin in the candidate standard pinyin sequence, W represents the weight value of the candidate standard pinyin sequence, and n is the total number of pinyins in the candidate standard pinyin sequence.
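A sketch of the whole-sentence weight calculation, reading the formula as F(i) = ((HW+1)/(H(i)+1)) × (1+log10(H(i)+1)) and W = F(1)×F(2)×…×F(n); this grouping is inferred from the garbled original and should be treated as an assumption.

```python
import math

def pinyin_weight(seq_hot, pinyin_hot):
    """F(i) = ((HW + 1) / (H(i) + 1)) * (1 + log10(H(i) + 1))."""
    return ((seq_hot + 1) / (pinyin_hot + 1)) * (1 + math.log10(pinyin_hot + 1))

def sequence_weight(seq_hot, pinyin_hots):
    """W = F(1) * F(2) * ... * F(n) over all pinyins of the candidate."""
    w = 1.0
    for h in pinyin_hots:
        w *= pinyin_weight(seq_hot, h)
    return w

# A freshly initialised sequence (all heat values 0) weighs 1.0 per pinyin.
print(sequence_weight(0, [0, 0, 0]))  # -> 1.0
```

The candidate with the largest W would then be taken as the target standard pinyin sequence.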
TABLE 5
Table 5 schematically shows the correspondence among the original sentence, the matched sentences, the whole-sentence weight calculation and the output sentence. As shown in Table 5, the matched sentences corresponding to the original sentence include two candidates, with weights 0.848 and 0.006; the candidate with weight 0.848 is taken as the output sentence, and the answer is determined based on the output sentence.
S370, updating the word hotness and the sentence hotness.
The heat value of each word in the output sentence is increased by one, and the heat value of the output sentence is increased by one. Table 6 schematically shows the updated hotness table of word hotness and sentence hotness.
TABLE 6
Aiming at the problem that near-tone recognition errors and missed-word errors in speech-to-text conversion lead to errors in user intention recognition, this embodiment adds a manually configurable undirected retrieval graph of high-frequency error-prone sentences, and repairs high-frequency speech recognition errors through bidirectional matching on the undirected graph and weight-based sorting of the matching results. This adapts to continuously changing after-sales scenarios and avoids the long training period of a speech recognition model in the prior art: dynamic adaptation is achieved simply by adjusting the configuration of high-frequency error-prone sentences. High-frequency error-prone sentences are thus corrected, the accuracy and recall of user intention recognition are improved, and the user experience is improved.
Example IV
Fig. 4 is a schematic structural diagram of a voice recognition device according to a fourth embodiment of the present invention. The speech recognition means may be implemented in software and/or hardware, for example the speech recognition means may be arranged in a computer device. As shown in fig. 4, the apparatus includes a pinyin data acquisition module 410, a pinyin data calibration module 420, and a text data determination module 430, wherein:
the pinyin data acquisition module 410 is configured to acquire to-be-recognized voice data, and determine original pinyin data corresponding to the to-be-recognized voice data;
the pinyin data calibration module 420 is configured to correct the original pinyin data to obtain pinyin data to be matched;
the text data determining module 430 is configured to match the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determine text data corresponding to the speech data to be recognized according to a matching result.
In the embodiment of the present invention, the pinyin data acquisition module acquires the voice data to be recognized and determines the original pinyin data corresponding to it; the pinyin data calibration module corrects the original pinyin data to obtain the pinyin data to be matched; and the text data determining module matches the pinyin data to be matched with the pre-constructed standard pinyin sequences and determines the text data corresponding to the voice data to be recognized according to the matching result. By correcting the original pinyin data and recognizing on the basis of the corrected data, the accuracy of voice recognition is improved, and the response accuracy of intelligent voice customer service is further improved.
Based on the above scheme, the pinyin data calibration module 420 is specifically configured to:
determining the error pinyin contained in the original pinyin data as the pinyin to be corrected according to a preset pinyin near-sound table; wherein, the phonetic near-tone table stores at least one corresponding relation between the incorrect phonetic alphabet and the standard phonetic alphabet;
correcting the pinyin to be corrected contained in the original pinyin data into standard pinyin corresponding to the pinyin to be corrected, and obtaining the pinyin data to be matched.
On the basis of the above scheme, the text data determining module 430 includes:
the target sequence determining unit is used for determining a matching node in the pinyin data to be matched, and determining a target standard pinyin sequence matched with the pinyin data to be matched by matching the matching node with a standard pinyin node in the standard pinyin sequence;
and the text data determining unit is used for taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized.
On the basis of the above-described aspect, the target sequence determining unit includes:
the matching node determining subunit is used for taking each pinyin in the pinyin data to be matched as a matching node;
And the bidirectional matching subunit is used for matching the matching node with the standard pinyin nodes in the standard pinyin sequence by using a bidirectional matching algorithm, and obtaining the target standard pinyin sequence according to a matching result.
Based on the above scheme, the bidirectional matching subunit is specifically configured to:
matching the matching node with the standard pinyin node by using the bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence;
determining a weight of each candidate standard pinyin sequence according to a sequence heat value of the candidate standard pinyin sequence and a pinyin heat value of each pinyin in the candidate standard pinyin sequence, wherein the sequence heat value is used for representing the use frequency of the standard pinyin sequence, and the pinyin heat value is used for representing the use frequency of the pinyin;
and taking the candidate standard pinyin sequence with the maximum weight as the target standard pinyin sequence.
On the basis of the scheme, the bidirectional matching subunit is further configured to:
after at least one candidate standard pinyin sequence is obtained and before the weight of the candidate standard pinyin sequence is determined for each candidate standard pinyin sequence, comparing the candidate standard pinyin sequence with the pinyin data to be matched for each candidate standard pinyin sequence, and determining the difference value between the candidate standard pinyin sequence and the pinyin data to be matched;
And deleting the candidate standard pinyin sequence if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is larger than a preset difference threshold value.
On the basis of the above scheme, the text data determining module 430 is further configured to:
and if no target standard pinyin sequence matched with the pinyin data to be matched exists among the standard pinyin sequences, taking the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
On the basis of the scheme, the device further comprises:
and the heat value updating module is used for updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence.
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary computer device 512 suitable for use in implementing embodiments of the present invention. The computer device 512 shown in fig. 5 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 512 is in the form of a general purpose computing device. Components of computer device 512 may include, but are not limited to: one or more processors 516, a system memory 528, a bus 518 that connects the various system components (including the system memory 528 and the processor 516).
Bus 518 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 512 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 512 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 528 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 530 and/or cache memory 532. The computer device 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage 534 may be used to read from or write to a non-removable, non-volatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD ROM, DVD ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 518 through one or more data media interfaces. Memory 528 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 540 having a set (at least one) of program modules 542 may be stored in, for example, memory 528, such program modules 542 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 542 generally perform the functions and/or methods in the described embodiments of the invention.
The computer device 512 may also communicate with one or more external devices 514 (e.g., keyboard, pointing device, display 524, etc.), one or more devices that enable a user to interact with the computer device 512, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 512 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 522. Also, the computer device 512 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 520. As shown, network adapter 520 communicates with other modules of computer device 512 via bus 518. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 512, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Processor 516 executes programs stored in system memory 528 to perform various functional applications and data processing, such as implementing a speech recognition method provided by embodiments of the present invention, including:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the speech recognition method provided in any embodiment of the present invention.
Example six
The sixth embodiment of the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the speech recognition method as provided by the embodiments of the present invention, the method comprising:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
And matching the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
Of course, the computer-readable storage medium provided by the embodiments of the present invention, on which the computer program stored, is not limited to the method operations described above, but may also perform the related operations in the speech recognition method provided by any of the embodiments of the present invention.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions may be made without departing from the scope of the invention. Therefore, although the invention has been described in detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit or scope, which is defined by the appended claims.

Claims (8)

1. A method of speech recognition, comprising:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
matching the pinyin data to be matched with a pre-constructed standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result;
wherein matching the pinyin data to be matched with the pre-constructed standard pinyin sequence and determining, according to the matching result, the text data corresponding to the voice data to be recognized comprises:
determining matching nodes in the pinyin data to be matched, and determining a target standard pinyin sequence matching the pinyin data to be matched by matching the matching nodes with the standard pinyin nodes in the standard pinyin sequence, wherein the matching nodes in the pinyin data to be matched are the individual pinyins in the pinyin data to be matched;
taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized;
wherein determining the matching nodes in the pinyin data to be matched, and determining the target standard pinyin sequence matching the pinyin data to be matched by matching the matching nodes with the standard pinyin nodes in the standard pinyin sequence, comprises:
matching the matching nodes with the standard pinyin nodes by using a bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence;
determining a weight of each candidate standard pinyin sequence according to a sequence heat value of the candidate standard pinyin sequence and a pinyin heat value of each pinyin in the candidate standard pinyin sequence, wherein the sequence heat value is used for representing the use frequency of the standard pinyin sequence, and the pinyin heat value is used for representing the use frequency of the pinyin;
and taking the candidate standard pinyin sequence with the maximum weight as the target standard pinyin sequence.
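The weighting step of claim 1 can be sketched as follows. The claim does not fix how the sequence heat value and the per-pinyin heat values are combined, so adding the sequence heat to the mean pinyin heat is an assumption for illustration; the heat tables and candidate sequences below are likewise invented examples, not data from the patent.

```python
# Hypothetical sketch of claim 1's candidate weighting: combine a sequence's
# heat value with the heat values of its pinyins, then pick the maximum.

def sequence_weight(candidate, sequence_heat, pinyin_heat):
    """Weight a candidate standard pinyin sequence (assumed formula:
    sequence heat plus mean per-pinyin heat)."""
    seq_key = " ".join(candidate)
    mean_pinyin_heat = sum(pinyin_heat.get(p, 0) for p in candidate) / len(candidate)
    return sequence_heat.get(seq_key, 0) + mean_pinyin_heat

def pick_target_sequence(candidates, sequence_heat, pinyin_heat):
    """Return the candidate with the maximum weight (the target sequence)."""
    return max(candidates, key=lambda c: sequence_weight(c, sequence_heat, pinyin_heat))

# Illustrative heat tables: frequently used sequences and pinyins score higher.
candidates = [["jing", "dong"], ["jin", "dong"]]
sequence_heat = {"jing dong": 120, "jin dong": 3}
pinyin_heat = {"jing": 50, "dong": 80, "jin": 10}

target = pick_target_sequence(candidates, sequence_heat, pinyin_heat)
# "jing dong" wins: higher sequence heat and higher mean pinyin heat.
```

Because both heat values track usage frequency, the maximum-weight rule biases recognition toward sequences the user actually says often, which is the stated purpose of the heat values in the claim.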
2. The method of claim 1, wherein correcting the original pinyin data to obtain pinyin data to be matched comprises:
determining, according to a preset pinyin near-sound table, the erroneous pinyins contained in the original pinyin data as the pinyins to be corrected; wherein the pinyin near-sound table stores at least one correspondence between an erroneous pinyin and a standard pinyin;
and correcting each pinyin to be corrected contained in the original pinyin data into the standard pinyin corresponding to it, to obtain the pinyin data to be matched.
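A minimal sketch of claim 2's correction step, assuming the near-sound table is a simple dictionary from mis-recognized pinyins to their standard forms. The table entries below (flat/retroflex and l/n confusions common in Mandarin speech) are illustrative assumptions, not contents taken from the patent.

```python
# Hypothetical near-sound table: erroneous pinyin -> standard pinyin.
NEAR_SOUND_TABLE = {
    "si": "shi",    # flat vs. retroflex initial (s/sh)
    "zen": "zhen",  # z/zh confusion
    "lan": "nan",   # l/n confusion
}

def correct_pinyin(original):
    """Replace each pinyin found in the near-sound table with its standard
    form; pinyins not in the table pass through unchanged."""
    return [NEAR_SOUND_TABLE.get(p, p) for p in original]

corrected = correct_pinyin(["si", "jie"])  # -> ["shi", "jie"]
```

A lookup table like this handles systematic acoustic confusions cheaply, before the more expensive sequence matching of claim 1 runs.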
3. The method of claim 1, further comprising, after obtaining the at least one candidate standard pinyin sequence and before determining the weight of each candidate standard pinyin sequence:
for each candidate standard pinyin sequence, comparing the candidate standard pinyin sequence with the pinyin data to be matched, and determining a difference value between the candidate standard pinyin sequence and the pinyin data to be matched;
and deleting the candidate standard pinyin sequence if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is larger than a preset difference threshold.
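The pruning of claim 3 can be sketched as below. The claim leaves the "difference value" unspecified, so token-level edit distance between pinyin sequences is an assumed choice here, and the threshold of 1 is illustrative.

```python
# Hypothetical realization of claim 3: drop candidates whose difference
# value (here, Levenshtein distance over pinyin tokens) exceeds a threshold.

def edit_distance(a, b):
    """Levenshtein distance between two pinyin sequences, one row of the
    dynamic-programming table at a time."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds the diagonal cell dist(a[:i-1], b[:j-1]).
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def prune(candidates, to_match, threshold=1):
    """Delete candidates whose difference value exceeds the threshold."""
    return [c for c in candidates if edit_distance(c, to_match) <= threshold]

kept = prune([["jing", "dong"], ["shang", "hai", "shi"]], ["jin", "dong"])
# Only ["jing", "dong"] survives: it differs from the input by one pinyin.
```

Pruning before weighting keeps the heat-value comparison of claim 1 from ever ranking a candidate that is too far from what was actually spoken.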
4. The method as recited in claim 1, further comprising:
if no target standard pinyin sequence matching the pinyin data to be matched exists among the standard pinyin sequences, taking the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
5. The method as recited in claim 1, further comprising:
and updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence.
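Claim 5's update step might look like the following sketch: after a successful match, the target sequence's heat and the heat of each of its pinyins are incremented, so frequently used entries rank higher in later matches. Plain dictionary counters and an increment of 1 are illustrative assumptions.

```python
# Hypothetical sketch of claim 5: bump usage-frequency counters for the
# matched target sequence and for each pinyin it contains.

def update_heat(target, sequence_heat, pinyin_heat):
    """Increment the sequence heat of the target standard pinyin sequence
    and the pinyin heat of each pinyin in it."""
    seq_key = " ".join(target)
    sequence_heat[seq_key] = sequence_heat.get(seq_key, 0) + 1
    for p in target:
        pinyin_heat[p] = pinyin_heat.get(p, 0) + 1

seq_heat, pin_heat = {}, {}
update_heat(["jing", "dong"], seq_heat, pin_heat)
# seq_heat == {"jing dong": 1}; pin_heat == {"jing": 1, "dong": 1}
```

This makes the weighting of claim 1 adaptive: every recognition feeds back into the heat tables it will consult next time.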
6. A speech recognition apparatus, comprising:
a pinyin data acquisition module, configured to acquire voice data to be recognized and determine original pinyin data corresponding to the voice data to be recognized;
a pinyin data correction module, configured to correct the original pinyin data to obtain pinyin data to be matched;
and a text data determining module, configured to match the pinyin data to be matched with a pre-constructed standard pinyin sequence and determine, according to a matching result, text data corresponding to the voice data to be recognized;
wherein the text data determining module is specifically configured to:
determine matching nodes in the pinyin data to be matched, and determine a target standard pinyin sequence matching the pinyin data to be matched by matching the matching nodes with the standard pinyin nodes in the standard pinyin sequence, wherein the matching nodes in the pinyin data to be matched are the individual pinyins in the pinyin data to be matched;
and take the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized;
wherein the text data determining module is further specifically configured to:
match the matching nodes with the standard pinyin nodes by using a bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence;
determine a weight of each candidate standard pinyin sequence according to a sequence heat value of the candidate standard pinyin sequence and a pinyin heat value of each pinyin in the candidate standard pinyin sequence, wherein the sequence heat value represents the use frequency of the standard pinyin sequence, and the pinyin heat value represents the use frequency of the pinyin;
and take the candidate standard pinyin sequence with the maximum weight as the target standard pinyin sequence.
7. A computer device, comprising:
one or more processors;
a storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method of any one of claims 1-5.
8. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method according to any one of claims 1-5.
CN201910710043.1A 2019-07-31 2019-07-31 Voice recognition method, device, equipment and medium Active CN111739514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910710043.1A CN111739514B (en) 2019-07-31 2019-07-31 Voice recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910710043.1A CN111739514B (en) 2019-07-31 2019-07-31 Voice recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111739514A CN111739514A (en) 2020-10-02
CN111739514B true CN111739514B (en) 2023-11-14

Family

ID=72645844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910710043.1A Active CN111739514B (en) 2019-07-31 2019-07-31 Voice recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111739514B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417102B (en) * 2020-11-26 2024-03-22 中国科学院自动化研究所 Voice query method, device, server and readable storage medium
CN112509566B (en) * 2020-12-22 2024-03-19 阿波罗智联(北京)科技有限公司 Speech recognition method, device, equipment, storage medium and program product
CN112651854A (en) * 2020-12-23 2021-04-13 讯飞智元信息科技有限公司 Voice scheduling method and device, electronic equipment and storage medium
CN112767923B (en) * 2021-01-05 2022-12-23 上海微盟企业发展有限公司 Voice recognition method and device
CN113129894A (en) * 2021-04-12 2021-07-16 阿波罗智联(北京)科技有限公司 Speech recognition method, speech recognition device, electronic device and storage medium
CN113158649B (en) * 2021-05-27 2023-04-21 广州广电运通智能科技有限公司 Error correction method, device, medium and product for subway station name identification
CN113744722A (en) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Off-line speech recognition matching device and method for limited sentence library
CN117116267B (en) * 2023-10-24 2024-02-13 科大讯飞股份有限公司 Speech recognition method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006026908A1 (en) * 2004-08-25 2006-03-16 Dong Li A chinese characters inputting method which uses continuous phonetic letters in a portable terminal
CN101067780A (en) * 2007-06-21 2007-11-07 腾讯科技(深圳)有限公司 Character inputting system and method for intelligent equipment
CN103377652A (en) * 2012-04-25 2013-10-30 上海智臻网络科技有限公司 Method, device and equipment for carrying out voice recognition
CN105489220A (en) * 2015-11-26 2016-04-13 小米科技有限责任公司 Method and device for recognizing speech
CN109597983A (en) * 2017-09-30 2019-04-09 北京国双科技有限公司 A kind of spelling error correction method and device
CN109992765A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Text error correction method and device, storage medium and electronic equipment
CN110019650A (en) * 2018-09-04 2019-07-16 北京京东尚科信息技术有限公司 Method, apparatus, storage medium and the electronic equipment of search associational word are provided


Also Published As

Publication number Publication date
CN111739514A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111739514B (en) Voice recognition method, device, equipment and medium
US11322153B2 (en) Conversation interaction method, apparatus and computer readable storage medium
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN110717031B (en) Intelligent conference summary generation method and system
CN109036391B (en) Voice recognition method, device and system
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN107301865B (en) Method and device for determining interactive text in voice input
CN1542736B (en) Rules-based grammar for slots and statistical model for preterminals in natural language understanding system
CN110364142B (en) Speech phoneme recognition method and device, storage medium and electronic device
TWI508057B (en) Speech recognition system and method
CN109032375B (en) Candidate text sorting method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN112036162B (en) Text error correction adaptation method and device, electronic equipment and storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
CN109036471B (en) Voice endpoint detection method and device
CN111611349A (en) Voice query method and device, computer equipment and storage medium
WO2014036827A1 (en) Text correcting method and user equipment
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN108595412B (en) Error correction processing method and device, computer equipment and readable medium
CN112861518A (en) Text error correction method and device, storage medium and electronic device
CN113948066A (en) Error correction method, system, storage medium and device for real-time translation text
CN111326144B (en) Voice data processing method, device, medium and computing equipment
TWI752406B (en) Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product
CN113284499A (en) Voice instruction recognition method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant