CN111739514A - Voice recognition method, device, equipment and medium - Google Patents
- Publication number
- CN111739514A (application number CN201910710043.1A)
- Authority
- CN
- China
- Prior art keywords
- pinyin
- data
- sequence
- standard
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L15/00 — Speech recognition
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/26 — Speech to text systems
Abstract
The embodiments of the present invention disclose a speech recognition method, apparatus, device, and medium. The method comprises: acquiring speech data to be recognized and determining the original pinyin data corresponding to the speech data to be recognized; correcting the original pinyin data to obtain pinyin data to be matched; and matching the pinyin data to be matched against a pre-established standard pinyin sequence, then determining the text data corresponding to the speech data to be recognized according to the matching result. By correcting the original pinyin data and performing recognition based on the corrected pinyin data, the method improves the accuracy of speech recognition and, in turn, the response accuracy of intelligent voice customer service.
Description
Technical Field
Embodiments of the present invention relate to the field of information processing, and in particular, to a method, an apparatus, a device, and a medium for speech recognition.
Background
With the continuous development of network technology, speech recognition is applied ever more widely. For example, in the answering scenario of intelligent voice customer service, a voice robot can resolve a user's problem through spoken question-and-answer interaction.
Implementing a response by intelligent voice customer service involves the following steps: converting the user's speech into text, identifying the user's intention from the speech-to-text result, obtaining a response text based on that intention, and then converting the response text into speech for broadcast. The current mainstream approach to speech-to-text conversion is: collect speech samples, label features in the samples, train a model with a deep learning algorithm (such as a recurrent neural network or a convolutional neural network) to obtain a trained speech recognition model, and then perform real-time speech recognition with the trained model to convert speech into text.
In the course of implementing the invention, the inventors found at least the following technical problems in the prior art: a model trained on a general-purpose speech corpus produces relatively fixed recognition results. Because of users' accents and Chinese expression habits, and changes in background noise or dictation volume, near-homophone words are misrecognized and words are dropped, so the speech-to-text result is wrong; the user intention identified from that result then diverges from the user's actual intention, and the response is inaccurate. Moreover, user expression is highly varied, so it is difficult to train a single model that suits all users.
Disclosure of Invention
The embodiments of the present invention provide a speech recognition method, apparatus, device, and medium, so as to improve the accuracy of speech recognition and thereby the response accuracy of intelligent voice customer service.
In a first aspect, an embodiment of the present invention provides a speech recognition method, including:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, including:
the pinyin data acquisition module is used for acquiring the voice data to be recognized and determining original pinyin data corresponding to the voice data to be recognized;
the pinyin data calibration module is used for correcting the original pinyin data to obtain pinyin data to be matched;
and the text data determining module is used for matching the pinyin data to be matched with a pre-established standard pinyin sequence and determining text data corresponding to the voice data to be recognized according to a matching result.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method provided by any embodiment of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech recognition method provided in any embodiment of the present invention.
In the embodiments of the present invention, speech data to be recognized is acquired and its original pinyin data is determined; the original pinyin data is corrected to obtain pinyin data to be matched; and the pinyin data to be matched is matched against a pre-established standard pinyin sequence, with the text data corresponding to the speech data to be recognized determined from the matching result. Correcting the original pinyin data and recognizing based on the corrected pinyin data improves the accuracy of speech recognition and, in turn, the response accuracy of intelligent voice customer service.
Drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention;
fig. 3a is a flowchart of a speech recognition method according to a third embodiment of the present invention;
fig. 3b is a schematic structural diagram of an intelligent customer service system according to a third embodiment of the present invention;
fig. 3c is a schematic flow chart of an intelligent customer service response method according to a third embodiment of the present invention;
fig. 3d is a schematic diagram of an undirected search graph in a speech recognition method according to a third embodiment of the present invention;
fig. 3e is a schematic diagram of a bidirectional matching method in a speech recognition method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention. The embodiment can be applied to the situation when voice data is recognized, and is particularly suitable for the situation when voice intelligent customer service performs voice response. The method may be performed by a speech recognition apparatus, which may be implemented in software and/or hardware, for example, the speech recognition apparatus may be configured in a computer device. As shown in fig. 1, the method includes:
s110, voice data to be recognized are obtained, and original pinyin data corresponding to the voice data to be recognized are determined.
In this embodiment, the voice data to be recognized may be question information input by a user through voice. In order to make the voice recognition result more accurate, in this embodiment, the initial recognition result is corrected according to the pinyin data, and the final recognition result is determined based on the corrected pinyin data.
Optionally, after the question information in speech form (the speech data to be recognized) is acquired, the speech data to be recognized may be fed into an existing speech recognition model that converts speech data into text, the text data output by that model may be obtained, and a pinyin conversion tool may then convert the text data into the original pinyin data corresponding to the speech data to be recognized. Alternatively, a pinyin data recognition model that converts the speech data to be recognized directly into pinyin data can be trained; after the question information in speech form is obtained, the speech data to be recognized is fed into the trained pinyin data recognition model to obtain the corresponding original pinyin data.
In one embodiment, when the speech data to be recognized is first converted into text-form data and the text-form data is then converted into pinyin-form data, the text-form data can be generalized before the pinyin conversion: entity words in the text are replaced with generalized tokens, and the generalized text is then converted into pinyin form to obtain the original pinyin data corresponding to the speech data to be recognized. For example, if the text corresponding to the speech data to be recognized is "when will the mobile phone I bought arrive", the entity word "mobile phone" is generalized to "PRODSORT", yielding the generalized text "when will the PRODSORT I bought arrive"; converting the generalized text into pinyin form gives the original pinyin data "wo mai de PRODSORT shen me shi hou dao".
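The generalization and pinyin-conversion step above can be sketched in Python as follows. The entity table and the character-to-pinyin map are toy assumptions for illustration; a real system would use an entity dictionary and a pinyin library (such as pypinyin) rather than the hand-rolled map here.

```python
ENTITY_TABLE = {"手机": "PRODSORT"}   # entity word -> generalized token (illustrative)
CHAR_TO_PINYIN = {                    # toy character->pinyin map (illustrative)
    "我": "wo", "买": "mai", "的": "de",
    "什": "shen", "么": "me", "时": "shi", "候": "hou", "到": "dao",
}

def generalize(text: str) -> str:
    """Replace entity words in the text-form data with generalized tokens."""
    for word, token in ENTITY_TABLE.items():
        text = text.replace(word, token)
    return text

def to_pinyin(text: str) -> str:
    """Convert generalized text to a space-separated pinyin string,
    passing ASCII generalized tokens through unchanged."""
    out, i = [], 0
    while i < len(text):
        if text[i].isascii():         # start of a generalized token
            j = i
            while j < len(text) and text[j].isascii():
                j += 1
            out.append(text[i:j])
            i = j
        else:                         # a Chinese character
            out.append(CHAR_TO_PINYIN.get(text[i], "?"))
            i += 1
    return " ".join(out)

raw = "我买的手机什么时候到"
print(to_pinyin(generalize(raw)))  # wo mai de PRODSORT shen me shi hou dao
```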
In one embodiment, if the speech data to be recognized is converted into original pinyin data by a trained pinyin data recognition model, sample speech data and the original pinyin data corresponding to it can be obtained in advance, pinyin data recognition sample pairs can be formed from the two, and a pre-established recognition model can be trained with these sample pairs to obtain the trained pinyin data recognition model.
And S120, correcting the original pinyin data to obtain pinyin data to be matched.
Considering that the same pinyin data may represent different text data, this embodiment corrects the initial recognition result at the pinyin level in order to simplify the correction process. Optionally, correcting the original pinyin data may consist of correcting wrong pinyin in the original pinyin data into standard pinyin. High-frequency, error-prone near-homophone pinyin can be collated manually in advance to build a mapping from wrong pinyin to standard pinyin; the collated mapping serves as a pinyin near-sound table, and the original pinyin data is corrected against this preset table.
In an embodiment of the present invention, the correcting the original pinyin data to obtain pinyin data to be matched includes: determining wrong pinyin contained in the original pinyin data as pinyin to be corrected according to a preset pinyin near-sound table; wherein, the pinyin near sound table stores the corresponding relation between at least one wrong pinyin and a standard pinyin; and correcting the pinyin to be corrected contained in the original pinyin data into the standard pinyin corresponding to the pinyin to be corrected, so as to obtain the pinyin data to be matched.
Optionally, the original pinyin data is traversed, each pinyin that matches a wrong pinyin in the preset near-sound table is marked as a pinyin to be corrected, the standard pinyin corresponding to it is looked up in the table, and the pinyin to be corrected is replaced with that standard pinyin. For example, if the original pinyin data is "wo de PRODSORT dao la le" and a lookup in the preset near-sound table shows that it contains the wrong pinyin "la", whose standard pinyin is "na", then "la" is taken as the pinyin to be corrected and replaced with "na", yielding the pinyin data to be matched "wo de PRODSORT dao na le".
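The correction step can be sketched as a single pass over the pinyin tokens. The table contents below are illustrative assumptions, not data from the patent:

```python
NEAR_SOUND_TABLE = {"la": "na"}   # wrong pinyin -> standard pinyin (illustrative)

def correct(raw_pinyin: str) -> str:
    """Traverse the original pinyin data and replace each wrong pinyin
    with its standard pinyin from the preset near-sound table."""
    return " ".join(NEAR_SOUND_TABLE.get(p, p) for p in raw_pinyin.split())

print(correct("wo de PRODSORT dao la le"))   # wo de PRODSORT dao na le
```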
S130, matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
In this embodiment, the standard pinyin sequence is used to determine whether the pinyin data has been corrected accurately. Optionally, high-frequency error-prone sentences containing high-frequency error-prone near-homophone pinyin can be collated manually in advance; the standard descriptions of these sentences are generalized and converted into pinyin format to obtain their standard pinyin data, and standard pinyin sequences composed of pinyin nodes are constructed from that data.
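The construction of the node-based standard pinyin sequences can be sketched as an adjacency graph over pinyin nodes, in the spirit of the search graph of Fig. 3d. The sentence data and the choice of a shared adjacency map are illustrative assumptions:

```python
from collections import defaultdict

def build_graph(standard_pinyin_sentences):
    """Build an adjacency map: node -> set of standard pinyin nodes that
    may follow it, plus the set of nodes that can start a sequence."""
    adjacency, roots = defaultdict(set), set()
    for sentence in standard_pinyin_sentences:
        nodes = sentence.split()
        roots.add(nodes[0])
        for prev, nxt in zip(nodes, nodes[1:]):
            adjacency[prev].add(nxt)
    return adjacency, roots

adjacency, roots = build_graph([
    "wo mai de PRODSORT dao na le",
    "wo mai de PRODSORT shen me shi hou dao",
])
print(sorted(adjacency["PRODSORT"]))  # ['dao', 'shen']
```

Sharing one adjacency map merges common prefixes across sentences, which keeps the sketch small; a production graph might keep per-sequence node identity instead.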
In one embodiment, the pinyin data to be matched is matched against the pre-established standard pinyin sequences. If a target standard pinyin sequence matching the pinyin data to be matched is found, the correction of the original pinyin data is deemed accurate, and the text data corresponding to the target standard pinyin sequence is used as the text data corresponding to the speech data to be recognized. If no matching target standard pinyin sequence is found, the correction is deemed inaccurate, and the text data corresponding to the original pinyin data is used as the text data corresponding to the speech data to be recognized.
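This fall-back decision can be sketched in a few lines; the sequence-to-text table is an illustrative assumption:

```python
SEQUENCE_TEXT = {   # standard pinyin sequence -> text data (illustrative)
    "wo mai de PRODSORT dao na le": "where did the PRODSORT I bought go",
}

def resolve_text(target_sequence, original_text):
    """Use the matched standard sequence's text when a match exists;
    otherwise fall back to the text of the original pinyin data."""
    if target_sequence in SEQUENCE_TEXT:
        return SEQUENCE_TEXT[target_sequence]
    return original_text

print(resolve_text("wo mai de PRODSORT dao na le", "original text"))
print(resolve_text(None, "original text"))  # no match -> original text
```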
In the embodiment of the present invention, speech data to be recognized is acquired and its original pinyin data is determined; the original pinyin data is corrected to obtain pinyin data to be matched; and the pinyin data to be matched is matched against a pre-established standard pinyin sequence, with the text data corresponding to the speech data to be recognized determined from the matching result. Correcting the original pinyin data and recognizing based on the corrected pinyin data improves the accuracy of speech recognition and, in turn, the response accuracy of intelligent voice customer service.
Example two
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention. The present embodiment is optimized based on the above embodiments. As shown in fig. 2, the method includes:
s210, voice data to be recognized are obtained, and original pinyin data corresponding to the voice data to be recognized are determined.
S220, correcting the original pinyin data to obtain pinyin data to be matched.
And S230, determining a matching node in the pinyin data to be matched, and matching the matching node with a standard pinyin node in a standard pinyin sequence to determine a target standard pinyin sequence matched with the pinyin data to be matched.
In this embodiment, the standard pinyin sequence is a sequence of standard pinyin nodes. To match the pinyin data to be matched against a standard pinyin sequence, the matching nodes in the pinyin data need to be determined and matched, node by node, against the standard pinyin nodes to obtain the target standard pinyin sequence. Optionally, a first standard pinyin node matching the first matching node in the pinyin data to be matched is determined; the second matching node after the first is then matched against the standard pinyin nodes connected to the first standard pinyin node to obtain a second standard pinyin node; continuing in this way yields a standard pinyin node matching each matching node in the pinyin data to be matched, and the sequence formed by these standard pinyin nodes is taken as the target standard pinyin sequence matching the pinyin data to be matched.
Illustratively, the target standard pinyin sequence matching the pinyin data to be matched can be obtained by complete matching. Suppose the pinyin data to be matched is "wo mai de PRODSORT dao na le". The matching nodes "wo", "mai", "de", "PRODSORT", "dao", "na" and "le" are determined and matched in turn, each against the standard pinyin nodes connected to the standard pinyin node matched by the previous matching node: the standard pinyin node "wo" matching the matching node "wo"; the node "mai" matching "mai" and connected to "wo"; the node "de" matching "de" and connected to "mai"; the node "PRODSORT" matching "PRODSORT" and connected to "de"; the node "dao" matching "dao" and connected to "PRODSORT"; the node "na" matching "na" and connected to "dao"; and the node "le" matching "le" and connected to "na". The sequence "wo mai de PRODSORT dao na le" formed by the standard pinyin nodes "wo", "mai", "de", "PRODSORT", "dao", "na" and "le" is taken as the target standard pinyin sequence.
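The complete-matching walk above can be sketched as follows: each matching node must equal a standard pinyin node connected to the node matched at the previous position. The graph data is an illustrative assumption:

```python
GRAPH = {   # standard pinyin node -> connected successor nodes (illustrative)
    "wo": {"mai"}, "mai": {"de"}, "de": {"PRODSORT"},
    "PRODSORT": {"dao"}, "dao": {"na"}, "na": {"le"}, "le": set(),
}
ROOTS = {"wo"}   # nodes that may start a standard pinyin sequence

def complete_match(pinyin_to_match, graph, roots):
    """Walk the matching nodes through the standard pinyin node graph;
    succeed only if every step follows an existing connection."""
    nodes = pinyin_to_match.split()
    if nodes[0] not in roots:
        return None
    for prev, cur in zip(nodes, nodes[1:]):
        if cur not in graph.get(prev, set()):
            return None   # no connected standard pinyin node matches
    return " ".join(nodes)   # the target standard pinyin sequence

print(complete_match("wo mai de PRODSORT dao na le", GRAPH, ROOTS))
# wo mai de PRODSORT dao na le
```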
Considering that, owing to users' expression habits or losses during speech input, the pinyin data to be matched may be missing pinyin relative to the standard pinyin sequence, this embodiment may adopt a bidirectional matching algorithm and determine the target standard pinyin sequence by complementary matching. In an embodiment of the present invention, determining a matching node in the pinyin data to be matched and determining the target standard pinyin sequence by matching the matching node with the standard pinyin nodes in the standard pinyin sequence includes: taking each pinyin in the pinyin data to be matched as a matching node; and matching the matching nodes with the standard pinyin nodes in the standard pinyin sequence using a bidirectional matching algorithm, and obtaining the target standard pinyin sequence from the matching result.
Optionally, the bidirectional matching algorithm may match nodes in the forward and/or reverse direction; in general, forward matching determines the standard pinyin nodes that match the matching nodes in the pinyin data to be matched. Suppose the matching nodes are a first, a second, and a third matching node, and forward matching finds a first standard pinyin node matching the first matching node and a second standard pinyin node matching the second matching node and connected to the first, but finds no standard pinyin node that matches the third matching node and is connected to the second standard pinyin node. In that case, all first candidate pinyin nodes connected to the second standard pinyin node in the standard pinyin sequence are collected, and a third standard pinyin node matching the third matching node is searched for among the standard pinyin nodes. It is then determined whether any first candidate pinyin node is a fourth standard pinyin node connected to the third standard pinyin node. If so, the sequence consisting of the first, second, fourth, and third standard pinyin nodes in that order is taken as the target standard pinyin sequence matching the pinyin data to be matched. If not, the second candidate pinyin nodes connected to the third standard pinyin node in the standard pinyin sequence are collected, and it is determined whether any first candidate pinyin node is connected to any second candidate pinyin node; if a fifth standard pinyin node among the first candidates is connected to a sixth standard pinyin node among the second candidates, the sequence consisting of the first, second, fifth, sixth, and third standard pinyin nodes in that order is taken as the target standard pinyin sequence matching the pinyin data to be matched.
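The complementary step can be sketched as a small breadth-first bridge search: when a matching node has no directly connected standard node, intermediate standard pinyin nodes are inserted to close the gap. The patent's text describes one or two intermediate candidate layers; generalizing this to a `max_hops`-bounded search, like the graph data below, is an illustrative assumption:

```python
GRAPH = {   # standard pinyin node -> connected successor nodes (illustrative)
    "wo": {"mai"}, "mai": {"de"}, "de": {"PRODSORT"},
    "PRODSORT": {"shen"}, "shen": {"me"}, "me": {"shi"},
    "shi": {"hou"}, "hou": {"dao"}, "dao": set(),
}

def bridge(prev, cur, graph, max_hops=4):
    """Breadth-first search for intermediate standard pinyin nodes
    linking prev -> ... -> cur; returns the intermediate path or None."""
    frontier = [(n, [n]) for n in graph.get(prev, ())]
    for _ in range(max_hops):
        nxt = []
        for node, path in frontier:
            if cur in graph.get(node, ()):
                return path
            nxt.extend((m, path + [m]) for m in graph.get(node, ()))
        frontier = nxt
    return None

def complementary_match(pinyin_to_match, graph):
    """Match node by node, bridging gaps with intermediate standard nodes."""
    nodes = pinyin_to_match.split()
    result = [nodes[0]]
    for prev, cur in zip(nodes, nodes[1:]):
        if cur in graph.get(prev, ()):
            result.append(cur)
        else:
            inter = bridge(prev, cur, graph)
            if inter is None:
                return None     # gap cannot be bridged
            result.extend(inter)
            result.append(cur)
    return " ".join(result)

# the user dropped "shen me shi hou"; complementary matching restores it
print(complementary_match("wo mai de PRODSORT dao", GRAPH))
# wo mai de PRODSORT shen me shi hou dao
```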
In an embodiment of the present invention, matching the matching nodes with the standard pinyin nodes in the standard pinyin sequence using a bidirectional matching algorithm and obtaining the target standard pinyin sequence from the matching result includes: matching the matching nodes with the standard pinyin nodes using the bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence; for each candidate standard pinyin sequence, determining its weight from the sequence heat value of the candidate sequence and the pinyin heat value of each pinyin in it, where the sequence heat value represents how frequently a standard pinyin sequence is used and the pinyin heat value represents how frequently a pinyin is used; and taking the candidate standard pinyin sequence with the largest weight as the target standard pinyin sequence.
In this embodiment, there may be a plurality of obtained standard pinyin sequences matching the pinyin data to be matched, and the standard pinyin sequence with the largest weight may be used as the target standard pinyin sequence matching the pinyin data to be matched by calculating the weight of each standard pinyin sequence. For example, assuming that the standard pinyin sequences matched with the pinyin data to be matched include a candidate standard pinyin sequence 1, a candidate standard pinyin sequence 2 and a candidate standard pinyin sequence 3, and the weight of the candidate standard pinyin sequence 1 is 0.89, the weight of the candidate standard pinyin sequence 2 is 0.65, and the weight of the candidate standard pinyin sequence 3 is 0.78, the candidate standard pinyin sequence 1 with the largest weight is taken as the target standard pinyin sequence.
Optionally, after obtaining a plurality of candidate standard pinyin sequences matched with the pinyin data to be matched, calculating the weight of each candidate standard pinyin sequence according to the sequence heat value of the candidate standard pinyin sequence and the pinyin heat value of each pinyin in the candidate standard pinyin sequence aiming at each candidate standard pinyin sequence. The sequence heat value of the candidate standard pinyin sequence may be the number of times that the candidate standard pinyin sequence is taken as the target standard pinyin sequence, and the pinyin heat value of the pinyin may be the number of times that the pinyin exists in the target standard pinyin sequence. Because the pinyin heat value can represent the use frequency of pinyin, and the sequence heat value can represent the use frequency of a standard pinyin sequence, the weight of a candidate standard pinyin sequence calculated based on the pinyin heat value and the sequence heat value can accurately screen a target standard pinyin sequence matched with pinyin data to be matched.
In this embodiment, the weight of each pinyin in the candidate standard pinyin sequence may be calculated first, and the weight of the candidate standard pinyin sequence may then be calculated from the pinyin weights. Illustratively, the weight of each pinyin in the candidate standard pinyin sequence may be calculated as F(i) = ((HW + 1) / (H(i) + 1)) × (1 + log10(H(i) + 1)), and the weight of the candidate standard pinyin sequence as W = F(1) × F(2) × … × F(n). Here, F(i) is the weight of the i-th pinyin in the candidate standard pinyin sequence, HW is the sequence heat value of the candidate standard pinyin sequence, H(i) is the pinyin heat value of the i-th pinyin in the candidate standard pinyin sequence, W is the weight of the candidate standard pinyin sequence, and n is the total number of pinyins in the candidate standard pinyin sequence.
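A numeric sketch of this weight calculation follows. Note that the grouping of the formula is reconstructed from the garbled source text, and the heat values used below are illustrative:

```python
import math

def pinyin_weight(hw: int, h_i: int) -> float:
    """F(i) = ((HW + 1) / (H(i) + 1)) * (1 + log10(H(i) + 1))."""
    return ((hw + 1) / (h_i + 1)) * (1 + math.log10(h_i + 1))

def sequence_weight(hw: int, heats) -> float:
    """W = F(1) * F(2) * ... * F(n) over the pinyin heat values."""
    w = 1.0
    for h_i in heats:
        w *= pinyin_weight(hw, h_i)
    return w

# a sequence used 9 times whose pinyins have heat values 9, 19, 9
print(round(sequence_weight(9, [9, 19, 9]), 3))  # 4.602
```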
In this embodiment, after the target standard pinyin sequence is determined, the method further includes: updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence. These updates keep the weight calculation for candidate standard pinyin sequences accurate. Specifically, the pinyin heat value of each pinyin in the target standard pinyin sequence is incremented by 1, and the sequence heat value of the target standard pinyin sequence is incremented by 1.
S240, taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized.
After the target standard pinyin sequence is determined, the text data corresponding to it is used as the text data corresponding to the speech data to be recognized. Illustratively, if the target standard pinyin sequence is "wo mai de PRODSORT dao na le", the corresponding text data "where did the PRODSORT I bought go" is taken as the text data corresponding to the speech data to be recognized.
In one embodiment of the present invention, the method further comprises: and if the target standard pinyin sequence matched with the pinyin data to be matched does not exist in the standard pinyin sequence, using the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
If a target standard pinyin sequence matching the pinyin data to be matched cannot be obtained by either complete matching or complementary matching, that is, if no target standard pinyin sequence matching the pinyin data to be matched exists among the standard pinyin sequences, the correction of the original pinyin data is deemed erroneous, and the text data corresponding to the original pinyin data is used as the text data corresponding to the voice data to be recognized.
In the technical scheme of this embodiment of the invention, the pinyin data to be matched is matched against pre-established standard pinyin sequences, and the text data corresponding to the voice data to be recognized is determined according to the matching result. Matching nodes are determined in the pinyin data to be matched, and the target standard pinyin sequence matching the pinyin data to be matched is determined by matching these nodes against the standard pinyin nodes in the standard pinyin sequences; the text data corresponding to the target standard pinyin sequence is then used as the text data corresponding to the voice data to be recognized. This makes the matching result more accurate and improves the accuracy of the voice recognition result.
On the basis of the above scheme, after obtaining at least one candidate standard pinyin sequence and before determining a weight of the candidate standard pinyin sequence for each candidate standard pinyin sequence, the method further includes:
aiming at each candidate standard pinyin sequence, comparing the candidate standard pinyin sequence with the pinyin data to be matched, and determining a difference value between the candidate standard pinyin sequence and the pinyin data to be matched; and if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is greater than a preset difference threshold value, deleting the candidate standard pinyin sequence.
Optionally, a candidate standard pinyin sequence obtained through bidirectional matching may differ considerably from the pinyin data to be matched, so the candidate standard pinyin sequences may be screened after they are obtained. Specifically, a difference threshold may be preset; after the candidate standard pinyin sequences are obtained, the difference value between each candidate standard pinyin sequence and the pinyin sequence to be matched is calculated, and any candidate standard pinyin sequence whose difference value exceeds the difference threshold is deleted. The difference threshold can be set according to actual requirements; optionally, it may be 0.5.
Illustratively, suppose the candidate standard pinyin sequences include candidate standard pinyin sequence 1, candidate standard pinyin sequence 2 and candidate standard pinyin sequence 3, the difference threshold is 0.5, and the difference values between the pinyin sequence to be matched and candidate standard pinyin sequences 1, 2 and 3 are 0.4, 0.5 and 0.7, respectively. Then candidate standard pinyin sequence 3, whose difference value of 0.7 exceeds the difference threshold 0.5, is deleted, and candidate standard pinyin sequences 1 and 2 are retained.
Optionally, the difference value between a candidate standard pinyin sequence and the pinyin sequence to be matched may be calculated by C = m/n, where C is the difference value, m is the number of pinyins that differ between the candidate standard pinyin sequence and the pinyin sequence to be matched, and n is the total number of pinyins in the pinyin sequence to be matched.
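The C = m/n screening can be sketched as below, assuming m counts candidate pinyins that have no counterpart in the sequence to be matched (consistent with the 0/6, 1/6 and 4/6 figures that appear later in the text) and that candidates with C greater than the threshold are dropped; the function names are illustrative:

```python
def difference_value(candidate, to_match):
    # C = m / n: m = pinyins in the candidate with no counterpart in the
    # sequence to be matched, n = total pinyins in the sequence to be matched
    remaining = list(to_match)
    m = 0
    for p in candidate:
        if p in remaining:
            remaining.remove(p)
        else:
            m += 1
    return m / len(to_match)

def screen_candidates(candidates, to_match, threshold=0.5):
    # delete candidates whose difference value exceeds the threshold
    return [c for c in candidates
            if difference_value(c, to_match) <= threshold]
```

For instance, "wo mai de PRODSORT dao na le" measured against "wo de PRODSORT dao na le" yields C = 1/6 and is kept under the 0.5 threshold.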
EXAMPLE III
Fig. 3a is a flowchart of a speech recognition method according to a third embodiment of the present invention. On the basis of the above embodiments, this embodiment provides a preferred implementation, taking voice intelligent customer service as an example. In this embodiment, an undirected search graph (of standard pinyin sequences) is generated from a manually compiled near-sound pinyin dictionary and error-prone sentences, so that the speech recognition result is dynamically corrected and the user's intention is correctly recognized. The voice recognition method provided by this embodiment of the invention may be executed by an intelligent customer service system. Fig. 3b is a schematic structural diagram of an intelligent customer service system according to the third embodiment of the present invention. As shown in fig. 3b, the intelligent customer service system includes a speech recognition module (Automatic Speech Recognition, ASR) 310, a recognition correction module 320, a natural language processing module (Natural Language Processing, NLP) 340, and a speech synthesis module (Text To Speech, TTS), where the speech synthesis module is not shown in the figure. The voice intelligent customer service mainly works as follows: the user's voice is converted into text by the automatic speech recognition technology of the speech recognition module 310; the text is passed into the recognition correction module 320 to obtain a corrected text recognition result; the result is passed into the natural language processing module for processing and response; and finally, the response text is converted into voice by the speech synthesis module and broadcast to the user.
The recognition correction module 320 comprises a near-sound pinyin dictionary 331, an undirected graph matching module 332, an error-prone sentence undirected graph 333, and a hotlist 334 of sentences and words. The near-sound pinyin dictionary 331 stores the mapping between common wrong pinyins and correct near-sound pinyins, the error-prone sentence undirected graph 333 stores the undirected search graph of error-prone sentences, the hotlist 334 stores the heat values of sentences and words, and the undirected graph matching module 332 matches the corrected pinyin against the error-prone sentence undirected graph to obtain a matching result.
Fig. 3c is a schematic flow chart of an intelligent customer service response method according to the third embodiment of the present invention. As shown in fig. 3c, high-frequency error-prone sentences are configured and generalized, the sentences are converted into pinyin, the undirected graph and the hotlists are initialized, the hotlists of sentences and words are generated, and the undirected search graph is generated. When the user's spoken voice information is received, it is converted into text through ASR, the text is generalized, the sentence is converted into the correct pinyin based on the pre-constructed near-sound pinyin dictionary, and matching results for the correct pinyin are then searched in the undirected graph. Specifically, the undirected graph is searched to check whether all pinyins contained in the correct pinyin exist in it. If not all of them exist, the original sentence is returned, the NLU module performs intention identification and generates a text response from the original sentence, and the text response is converted into a voice response through TTS and fed back to the user.
If all the pinyins contained in the correct pinyin exist in the undirected graph, a bidirectional matching algorithm is used to obtain all matching sequences containing the correct pinyin. The difference value between each matching sequence and the correct pinyin is calculated, the matching sequences whose difference values are greater than a set threshold are deleted, and the remaining matching sequences are taken as matching results. The weight of each matching result is then calculated, and the matching result with the highest weight is output to the NLU module, which performs intention identification and generates a text response from it; the text response is converted into a voice response through TTS and fed back to the user.
The speech recognition method provided by the present embodiment will be described in detail below. As shown in fig. 3a, the speech recognition method provided in this embodiment includes:
s310, establishing a near-phonetic alphabet dictionary.
High-frequency error-prone near-sound pinyins are manually sorted to obtain the mapping between common wrong-word pinyins and the correct near-sound pinyins, and the mapping is stored in a near-sound pinyin dictionary kept in a database. Table 1 shows an example of the mapping contained in the near-sound pinyin dictionary. As shown in Table 1, the common wrong-word pinyins include "La" and "Wang"; the correct near-sound pinyin corresponding to "La" is "Na", and the correct near-sound pinyin corresponding to "Wang" is "Huang".
TABLE 1
Common wrong word pinyin | Correct near sound phonetic alphabet |
La | Na |
Wang | Huang |
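The dictionary lookup can be sketched as a plain mapping. The entries below are the two rows of Table 1 (lower-cased for illustration); the names are not from the patent:

```python
# Hypothetical near-sound dictionary mirroring the two rows of Table 1
NEAR_SOUND = {"la": "na", "wang": "huang"}

def correct_pinyin(pinyins):
    # replace each common wrong-word pinyin with its correct near-sound
    # pinyin; pinyins without an entry pass through unchanged
    return [NEAR_SOUND.get(p, p) for p in pinyins]
```

For example, the recognized sequence "wo de dao la" becomes "wo de dao na" after correction.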
And S320, establishing an undirected retrieval map of the error-prone sentence.
The standard descriptions of high-frequency error-prone sentences are manually compiled; after generalization, all characters are converted into pinyin and stored in an undirected graph. For example, the TinyPinyin tool may be used to convert the text into pinyin. Table 2 shows an example of the correspondence between the original sentence, the generalized text, and the pinyin.
After the original sentence is converted into pinyin, the pinyin of each single character is used as a node, and the concatenated pinyin phrases are used to construct an undirected search graph in forward order. Fig. 3d is a schematic diagram of an undirected search graph in a speech recognition method according to the third embodiment of the present invention. As shown in fig. 3d, adjacent pinyin nodes in each pinyin phrase are connected to form an undirected search graph containing the connection relations.
TABLE 2
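The graph construction step (single-character pinyins as nodes, adjacent pinyins connected) can be sketched as an adjacency map; the phrase in the usage note is an assumed example and the function name is illustrative:

```python
from collections import defaultdict

def build_undirected_graph(pinyin_phrases):
    # each phrase is a list of single-character pinyins; adjacent
    # pinyins in a phrase become connected nodes of the undirected graph
    graph = defaultdict(set)
    for phrase in pinyin_phrases:
        for a, b in zip(phrase, phrase[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

Feeding it the phrase "wo de dao na le" connects wo-de, de-dao, dao-na and na-le in both directions.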
S330, initializing a hotlist of sentences and words.
The heat values of the error-prone sentences and of the single characters in them are initialized. Optionally, the initial sentence heat value of each sentence and the initial word heat value of each word may be set to 0. Table 3 shows an example of the heat values of the words in a sentence and of the sentence itself. As shown in Table 3, the sentence "when will the PRODSORT I bought arrive" has a sentence heat value of 1; within it, "wo" (I) has a word heat value of 3, "mai" (bought) has 2, "de" has 3, "PRODSORT" has 3, and each of the remaining characters has a word heat value of 2.
TABLE 3
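The hotlist initialization of S330 can be sketched as below; the key representation (space-joined phrase for sentences, single pinyin for words) is an assumption for illustration:

```python
def init_hotlists(pinyin_phrases):
    # S330: sentence heat and word heat both start at 0; keys are an
    # assumed representation (space-joined phrase, single pinyin)
    sentence_heat = {" ".join(p): 0 for p in pinyin_phrases}
    word_heat = {}
    for phrase in pinyin_phrases:
        for p in phrase:
            word_heat.setdefault(p, 0)
    return sentence_heat, word_heat
```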
And S340, converting the user question into the correct pinyin.
After generalization, the user question is converted into pinyin, the near-sound word list is matched, and the correct near-sound pinyin is obtained. For example, the TinyPinyin tool converts the characters into pinyin, and the near-sound pinyin list is matched to obtain the correct pinyin. Table 4 shows an example of the correspondence between the original sentence, the generalized sentence, the pinyin, and the correct pinyin.
TABLE 4
And S350, searching the undirected search graph by adopting a bidirectional matching algorithm to obtain at least one matching result.
First, it is ensured that all pinyins exist in the undirected search graph; if any pinyin does not exist in the graph, the matching fails and the original character string is returned directly. If they all exist, the undirected search graph is traversed according to the bidirectional matching algorithm, all matching results are returned through complete matching and complementary matching, and the matched pinyins are replaced with the character sequences in the undirected search graph.
In one embodiment, suppose the correct pinyin to be matched is "wo de PRODSORT dao na le". First, it is checked whether all the pinyins in the correct pinyin to be matched exist in the undirected search graph; here they all do. Then, matching starts from the head "wo" and the tail "le", and both the forward and the backward direction completely match the undirected graph pinyin sequence "wo-de-PRODSORT-dao-na-le". Next, all combinations containing the current sequence are checked and the other possible whole sentences are listed, such as "wo de PRODSORT dao na le" and "wo mai de PRODSORT dao na le", with the corresponding text sequences "where did my PRODSORT go" and "where did the PRODSORT I bought go" returned. Finally, matching results changed by more than 50% are filtered out; the changes of the two results (0/6 and 1/6) do not exceed 50%, so all results are returned.
In one embodiment, suppose the correct pinyin to be matched is "wo de PRODSORT shen me fa". First, it is checked whether all the pinyins in the correct pinyin to be matched exist in the undirected search graph; here they all do. Then, matching starts from the head "wo" and the tail "fa" to find a connected sequence. Fig. 3e is a schematic diagram of the bidirectional matching method in a speech recognition method according to the third embodiment of the present invention, where the solid one-way arrows represent the forward matching process, the dashed one-way arrows represent the reverse matching process, and the solid two-way arrows represent positions where forward and reverse matching both succeed. As shown in fig. 3e, the forward matching ends at "wo-de-PRODSORT-shen-me", and the reverse matching ends at "fa". At this point, the forward matching sequence selects the next node from the graph and tries "wo-de-PRODSORT-shen-me-shi", while the reverse matching sequence selects the next node and tries "fa-hou"; since a path connects "shi" and "hou", a connected undirected graph sequence is completely matched: "wo-de-PRODSORT-shen-me-shi-hou-fa". If no connected sequence existed, the original source character string would be returned directly. It is then judged whether any combinations containing the current sequence exist, yielding the other whole sentences "wo mai de PRODSORT shen me shi hou fa", "wo mai de PRODSORT shen me shi hou fa huo", "wo de PRODSORT shen me shi hou fa", and "wo de PRODSORT shen me shi hou fa huo", and the text corresponding to each pinyin sequence, such as "when does the PRODSORT I bought ship" and "when does my PRODSORT ship", is returned.
Finally, matching results changed by more than 50% are filtered out. The pinyin sequence "wo mai de PRODSORT shen me shi hou fa huo" is changed by 4/6, which exceeds 50%, so that sequence is deleted and all the other sequences are returned.
When searching for a connected sequence through forward and reverse matching, the tried nodes extend up to two levels in each of the forward and reverse directions, and each level may contain several nodes, so all combinations must be exhausted and matched repeatedly to find every connected sequence. If no combination matches completely after all attempts, the matching is abandoned and the original text is returned. It should be noted that a threshold is set on the number of levels tried, at most two, because trying too many levels would hurt matching performance. For example, if there were no connected path between "shi" and "hou" in the example above, the next-level nodes of "shi" and "hou" would not be tried further.
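The step that joins the forward-matched segment to the reverse-matched segment can be sketched as a bounded search for intermediate nodes. This is a simplified illustration, not a full bidirectional matcher: it inserts at most two graph nodes (mirroring the two-level cap described above), and the graph is built from an assumed phrase:

```python
from collections import defaultdict

def build_graph(pinyin_phrases):
    # adjacent pinyins in each phrase become connected undirected nodes
    graph = defaultdict(set)
    for phrase in pinyin_phrases:
        for a, b in zip(phrase, phrase[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connect_with_gap(graph, left, right):
    # Try to join the tail of the forward match (left) to the head of the
    # reverse match (right) by inserting at most two graph nodes.
    # Returns the inserted nodes, or None if no bounded connection exists.
    if right in graph[left]:
        return []                      # already adjacent, nothing to insert
    for mid in graph[left]:            # one intermediate node
        if right in graph[mid]:
            return [mid]
    for m1 in graph[left]:             # two intermediate nodes
        for m2 in graph[m1]:
            if right in graph[m2]:
                return [m1, m2]
    return None
```

For the example above, with the phrase "shen me shi hou fa" in the graph, joining "me" to "fa" inserts ["shi", "hou"], reproducing the connected sequence wo-de-PRODSORT-shen-me-shi-hou-fa.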
And S360, calculating the whole sentence weight of the matching result, and taking the matching result with the maximum weight as an output sentence.
The weight of each pinyin in the candidate standard pinyin sequence may be calculated by F(i) = ((HW + 1)/(H(i) + 1)) × (1 + log10(H(i) + 1)), and the weight of the candidate standard pinyin sequence by W = F(1) × F(2) × … × F(n). Here, F(i) represents the weight of the i-th pinyin in the candidate standard pinyin sequence, HW represents the sequence heat value of the candidate standard pinyin sequence, H(i) represents the pinyin heat value of the i-th pinyin in the candidate standard pinyin sequence, W represents the weight of the candidate standard pinyin sequence, and n is the total number of pinyins in the candidate standard pinyin sequence.
TABLE 5
Table 5 schematically shows the correspondence between the original sentence, the matched sentences, the whole-sentence weight calculation, and the output sentence. As shown in Table 5, the matched sentences corresponding to the original sentence include "where did my PRODSORT go" and "where did the PRODSORT I bought go", with weights of 0.848 and 0.006, respectively; "where did my PRODSORT go" is therefore taken as the output sentence, and the answer is determined based on the output sentence.
S370, updating the word heat and the sentence heat.
The word heat value of every word in the output sentence is incremented by one, and the heat value of the output sentence is incremented by one. Table 6 schematically shows the updated word heat values and sentence heat values.
TABLE 6
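The S370 update (increment the heat of every word in the output sentence, and the heat of the output sentence itself, by one) can be sketched as below; the dictionary-based storage is an assumed representation:

```python
def update_heat(word_heat, sentence_heat, sentence, words):
    # S370: the output sentence's heat and each of its word heats gain 1;
    # unseen entries are treated as starting from the initial value 0
    sentence_heat[sentence] = sentence_heat.get(sentence, 0) + 1
    for w in words:
        word_heat[w] = word_heat.get(w, 0) + 1
```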
Near-sound word recognition errors and missing-word recognition errors in speech-to-text conversion can cause the user's intention to be misidentified. To address this, this embodiment adds a manually configurable undirected search graph of high-frequency error-prone sentences, and resolves high-frequency speech recognition errors through bidirectional matching on the undirected graph and weight-based ranking of the matching results. Whereas speech recognition models in the prior art require long training periods, this embodiment adapts dynamically to changing after-sale scenario requirements simply by adjusting the configuration of high-frequency error-prone sentences. It thus corrects high-frequency error-prone sentences, improves the accuracy and recall of user intention recognition, and improves the user experience.
Example four
Fig. 4 is a schematic structural diagram of a speech recognition apparatus according to a fourth embodiment of the present invention. The speech recognition means may be implemented in software and/or hardware, for example, the speech recognition means may be configured in a computer device. As shown in fig. 4, the apparatus includes a pinyin data acquisition module 410, a pinyin data calibration module 420, and a text data determination module 430, wherein:
a pinyin data obtaining module 410, configured to obtain voice data to be recognized, and determine original pinyin data corresponding to the voice data to be recognized;
a pinyin data calibration module 420, configured to correct the original pinyin data to obtain pinyin data to be matched;
and the text data determining module 430 is configured to match the pinyin data to be matched with a pre-established standard pinyin sequence, and determine text data corresponding to the voice data to be recognized according to a matching result.
In this embodiment of the invention, the pinyin data acquisition module obtains the voice data to be recognized and determines the original pinyin data corresponding to it; the pinyin data calibration module corrects the original pinyin data to obtain the pinyin data to be matched; and the text data determination module matches the pinyin data to be matched against pre-established standard pinyin sequences and determines the text data corresponding to the voice data to be recognized according to the matching result. Because the original pinyin data is corrected and recognition is based on the corrected data, the accuracy of voice recognition is improved, and the response accuracy of the voice intelligent customer service is improved in turn.
On the basis of the above scheme, the pinyin data calibration module 420 is specifically configured to:
determining wrong pinyin contained in the original pinyin data as pinyin to be corrected according to a preset pinyin near-sound table; wherein, the pinyin near sound table stores the corresponding relation between at least one wrong pinyin and a standard pinyin;
and correcting the pinyin to be corrected contained in the original pinyin data into the standard pinyin corresponding to the pinyin to be corrected, so as to obtain the pinyin data to be matched.
On the basis of the above scheme, the text data determining module 430 includes:
the target sequence determining unit is used for determining a matching node in the pinyin data to be matched and determining a target standard pinyin sequence matched with the pinyin data to be matched by matching the matching node with a standard pinyin node in the standard pinyin sequence;
and the text data determining unit is used for taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized.
On the basis of the above scheme, the target sequence determining unit includes:
a matching node determining subunit, configured to use each pinyin in the pinyin data to be matched as a matching node;
and the bidirectional matching subunit is used for matching the matching node with the standard pinyin node in the standard pinyin sequence by using a bidirectional matching algorithm and obtaining the target standard pinyin sequence according to a matching result.
On the basis of the above scheme, the bidirectional matching subunit is specifically configured to:
matching the matching node with the standard pinyin node by using the bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence;
aiming at each candidate standard pinyin sequence, determining a weight of the candidate standard pinyin sequence according to a sequence heat value of the candidate standard pinyin sequence and a pinyin heat value of each pinyin in the candidate standard pinyin sequence, wherein the sequence heat value is used for representing the use frequency of the standard pinyin sequence, and the pinyin heat value is used for representing the use frequency of the pinyin;
and taking the candidate standard pinyin sequence with the maximum weight as the target standard pinyin sequence.
On the basis of the above scheme, the bidirectional matching subunit is further configured to:
after obtaining at least one candidate standard pinyin sequence and before determining the weight of the candidate standard pinyin sequence for each candidate standard pinyin sequence, comparing the candidate standard pinyin sequence with the pinyin data to be matched for each candidate standard pinyin sequence, and determining the difference value between the candidate standard pinyin sequence and the pinyin data to be matched;
and if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is greater than a preset difference threshold value, deleting the candidate standard pinyin sequence.
On the basis of the above scheme, the text data determining module 430 is further configured to:
and if the target standard pinyin sequence matched with the pinyin data to be matched does not exist in the standard pinyin sequence, using the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
On the basis of the above scheme, the apparatus further comprises:
and the heat value updating module is used for updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence.
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computer device 512 suitable for use in implementing embodiments of the present invention. The computer device 512 shown in FIG. 5 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 512 is in the form of a general purpose computing device. Components of computer device 512 may include, but are not limited to: one or more processors 516, a system memory 528, and a bus 518 that couples the various system components including the system memory 528 and the processors 516.
The system memory 528 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)530 and/or cache memory 532. The computer device 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage 534 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD ROM, DVD ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 518 through one or more data media interfaces. Memory 528 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 540 having a set (at least one) of program modules 542, including but not limited to an operating system, one or more application programs, other program modules, and program data, may be stored in, for example, the memory 528, each of which examples or some combination may include an implementation of a network environment. The program modules 542 generally perform the functions and/or methods of the described embodiments of the invention.
The computer device 512 may also communicate with one or more external devices 514 (e.g., keyboard, pointing device, display 524, etc.), with one or more devices that enable a user to interact with the computer device 512, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 512 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 522. Also, computer device 512 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 520. As shown, the network adapter 520 communicates with the other modules of the computer device 512 via the bus 518. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the computer device 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 516 executes various functional applications and data processing by executing programs stored in the system memory 528, for example, implementing a voice recognition method provided by an embodiment of the present invention, the method includes:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
Of course, those skilled in the art will understand that the processor may also implement the technical solution of the speech recognition method provided by any embodiment of the present invention.
EXAMPLE six
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a speech recognition method provided in an embodiment of the present invention, where the method includes:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiment of the present invention is not limited to the method operations described above, and may also perform related operations in the speech recognition method provided by any embodiment of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the appended claims.
Claims (11)
1. A speech recognition method, comprising:
acquiring voice data to be recognized, and determining original pinyin data corresponding to the voice data to be recognized;
correcting the original pinyin data to obtain pinyin data to be matched;
and matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to a matching result.
2. The method of claim 1, wherein the correcting the original pinyin data to obtain pinyin data to be matched comprises:
determining wrong pinyin contained in the original pinyin data as pinyin to be corrected according to a preset pinyin near-sound table; wherein the pinyin near-sound table stores a correspondence between at least one wrong pinyin and a standard pinyin;
and correcting the pinyin to be corrected contained in the original pinyin data into the standard pinyin corresponding to the pinyin to be corrected, so as to obtain the pinyin data to be matched.
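The correction step of claim 2 can be sketched as a plain table lookup: each pinyin token found in the near-sound table is replaced by its standard counterpart, and unknown tokens pass through unchanged. The table entries and pinyin values below are illustrative stand-ins, not pairs disclosed by the patent.

```python
# Hypothetical near-sound table: maps wrong pinyin to standard pinyin.
# The entries model common confusions (e.g. missing retroflex, l/n mix-ups);
# they are assumptions for illustration, not data from the patent.
NEAR_SOUND_TABLE = {
    "zang": "zhang",  # zh- initial dropped to z-
    "si": "shi",      # sh- initial dropped to s-
    "lan": "nan",     # l/n confusion
}

def correct_pinyin(original):
    """Replace each wrong pinyin with its standard pinyin, if listed."""
    return [NEAR_SOUND_TABLE.get(p, p) for p in original]

print(correct_pinyin(["zang", "hai", "si"]))  # ['zhang', 'hai', 'shi']
```

In practice such a table would be built from observed recognition errors; the lookup itself stays O(1) per token.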
3. The method as claimed in claim 1, wherein the matching the pinyin data to be matched with a pre-established standard pinyin sequence, and determining text data corresponding to the voice data to be recognized according to the matching result comprises:
determining a matching node in the pinyin data to be matched, and determining a target standard pinyin sequence matched with the pinyin data to be matched by matching the matching node with a standard pinyin node in the standard pinyin sequence;
and taking the text data corresponding to the target standard pinyin sequence as the text data corresponding to the voice data to be recognized.
4. The method as claimed in claim 3, wherein the determining a matching node in the pinyin data to be matched, and determining a target standard pinyin sequence matched with the pinyin data to be matched by matching the matching node with a standard pinyin node in the standard pinyin sequence, comprises:
taking each pinyin in the pinyin data to be matched as a matching node;
and matching the matching node with a standard pinyin node in the standard pinyin sequence by using a bidirectional matching algorithm, and obtaining the target standard pinyin sequence according to a matching result.
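A bidirectional matching algorithm of the kind named in claim 4 can be sketched as forward and backward greedy longest-first segmentation of the pinyin nodes against a lexicon of standard pinyin sequences, keeping the better of the two passes. The lexicon contents and the tie-breaking heuristic are assumptions; the patent does not disclose these details.

```python
# Toy lexicon of standard pinyin sequences (tuples of pinyin nodes); illustrative only.
LEXICON = {("bei", "jing"), ("da", "xue"), ("jing", "da")}
MAX_LEN = 3  # longest standard sequence, counted in pinyin nodes

def directional_match(nodes, lexicon, backward=False):
    """Greedy longest-first segmentation, scanning left-to-right or right-to-left."""
    segments = []
    if not backward:
        i = 0
        while i < len(nodes):
            for l in range(min(MAX_LEN, len(nodes) - i), 0, -1):
                cand = tuple(nodes[i:i + l])
                if l == 1 or cand in lexicon:  # fall back to a single node
                    segments.append(cand)
                    i += l
                    break
    else:
        j = len(nodes)
        while j > 0:
            for l in range(min(MAX_LEN, j), 0, -1):
                cand = tuple(nodes[j - l:j])
                if l == 1 or cand in lexicon:
                    segments.insert(0, cand)
                    j -= l
                    break
    return segments

def bidirectional_match(nodes, lexicon):
    fwd = directional_match(nodes, lexicon)
    bwd = directional_match(nodes, lexicon, backward=True)
    # Assumed heuristic: prefer the segmentation with fewer, longer segments.
    return fwd if len(fwd) <= len(bwd) else bwd

print(bidirectional_match(["bei", "jing", "da", "xue"], LEXICON))
# [('bei', 'jing'), ('da', 'xue')]
```

Scanning from both ends guards against the greedy forward pass committing to a wrong prefix (here, the distractor entry `("jing", "da")`).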
5. The method as claimed in claim 4, wherein the using a bidirectional matching algorithm to match the matching node with a standard pinyin node in the standard pinyin sequence and obtaining the target standard pinyin sequence according to the matching result comprises:
matching the matching node with the standard pinyin node by using the bidirectional matching algorithm to obtain at least one candidate standard pinyin sequence;
for each candidate standard pinyin sequence, determining a weight of the candidate standard pinyin sequence according to a sequence heat value of the candidate standard pinyin sequence and a pinyin heat value of each pinyin in the candidate standard pinyin sequence, wherein the sequence heat value is used for representing the use frequency of the standard pinyin sequence, and the pinyin heat value is used for representing the use frequency of the pinyin;
and taking the candidate standard pinyin sequence with the maximum weight as the target standard pinyin sequence.
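The weighting in claim 5 combines a sequence-level heat value with per-pinyin heat values. The patent does not disclose the combination formula, so the sketch below assumes a plain sum; all heat values are made-up illustrative counts.

```python
# Assumed heat tables: usage frequencies for whole sequences and single pinyin.
SEQ_HEAT = {("bei", "jing"): 50, ("bei", "jin"): 5}   # sequence heat values
PINYIN_HEAT = {"bei": 10, "jing": 30, "jin": 8}       # per-pinyin heat values

def weight(seq):
    """Assumed weighting: sequence heat plus the heat of each pinyin in it."""
    return SEQ_HEAT.get(seq, 0) + sum(PINYIN_HEAT.get(p, 0) for p in seq)

candidates = [("bei", "jing"), ("bei", "jin")]
target = max(candidates, key=weight)  # candidate with the maximum weight wins
print(target)  # ('bei', 'jing')
```

Any monotone combination (weighted sum, product, log-linear) would fit the claim language; the sum is just the simplest choice.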
6. The method of claim 5, wherein after obtaining at least one candidate standard pinyin sequence and before determining a weight for the candidate standard pinyin sequence for each of the candidate standard pinyin sequences, further comprising:
for each candidate standard pinyin sequence, comparing the candidate standard pinyin sequence with the pinyin data to be matched, and determining a difference value between the candidate standard pinyin sequence and the pinyin data to be matched;
and if the difference value between the candidate standard pinyin sequence and the pinyin data to be matched is greater than a preset difference threshold value, deleting the candidate standard pinyin sequence.
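Claim 6 leaves the "difference value" metric open; one natural reading is token-level Levenshtein edit distance between the candidate sequence and the pinyin data to be matched, with candidates beyond a preset threshold discarded. The sketch below assumes that metric and an illustrative threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance between two pinyin sequences, counted in tokens."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[len(a)][len(b)]

DIFF_THRESHOLD = 1  # assumed preset difference threshold
to_match = ["bei", "jing", "da", "xue"]
candidates = [["bei", "jing", "da", "xue"], ["shang", "hai", "da", "xue"]]
# Keep only candidates whose difference value stays within the threshold.
kept = [c for c in candidates if edit_distance(c, to_match) <= DIFF_THRESHOLD]
print(kept)  # [['bei', 'jing', 'da', 'xue']]
```

Pruning before the weighting of claim 5 keeps the candidate set small, so the heat-value comparison only ranks plausible sequences.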
7. The method of claim 3, further comprising:
and if the target standard pinyin sequence matched with the pinyin data to be matched does not exist in the standard pinyin sequence, using the text data corresponding to the original pinyin data as the text data corresponding to the voice data to be recognized.
8. The method of claim 5, further comprising:
and updating the pinyin heat value of each pinyin in the target standard pinyin sequence and the sequence heat value of the target standard pinyin sequence.
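The update in claim 8 can be sketched as incrementing usage counters after each successful match, so frequently matched sequences rank higher in later queries. Counter-based increments are an assumption; the patent does not specify the update rule.

```python
from collections import Counter

seq_heat = Counter()     # sequence heat values, keyed by pinyin tuple
pinyin_heat = Counter()  # per-pinyin heat values

def update_heat(target_seq):
    """Bump the heat of the matched sequence and of every pinyin in it."""
    seq_heat[tuple(target_seq)] += 1
    for p in target_seq:
        pinyin_heat[p] += 1

update_heat(["bei", "jing"])
update_heat(["bei", "jing"])
print(seq_heat[("bei", "jing")], pinyin_heat["jing"])  # 2 2
```

This closes the loop with claim 5: the heat values consumed by the weighting step are the ones maintained here.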
9. A speech recognition apparatus, comprising:
the pinyin data acquisition module is used for acquiring the voice data to be recognized and determining original pinyin data corresponding to the voice data to be recognized;
the pinyin data calibration module is used for correcting the original pinyin data to obtain pinyin data to be matched;
and the text data determining module is used for matching the pinyin data to be matched with a pre-established standard pinyin sequence and determining text data corresponding to the voice data to be recognized according to a matching result.
10. A computer device, the device comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method according to any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910710043.1A CN111739514B (en) | 2019-07-31 | 2019-07-31 | Voice recognition method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111739514A (en) | 2020-10-02 |
CN111739514B (en) | 2023-11-14 |
Family
ID=72645844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910710043.1A Active CN111739514B (en) | 2019-07-31 | 2019-07-31 | Voice recognition method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111739514B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006026908A1 (en) * | 2004-08-25 | 2006-03-16 | Dong Li | A chinese characters inputting method which uses continuous phonetic letters in a portable terminal |
CN101067780A (en) * | 2007-06-21 | 2007-11-07 | 腾讯科技(深圳)有限公司 | Character inputting system and method for intelligent equipment |
CN103377652A (en) * | 2012-04-25 | 2013-10-30 | 上海智臻网络科技有限公司 | Method, device and equipment for carrying out voice recognition |
CN105489220A (en) * | 2015-11-26 | 2016-04-13 | 小米科技有限责任公司 | Method and device for recognizing speech |
CN109597983A (en) * | 2017-09-30 | 2019-04-09 | 北京国双科技有限公司 | A kind of spelling error correction method and device |
CN109992765A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Text error correction method and device, storage medium and electronic equipment |
CN110019650A (en) * | 2018-09-04 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Method, apparatus, storage medium and the electronic equipment of search associational word are provided |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112417102A (en) * | 2020-11-26 | 2021-02-26 | 中国科学院自动化研究所 | Voice query method, device, server and readable storage medium |
CN112417102B (en) * | 2020-11-26 | 2024-03-22 | 中国科学院自动化研究所 | Voice query method, device, server and readable storage medium |
CN112509566A (en) * | 2020-12-22 | 2021-03-16 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment, storage medium and program product |
CN112509566B (en) * | 2020-12-22 | 2024-03-19 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment, storage medium and program product |
CN112651854A (en) * | 2020-12-23 | 2021-04-13 | 讯飞智元信息科技有限公司 | Voice scheduling method and device, electronic equipment and storage medium |
CN112767923A (en) * | 2021-01-05 | 2021-05-07 | 上海微盟企业发展有限公司 | Voice recognition method and device |
CN113129894A (en) * | 2021-04-12 | 2021-07-16 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, speech recognition device, electronic device and storage medium |
CN113158649A (en) * | 2021-05-27 | 2021-07-23 | 广州广电运通智能科技有限公司 | Error correction method, equipment, medium and product for subway station name recognition |
CN113744722A (en) * | 2021-09-13 | 2021-12-03 | 上海交通大学宁波人工智能研究院 | Off-line speech recognition matching device and method for limited sentence library |
CN117116267A (en) * | 2023-10-24 | 2023-11-24 | 科大讯飞股份有限公司 | Speech recognition method and device, electronic equipment and storage medium |
CN117116267B (en) * | 2023-10-24 | 2024-02-13 | 科大讯飞股份有限公司 | Speech recognition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111739514B (en) | Voice recognition method, device, equipment and medium | |
US11322153B2 (en) | Conversation interaction method, apparatus and computer readable storage medium | |
CN110210029B (en) | Method, system, device and medium for correcting error of voice text based on vertical field | |
CN108170749B (en) | Dialog method, device and computer readable medium based on artificial intelligence | |
US10504010B2 (en) | Systems and methods for fast novel visual concept learning from sentence descriptions of images | |
CN107301865B (en) | Method and device for determining interactive text in voice input | |
CN109036391B (en) | Voice recognition method, device and system | |
CN1542736B (en) | Rules-based grammar for slots and statistical model for preterminals in natural language understanding system | |
TWI508057B (en) | Speech recognition system and method | |
CN109325091B (en) | Method, device, equipment and medium for updating attribute information of interest points | |
CN112036162B (en) | Text error correction adaptation method and device, electronic equipment and storage medium | |
CN112100354B (en) | Man-machine conversation method, device, equipment and storage medium | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN110415679B (en) | Voice error correction method, device, equipment and storage medium | |
CN111444329A (en) | Intelligent conversation method and device and electronic equipment | |
CN109299471B (en) | Text matching method, device and terminal | |
CN111382260A (en) | Method, device and storage medium for correcting retrieved text | |
CN110674255A (en) | Text content auditing method and device | |
CN112861521B (en) | Speech recognition result error correction method, electronic device and storage medium | |
WO2014036827A1 (en) | Text correcting method and user equipment | |
CN108595412B (en) | Error correction processing method and device, computer equipment and readable medium | |
TWI752406B (en) | Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product | |
CN112214595A (en) | Category determination method, device, equipment and medium | |
CN112765985A (en) | Named entity identification method for specific field patent embodiment | |
CN113268452B (en) | Entity extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||