CN117116267B - Speech recognition method and device, electronic equipment and storage medium

Speech recognition method and device, electronic equipment and storage medium

Info

Publication number
CN117116267B
Authority
CN
China
Prior art keywords
pronunciation
sequence
target
word
candidate
Legal status
Active
Application number
CN202311380902.8A
Other languages
Chinese (zh)
Other versions
CN117116267A (en)
Inventor
刘哲
孙磊
张儒瑞
魏冲洲
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202311380902.8A
Publication of CN117116267A
Application granted
Publication of CN117116267B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems


Abstract

The application discloses a speech recognition method and apparatus, an electronic device, and a storage medium. The method includes: obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on a target speech; obtaining a target text sequence corresponding to the target speech based on each candidate text sequence; and determining a target pronunciation sequence using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and is used to provide a pronunciation reference for that target word when it appears in speech associated with the target speech. With this method, the accuracy of the target pronunciation sequence corresponding to the target text obtained by speech recognition can be improved.

Description

Speech recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of speech recognition technology, existing systems can convert speech into text with high accuracy. However, they pay little attention to the pronunciation of the recognized text, so the pronunciation associated with that text is often inaccurate. For polyphonic characters in particular, existing speech recognition techniques frequently fail to determine the correct pronunciation.
Yet the correct pronunciation of text is critical in everyday applications. In voice interaction, for example, speech recognition and speech synthesis are two indispensable components: speech recognition lets the machine "hear" by converting a speech signal into recognized text, and speech synthesis lets the machine "speak" by converting that recognized text back into a speech signal, together enabling voice interaction with a user. The accuracy of the pronunciation associated with the text therefore directly affects the accuracy of the speech signal synthesized from it.
Disclosure of Invention
The main technical problem addressed by this application is to provide a speech recognition method and apparatus, an electronic device, and a storage medium that improve the accuracy of the pronunciation sequence corresponding to the target text obtained by speech recognition.
To solve the above technical problem, a first aspect of the present application provides a speech recognition method, including: obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on a target speech; obtaining a target text sequence corresponding to the target speech based on each candidate text sequence; and determining a target pronunciation sequence using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and is used to provide a pronunciation reference for that target word when it appears in speech associated with the target speech.
To solve the above technical problem, a second aspect of the present application provides a speech recognition apparatus, including an acquisition module, a text sequence module, and a pronunciation sequence module. The acquisition module is configured to obtain each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on a target speech; the text sequence module is configured to obtain a target text sequence corresponding to the target speech based on each candidate text sequence; and the pronunciation sequence module is configured to determine a target pronunciation sequence using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and is used to provide a pronunciation reference for that target word when it appears in speech associated with the target speech.
To solve the above technical problem, a third aspect of the present application provides an electronic device, including a memory and a processor that are coupled to each other, where the memory stores program instructions; the processor is configured to execute program instructions stored in the memory to implement the method provided in the first aspect.
To solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium for storing program instructions that can be executed to implement the method provided in the above first aspect.
The beneficial effects of this application are as follows. Unlike the prior art, the present application obtains each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on a target speech; obtains a target text sequence corresponding to the target speech based on each candidate text sequence; and determines a target pronunciation sequence using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and is used to provide a pronunciation reference for that target word when it appears in speech associated with the target speech. By determining the target pronunciation sequence in combination with the candidate pronunciation sequences produced by recognizing the target speech, the accuracy of the target pronunciation sequence corresponding to the recognized target text can be improved.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of a speech recognition method provided in the present application;
FIG. 2 is a schematic flow chart of a second embodiment of a speech recognition method provided in the present application;
FIG. 3 is a flowchart illustrating a third embodiment of a speech recognition method provided in the present application;
FIG. 4 is a schematic diagram of a speech recognition device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a frame structure of an embodiment of an electronic device provided herein;
FIG. 6 is a schematic diagram of a framework of an embodiment of a computer readable storage medium provided herein.
Detailed Description
The following description of the embodiments of the present application, taken in conjunction with the accompanying drawings, will clearly and fully describe the embodiments of the present application, and it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be noted that, in the embodiments of the present application, there is a description of "first", "second", etc., which are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of the speech recognition method provided in the present application, where the method includes:
S11: obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on the target speech.
In one embodiment, the target speech may be uttered by a person or by any device capable of producing speech, such as an intelligent robot, a mobile phone, or a computer. Speech recognition of the target speech may be performed by a speech recognition model, which may employ a convolutional neural network. Before the target speech is recognized by the speech recognition model, feature extraction may be performed on it so that only the useful information in the target speech is retained, reducing the subsequent computation. In a specific embodiment, the FBANK (log Mel filterbank) features of the speech are extracted as the audio features of the target speech; FBANK features retain more audio information and are well suited to neural network training and inference. FBANK is a front-end processing algorithm that processes audio in a manner similar to the human ear and can thus improve speech recognition performance. The audio features are then input into the speech recognition model, which decodes them to obtain the candidate text sequences and the candidate pronunciation sequence corresponding to each candidate text sequence. Each candidate text sequence contains text recognized from the target speech, and the corresponding candidate pronunciation sequence contains the pronunciation of each word in that candidate text sequence, each pronunciation consisting of the pinyin and the tone of the word.
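By way of illustration, the sketch below extracts FBANK features and shows the kind of n-best decoder output assumed in the rest of this description. It is a minimal example only: the patent names no toolkit, so librosa, the 16 kHz sampling rate, the 40 Mel bands, the frame sizes, and the n-best structure are all assumptions made for the example.

    # Minimal FBANK front end; library and parameter choices are illustrative.
    import librosa
    import numpy as np

    def extract_fbank(wav_path: str, n_mels: int = 40) -> np.ndarray:
        """Extract log Mel filterbank (FBANK) features from an audio file."""
        y, sr = librosa.load(wav_path, sr=16000)   # target speech signal
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
        )                                          # 25 ms window, 10 ms hop
        return librosa.power_to_db(mel).T          # shape: (frames, n_mels)

    # Hypothetical shape of the recognizer's n-best output: each hypothesis
    # pairs a candidate text sequence with its candidate pronunciation
    # sequence (pinyin plus tone number per word) and an acoustic score.
    candidates = [
        {"text": ["我", "很", "开心"],
         "pron": ["wo3", "hen3", "kai1xin1"], "am_score": -12.3},
        {"text": ["我", "很", "开", "薪"],
         "pron": ["wo3", "hen3", "kai1", "xin1"], "am_score": -14.1},
    ]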
S12: and obtaining a target text sequence corresponding to the target voice based on each candidate text sequence.
In one embodiment, after each candidate text sequence is obtained, the language model may be utilized to reorder each candidate text sequence to obtain the target text sequence. It will be appreciated that the target text sequence may be identical to the candidate text sequences, in which case the target text sequence may be the candidate text sequence having the greatest probability value among the candidate text sequences. However, due to the limitation of the decoding time and space of the speech recognition model, there may be some unsaved candidate text sequences, which may cause the target text sequence and the candidate text sequence to be different, where the probability value of the target text sequence is greater than the probability value of each candidate text sequence, that is, the accuracy of representing the target text sequence is greater than the accuracy of each candidate text sequence. The language model may be an N-gram model, such as a 4-gram model.
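A minimal sketch of this rescoring step follows, reusing the candidate structure from the earlier sketch. The combination weights and the toy count-based 4-gram scorer are stand-ins for a trained language model, not the patent's implementation.

    import math

    def make_ngram_scorer(counts: dict, n: int = 4):
        """Toy count-based n-gram log-probability; a real system would use a
        trained 4-gram LM. `counts` maps token tuples to occurrence counts."""
        def lm_logprob(tokens):
            logp = 0.0
            for i in range(len(tokens)):
                ctx = tuple(tokens[max(0, i - n + 1):i])
                num = counts.get(ctx + (tokens[i],), 1)   # add-one style floor
                den = max(counts.get(ctx, 1), num)
                logp += math.log(num / den)
            return logp
        return lm_logprob

    def rerank(candidates, lm_logprob, am_weight=1.0, lm_weight=0.5):
        """Choose the target text sequence by combined acoustic + LM score."""
        return max(candidates,
                   key=lambda c: am_weight * c.get("am_score", 0.0)
                               + lm_weight * lm_logprob(c["text"]))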
S13: and determining a target pronunciation sequence by utilizing the candidate pronunciation sequence, wherein the target pronunciation sequence comprises pronunciation corresponding to at least one target word in the target text sequence, and the target pronunciation sequence is used for providing reference for pronunciation of the target word contained in the associated voice of the target voice.
In one embodiment, after the target text sequence is obtained, the pronunciation table may be queried to obtain the pronunciation of each target word in the target text sequence in the pronunciation table, the pronunciation table stores the possible pronunciation of a plurality of words, and the initial pronunciation sequence corresponding to the target text sequence is obtained by using the pronunciation of each target word in the pronunciation table. The target word may be a word or a word, which is not specifically limited herein. It will be appreciated that since there may be multiple pronunciations of the target word segment in the pronunciation table, there may be multiple initial pronunciation sequences corresponding to the target text sequence. Taking "see" word as an example, the word has two pronunciations in the pronunciation table, namely "k a n" and "k ā n", if other target word segments in the target text sequence have only one pronunciation in the pronunciation table, the initial pronunciation sequences corresponding to the target text sequence are two, the pronunciation corresponding to the "see" word in one initial pronunciation sequence is "k a n", the pronunciation corresponding to the "see" word in the other initial pronunciation sequence is "k ā n", and the pronunciations corresponding to the other target word segments in the two initial pronunciation sequences are the same.
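The enumeration of initial pronunciation sequences can be sketched as a Cartesian product over each target word's table pronunciations. The table fragment below is hypothetical, matching the "看" example, with tones written as digits:

    from itertools import product

    # Hypothetical fragment of the pronunciation table.
    PRON_TABLE = {"看": ["kan4", "kan1"], "书": ["shu1"]}

    def initial_pron_sequences(target_words):
        """One initial pronunciation sequence per combination of the possible
        pronunciations of each target word."""
        options = [PRON_TABLE[w] for w in target_words]
        return [list(seq) for seq in product(*options)]

    # initial_pron_sequences(["看", "书"])
    #   -> [["kan4", "shu1"], ["kan1", "shu1"]]  (two sequences, as described)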
Further, from the candidate pronunciation sequences, the candidate pronunciation sequence whose similarity meets a preset requirement is selected as the reference pronunciation sequence. In one embodiment, the preset requirement may be that the reference pronunciation sequence is identical to the initial pronunciation sequence, and this can be determined from the candidate text sequences and the target text sequence: if a candidate text sequence identical to the target text sequence exists, the candidate pronunciation sequence corresponding to that candidate text sequence is used as the reference pronunciation sequence. In another embodiment, the preset requirement is that the reference pronunciation sequence has the highest similarity to the initial pronunciation sequence; the similarity between each candidate text sequence and the target text sequence is computed, and the candidate pronunciation sequence corresponding to the candidate text sequence most similar to the target text sequence is used as the reference pronunciation sequence.
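Both selection strategies can be combined in one routine. A minimal sketch, assuming the candidate structure from the earlier sketch, with `similarity` left as a pluggable helper (a concrete edit-distance version is sketched under the second embodiment below):

    def select_reference(candidates, target_text, similarity):
        """Pick the reference pronunciation sequence from the n-best list:
        prefer a candidate whose text equals the target text sequence, else
        fall back to the candidate with the highest text similarity."""
        for c in candidates:
            if c["text"] == target_text:
                return c["pron"]
        best = max(candidates, key=lambda c: similarity(c["text"], target_text))
        return best["pron"]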
After the reference pronunciation sequence is obtained, the initial pronunciation sequence is adjusted based on it to obtain the final pronunciation sequence corresponding to the target text sequence. In one embodiment, the initial pronunciation sequence itself may serve as the final pronunciation sequence, or the pronunciation of a target word in the initial pronunciation sequence may be updated with that word's pronunciation in the reference pronunciation sequence to obtain the final pronunciation sequence. In another embodiment, if there are multiple initial pronunciation sequences, one of them may be selected based on the reference pronunciation sequence as the final pronunciation sequence.
Finally, the pronunciation corresponding to at least one target word is selected from the final pronunciation sequence to form the target pronunciation sequence. In one embodiment, the entire final pronunciation sequence may be used as the target pronunciation sequence. For example, when checking a child's pronunciation, the speech uttered by the child can be processed as described above to obtain the final pronunciation sequence, which is taken as the target pronunciation sequence and compared against an annotated reference pronunciation sequence to judge whether the child's pronunciation is correct. In another embodiment, the pronunciations corresponding to only some of the target words may be selected from the final pronunciation sequence to form the target pronunciation sequence. The target pronunciation sequence provides a pronunciation reference for the target words contained in the speech associated with the target speech. For example, in voice interaction, when a robot interacts with a user, the robot processes the user's speech as described above. Taking the user utterance "I am happy" as an example, the robot selects the pronunciation corresponding to the target word "happy" from the final pronunciation sequence to form the target pronunciation sequence, and then replies with speech that also contains the word "happy". The robot's reply is the speech associated with the target speech, and when producing it the robot uses the pronunciation of the target word given in the target pronunciation sequence. In other scenarios the speech associated with the target speech may be something else, which is not specifically limited here.
In the above manner, each candidate text sequence and the candidate pronunciation sequence corresponding to each candidate text sequence are obtained by performing speech recognition on the target speech; a target text sequence corresponding to the target speech is obtained based on the candidate text sequences; and a target pronunciation sequence is determined using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and provides a pronunciation reference for that target word when it appears in speech associated with the target speech. By determining the target pronunciation sequence in combination with the candidate pronunciation sequences produced by recognizing the target speech, the accuracy of the target pronunciation sequence corresponding to the recognized target text can be improved.
Referring to fig. 2, fig. 2 is a flow chart of a second embodiment of the speech recognition method provided in the present application, where the method includes:
S21: obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on the target speech.
S22: obtaining a target text sequence corresponding to the target speech based on each candidate text sequence.
For the detailed implementation of steps S21 and S22, please refer to steps S11 and S12 of the first embodiment of the speech recognition method provided in the present application; the details are not repeated here.
S23: and acquiring an initial pronunciation sequence corresponding to the target text sequence.
In one embodiment, each target word in the target text sequence has a corresponding ID, and the initial pronunciation of the target word is obtained in the pronunciation table by using the ID of the target word. It will be appreciated that there may be one or more initial pronunciations for the target word segment, and thus, the initial pronunciation sequence corresponding to the target text sequence may be one or more.
S24: and judging whether each candidate text sequence is identical to the target text sequence.
In an embodiment, each candidate text sequence may be directly compared with the target text sequence to determine whether the candidate text sequence is identical to the target text sequence; the similarity between each candidate text sequence and the target text sequence can also be calculated through an edit distance algorithm, and if the similarity is 100%, the candidate text sequence is the same as the target text sequence.
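A minimal sketch of such an edit-distance similarity over word sequences follows; normalizing the Levenshtein distance by the longer sequence length is an assumption, chosen so that identical sequences score exactly 100%:

    def edit_similarity(a, b) -> float:
        """1.0 when the sequences are identical; lower as edits accumulate."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # deletions
        for j in range(n + 1):
            dp[0][j] = j                      # insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,
                               dp[i][j - 1] + 1,
                               dp[i - 1][j - 1] + cost)
        return 1.0 - dp[m][n] / max(m, n, 1)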
S25: and taking the candidate pronunciation sequence corresponding to the candidate text sequence identical to the target text sequence as a reference pronunciation sequence.
S26: and judging whether the number of each reference pronunciation in the reference pronunciation sequence is consistent with the number of each target word of the target text sequence and the number of the target word to obtain a first judging result.
In one embodiment, the comparison in step S24 is performed on the sequences as a whole. It is therefore still necessary to determine whether each candidate word in the candidate text sequence identical to the target text sequence matches each target word in the target text sequence, i.e., whether each reference pronunciation in the reference pronunciation sequence is consistent with each target word; and to determine whether the number of reference pronunciations equals the number of target words, i.e., whether the number of candidate words in that candidate text sequence equals the number of target words.
Taking the case of a single initial pronunciation sequence as an example: in a specific embodiment, whether each candidate word in the candidate text sequence identical to the target text sequence matches each target word can be judged to determine whether each reference pronunciation in the reference pronunciation sequence is consistent with each target word in the target text sequence; when a candidate word is the same as the corresponding target word, the reference pronunciation is deemed consistent with that target word. In another embodiment, whether each reference pronunciation in the reference pronunciation sequence is identical to each initial pronunciation in the initial pronunciation sequence can be judged instead; when a reference pronunciation is the same as the corresponding initial pronunciation, it is deemed consistent with the target word. Here, each reference pronunciation is the pronunciation of a reference word in the reference text sequence corresponding to the reference pronunciation sequence, and the number of reference pronunciations equals the number of reference words.
S27: based on the first determination result, a final pronunciation sequence is determined.
In an embodiment, the first determination result is that each reference pronunciation is consistent with each target word, and the number of the reference pronunciations is the same as the number of the target words, and the reference pronunciation sequence is used as the final pronunciation sequence.
In another embodiment, the first determination result is that each reference pronunciation is inconsistent with the target word, the number of the reference pronunciations is the same as the number of the target word, or the first determination result is that each reference pronunciation is consistent with the target word, the number of the reference pronunciations is different from the number of the target word, or the first determination result is that each reference pronunciation is inconsistent with the target word, the number of the reference pronunciations is different from the number of the target word, and the final pronunciation sequence is obtained based on each reference pronunciation and the initial pronunciation of each target word in the initial pronunciation sequence.
Obtaining the final pronunciation sequence based on the reference pronunciations and the initial pronunciations of the target words in the initial pronunciation sequence includes: judging, for each reference word, whether it is the same as any target word; taking the reference pronunciation corresponding to a reference word that matches a target word as the target pronunciation of that first target word; and taking the initial pronunciation in the initial pronunciation sequence as the target pronunciation of each second target word. Here, a first target word is a target word that also appears among the reference words, and a second target word is a target word that does not. The final pronunciation sequence is then obtained from the target pronunciations of the first and second target words. In other words, the final pronunciation sequence can be obtained by replacing the initial pronunciation of each first target word in the initial pronunciation sequence with that word's reference pronunciation in the reference pronunciation sequence, as sketched below.
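A minimal sketch of this replacement step, assuming words can be aligned by identity (duplicate words would need positional alignment, omitted here for brevity):

    def merge_pronunciations(target_words, initial_pron, ref_words, ref_pron):
        """First target words (present among the reference words) take the
        reference pronunciation; second target words keep their initial one."""
        ref_lookup = dict(zip(ref_words, ref_pron))
        return [ref_lookup.get(word, init)          # reference wins if found
                for word, init in zip(target_words, initial_pron)]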
S28: and selecting pronunciation corresponding to at least one target word from the final pronunciation sequence to form a target pronunciation sequence.
In the detailed implementation of step S28, please refer to step S13 of the first implementation of the voice recognition method provided in the present application, and the details are not repeated here.
In the embodiment, for the case that candidate text sequences identical to the target text sequence exist in the candidate text sequences, the candidate text sequences identical to the target text sequence are used as reference text sequences, and if the target text sequences are identical to the reference text sequences, whether all target segmentation words in the target text sequences are identical to all reference segmentation words in the reference text sequences is further judged, and if the target segmentation words are identical to all the reference segmentation words in the reference text sequences, a reference pronunciation sequence corresponding to the reference text sequences is used as a final pronunciation sequence; if the initial pronunciation of the first target word in the initial pronunciation sequence is not the same, replacing the initial pronunciation of the first target word in the initial pronunciation sequence with the reference pronunciation of the first target word in the reference pronunciation sequence, and obtaining the final pronunciation sequence.
Referring to fig. 3, fig. 3 is a flowchart of a third embodiment of the speech recognition method provided in the present application, where the method includes:
S31: obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on the target speech.
S32: obtaining a target text sequence corresponding to the target speech based on each candidate text sequence.
S33: acquiring an initial pronunciation sequence corresponding to the target text sequence.
For the detailed implementation of steps S31 to S33, please refer to steps S21 to S23 of the second embodiment of the speech recognition method provided in the present application; the details are not repeated here.
S34: and calculating the similarity between each candidate text sequence and the target text sequence, and taking the candidate pronunciation sequence corresponding to the candidate text sequence with the maximum similarity of the target text sequence as the reference pronunciation sequence.
In one embodiment, the similarity between each candidate text sequence and the target text sequence can be calculated through an edit distance algorithm, and the candidate text sequence with the largest similarity with the target text sequence in each candidate text sequence is used as the reference text sequence. Wherein each candidate text sequence is different from the target text sequence. And taking the candidate pronunciation sequence corresponding to the reference text sequence as the reference pronunciation sequence.
S35: it is determined whether the reference pronunciation sequence and the initial pronunciation sequence are identical to determine a final pronunciation sequence.
In one embodiment, the reference pronunciation sequence is compared with the initial pronunciation sequence, and in response to the reference pronunciation sequence being the same as the initial pronunciation sequence, the reference pronunciation sequence is taken as the final pronunciation sequence; and responding to the difference between the reference pronunciation sequence and the initial pronunciation sequence, acquiring the pronunciation of each target word in the target text sequence in the pronunciation table, and acquiring a final pronunciation sequence based on the pronunciation of each target word in the pronunciation table. In a specific embodiment, if only one pronunciation exists in the pronunciation table, the pronunciation is used as the target pronunciation of the target word, and the target pronunciation of each target word is combined to obtain the final pronunciation sequence.
In another embodiment, if a target word has at least two pronunciations in the pronunciation table, obtaining the final pronunciation sequence from the table pronunciations includes: selecting the reference pronunciation corresponding to the target word in the reference pronunciation sequence; selecting, from the at least two table pronunciations, the one whose tone matches the tone of that reference pronunciation as the target pronunciation of the target word; and obtaining the final pronunciation sequence from the target pronunciations. In one embodiment, the reference pronunciation corresponding to a target word may be determined from the position of each reference pronunciation in the reference pronunciation sequence and the position of the target word in the target text sequence. For example, if the target word is the third word in the target text sequence, the corresponding reference pronunciation is the third pronunciation in the reference pronunciation sequence. In other embodiments, the reference pronunciation corresponding to the current target word may be located via the target words whose pronunciations already match reference pronunciations. For example, suppose the target text sequence contains three target words while the reference text sequence contains four reference words, and the target pronunciation of the first target word matches the reference pronunciation of the first reference word while the target pronunciation of the third target word matches the reference pronunciation of the fourth reference word; then the reference pronunciations corresponding to the second target word can be located between them, namely the reference pronunciations of the second and third reference words. After the reference pronunciation corresponding to a target word is determined, the table pronunciation with the same tone as the reference pronunciation is selected as the target pronunciation. For example, assuming the target word "难" ("difficult") has two pronunciations in the pronunciation table, "nán" and "nàn", and the reference pronunciation corresponding to it is "lán", then "nán", which carries the same tone, is selected as the target pronunciation of "难", and the target pronunciations of the individual target words are combined to obtain the final pronunciation sequence.
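The tone-matching selection can be sketched as follows, assuming the numeric-tone pinyin convention (e.g. "nan2" for "nán") used elsewhere in this description; the alignment of the reference pronunciation is taken as given:

    def pick_by_tone(table_prons, reference_pron):
        """Among a polyphone's table pronunciations, pick the one whose tone
        digit matches the tone of the aligned reference pronunciation."""
        ref_tone = reference_pron[-1]                 # trailing tone digit
        for pron in table_prons:
            if pron[-1] == ref_tone:
                return pron
        return table_prons[0]                         # fallback: first entry

    # pick_by_tone(["nan2", "nan4"], "lan2") -> "nan2"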
S36: and selecting pronunciation corresponding to at least one target word from the final pronunciation sequence to form a target pronunciation sequence.
In the detailed implementation of step S36, please refer to step S13 of the first implementation of the voice recognition method provided in the present application, and the details are not repeated here.
In the embodiment, for the case that the candidate text sequence which is the same as the target text sequence does not exist in the candidate text sequences, the candidate text sequence with the highest similarity with the target text sequence is used as a reference text sequence, whether the reference pronunciation sequence is the same as the initial pronunciation sequence is further judged, and if the candidate text sequence is the same as the initial pronunciation sequence, the reference pronunciation sequence is used as a final pronunciation sequence; if the target pronunciation is different, the pronunciation of each target word in the target text sequence in the pronunciation table is obtained, the reference pronunciation corresponding to the target word in the reference pronunciation sequence is selected, and the pronunciation which is the same as the tone of the reference pronunciation is selected from at least two pronunciations of the target word in the pronunciation table to be used as the target pronunciation of the target word; and combining the target pronunciations of the target segmentation words to obtain a final pronunciation sequence.
In the above embodiments, after the final pronunciation sequence is obtained, it may additionally be corrected. Specifically, it is judged whether each target pronunciation in the final pronunciation sequence, corresponding to each target word in the target text sequence, meets a preset correction condition; any target pronunciation meeting the condition is corrected according to a preset correction rule, and the corrected sequence is used as the final pronunciation sequence. The preset correction conditions and rules are set by the user. For example, the condition may be that a preset first pronunciation appears among the target pronunciations, and the rule may be to change that first pronunciation into a preset second pronunciation, e.g., changing ba4 and ma4 into ba5 and ma5. These conditions and rules are merely illustrative; in other embodiments the user may set others as needed.
In one embodiment, correcting the final pronunciation sequence repairs certain pronunciations that the model systematically mispredicts during the pronunciation matching stage. For example, light-tone correction: the model tends to predict the pronunciation of modal particles as the fourth tone (吧: ba4, 吗: ma4), which can be corrected to the neutral tone (吧: ba5, 吗: ma5) by a correction rule. Or unifying pronunciations: for example, "这" ("this") has two pronunciations, "zhe4" and "zhei4", which can be unified to "zhe4" by a correction rule.
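A sketch of this correction stage follows. The two rules mirror the examples above (light-tone correction for the modal particles and unifying "zhei4" with "zhe4"); the plain lookup-table form is an assumption, and a production system would also condition on the word itself rather than on the pinyin alone (e.g. so that ba4 for "爸" is left untouched):

    # Hypothetical correction table mirroring the rules described above.
    CORRECTION_RULES = {
        "ba4": "ba5",     # modal particle 吧: fourth tone -> neutral tone
        "ma4": "ma5",     # modal particle 吗: fourth tone -> neutral tone
        "zhei4": "zhe4",  # unify the two readings of 这 ("this")
    }

    def apply_corrections(final_pron):
        """Rewrite any target pronunciation matching a preset rule; the
        corrected sequence becomes the final pronunciation sequence."""
        return [CORRECTION_RULES.get(p, p) for p in final_pron]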
Referring to fig. 4, fig. 4 is a schematic diagram of a frame of an embodiment of a speech recognition device provided in the present application.
The speech recognition device 40 comprises an acquisition module 41, a text sequence module 42, and a pronunciation sequence module 43. The acquisition module 41 is configured to obtain each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, produced by performing speech recognition on the target speech; the text sequence module 42 is configured to obtain a target text sequence corresponding to the target speech based on each candidate text sequence; and the pronunciation sequence module 43 is configured to determine a target pronunciation sequence using the candidate pronunciation sequences, where the target pronunciation sequence includes the pronunciation corresponding to at least one target word in the target text sequence and is used to provide a pronunciation reference for that target word when it appears in speech associated with the target speech.
In one embodiment, the pronunciation sequence module 43 determines the target pronunciation sequence using the candidate pronunciation sequences by: acquiring the initial pronunciation sequence corresponding to the target text sequence; selecting, from the candidate pronunciation sequences, the candidate pronunciation sequence whose similarity meets the preset requirement as the reference pronunciation sequence; adjusting the initial pronunciation sequence based on the reference pronunciation sequence to obtain the final pronunciation sequence corresponding to the target text sequence; and selecting the pronunciation corresponding to at least one target word from the final pronunciation sequence to form the target pronunciation sequence.
In one embodiment, the pronunciation sequence module 43 may be configured to judge whether any candidate text sequence is identical to the target text sequence; to take the candidate pronunciation sequence corresponding to the candidate text sequence identical to the target text sequence as the reference pronunciation sequence; to judge whether each reference pronunciation in the reference pronunciation sequence is consistent with each target word of the target text sequence, and whether the number of reference pronunciations equals the number of target words, to obtain a first determination result, where each reference pronunciation is the pronunciation of a reference word in the reference text sequence corresponding to the reference pronunciation sequence and the number of reference pronunciations equals the number of reference words; and to determine the final pronunciation sequence based on the first determination result.
In one embodiment, the pronunciation sequence module 43 may, in response to the first determination result being that each reference pronunciation is consistent with each target word and the number of reference pronunciations equals the number of target words, take the reference pronunciation sequence as the final pronunciation sequence; and, in response to the first determination result being that the reference pronunciations are inconsistent with the target words and/or the counts differ, obtain the final pronunciation sequence based on each reference pronunciation and the initial pronunciation of each target word in the initial pronunciation sequence.
In one embodiment, the pronunciation sequence module 43 may be further configured to judge whether each reference word is the same as any target word; to take the reference pronunciation corresponding to a reference word that matches a target word as the target pronunciation of a first target word; and to take the initial pronunciation of the target word in the initial pronunciation sequence as the target pronunciation of a second target word, where a first target word is a target word that also appears among the reference words and a second target word is one that does not; and to obtain the final pronunciation sequence based on the target pronunciations of the first and second target words.
In one embodiment, the pronunciation sequence module 43 may be further configured to calculate the similarity between each candidate text sequence and the target text sequence, and to use the candidate pronunciation sequence corresponding to the candidate text sequence with the greatest similarity to the target text sequence as the reference pronunciation sequence, where each candidate text sequence differs from the target text sequence. The pronunciation sequence module 43 is also configured to judge whether the reference pronunciation sequence is identical to the initial pronunciation sequence; in response to them being the same, to use the reference pronunciation sequence as the final pronunciation sequence; and in response to them being different, to obtain the pronunciation of each target word in the target text sequence from the pronunciation table and to obtain the final pronunciation sequence based on those table pronunciations.
In one embodiment, a target word has at least two pronunciations in the pronunciation table, each comprising pinyin and tone. The pronunciation sequence module 43 obtains the final pronunciation sequence based on the table pronunciations of the target words by: selecting the reference pronunciation corresponding to the target word in the reference pronunciation sequence; selecting, from the at least two pronunciations, the one with the same tone as the reference pronunciation as the target pronunciation of the target word; and obtaining the final pronunciation sequence using the target pronunciations.
In an embodiment, the speech recognition device may further include a correction module configured to, after the initial pronunciation sequence is adjusted based on the reference pronunciation sequence to obtain the final pronunciation sequence corresponding to the target text sequence, judge whether each target pronunciation in the final pronunciation sequence, corresponding to each target word in the target text sequence, meets a preset correction condition, and to correct any target pronunciation meeting the condition according to a preset correction rule.
Referring to fig. 5, fig. 5 is a schematic frame structure of an embodiment of an electronic device provided in the present application.
The electronic device 50 comprises a memory 51 and a processor 52 coupled to each other; the memory 51 stores program instructions, and the processor 52 is configured to execute the program instructions stored in the memory 51 to carry out the steps of any of the method embodiments described above. In one particular implementation scenario, the electronic device 50 may include, but is not limited to, a microcomputer or a server, and may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
In particular, the processor 52 is configured to control itself and the memory 51 to implement the steps of any of the method embodiments described above. The processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip with signal processing capabilities. The processor 52 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 52 may be implemented jointly by multiple integrated circuit chips.
Referring to fig. 6, fig. 6 is a schematic diagram of a framework of an embodiment of a computer readable storage medium provided in the present application.
The computer readable storage medium 60 stores program instructions 61 for implementing the steps of any of the method embodiments described above when the program instructions 61 are executed by a processor.
The computer readable storage medium 60 may be a medium that can store a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; it may also be a server storing the computer program, which can send the stored computer program to another device for execution or run it itself.
The foregoing description of the various embodiments focuses on the differences between them; for their common or similar parts, the embodiments may be referred to one another, and the details are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of this application involves personal information, any product applying it clearly informs users of the personal information processing rules and obtains their separate consent before processing the personal information. If the technical solution involves sensitive personal information, any product applying it obtains the individual's separate consent before processing that information and also meets the requirement of "explicit consent". For example, a clear and prominent sign is placed at a personal information collection device, such as a camera, to inform people that they are entering the collection range and that personal information will be collected; a person who voluntarily enters the collection range is deemed to consent to the collection. Alternatively, on a device that processes personal information, personal authorization is obtained through pop-up messages or by asking the person to upload personal information, provided the personal information processing rules are conveyed through conspicuous signs or notices. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information processed.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the patent application, and all equivalent structures or equivalent processes using the descriptions and the contents of the present application or other related technical fields are included in the scope of the patent application.

Claims (11)

1. A method of speech recognition, comprising:
obtaining each candidate text sequence, and the candidate pronunciation sequence corresponding to each candidate text sequence, obtained by performing speech recognition on a target speech;
obtaining a target text sequence corresponding to the target speech based on each candidate text sequence; and
determining a target pronunciation sequence by utilizing the candidate pronunciation sequence, wherein the target pronunciation sequence comprises the pronunciation corresponding to at least one target word in the target text sequence, and the target pronunciation sequence is used for providing a reference for the pronunciation of the target word contained in the speech associated with the target speech; the target pronunciation sequence is obtained by adjusting an initial pronunciation sequence corresponding to the target text sequence based on a reference pronunciation sequence, and the reference pronunciation sequence is the candidate pronunciation sequence whose similarity meets a preset requirement.
2. The method of claim 1, wherein said determining a target pronunciation sequence using said candidate pronunciation sequence comprises:
acquiring an initial pronunciation sequence corresponding to the target text sequence;
selecting the candidate pronunciation sequences with the similarity meeting the preset requirement from the candidate pronunciation sequences as reference pronunciation sequences;
adjusting the initial pronunciation sequence based on the reference pronunciation sequence to obtain a final pronunciation sequence corresponding to the target text sequence;
and selecting pronunciation corresponding to the at least one target word from the final pronunciation sequence to form the target pronunciation sequence.
3. The method of claim 2, wherein the preset requirement is that the reference pronunciation sequence is the same as the initial pronunciation sequence;
the selecting, from the candidate pronunciation sequences, the candidate pronunciation sequence having a similarity with the initial pronunciation sequence meeting a preset requirement as a reference pronunciation sequence includes:
judging whether each candidate text sequence is identical to the target text sequence or not;
taking the candidate pronunciation sequence corresponding to the candidate text sequence identical to the target text sequence as the reference pronunciation sequence;
the step of adjusting the initial pronunciation sequence based on the reference pronunciation sequence to obtain a final pronunciation sequence corresponding to the target text sequence comprises the following steps:
judging whether each reference pronunciation in the reference pronunciation sequence is consistent with each target word of the target text sequence, and whether the number of reference pronunciations equals the number of target words, to obtain a first determination result; wherein each reference pronunciation is the pronunciation of a reference word in the reference text sequence corresponding to the reference pronunciation sequence, and the number of reference pronunciations equals the number of reference words; and
determining the final pronunciation sequence based on the first determination result.
4. The method of claim 3, wherein the determining the final pronunciation sequence based on the first determination result comprises:
in response to the first determination result being that each reference pronunciation is consistent with each target word and the number of reference pronunciations equals the number of target words, taking the reference pronunciation sequence as the final pronunciation sequence; and
in response to the first determination result being that the reference pronunciations are inconsistent with the target words and/or the number of reference pronunciations differs from the number of target words, obtaining the final pronunciation sequence based on each reference pronunciation and the initial pronunciation of each target word in the initial pronunciation sequence.
5. The method of claim 4, wherein the obtaining of the final pronunciation sequence based on each reference pronunciation and the initial pronunciation of each target word in the initial pronunciation sequence comprises:
judging whether the reference word segment is the same as the target word segment or not;
Taking the reference pronunciation corresponding to the reference word which is the same as the target word in each reference word as the target pronunciation of the first target word;
the initial pronunciation of the target word in the initial pronunciation sequence is used as the target pronunciation of a second target word; the first target word is the same target word as the reference word in each target word, and the second target word is a different target word from the reference word in each target word;
and obtaining the final pronunciation sequence based on the target pronunciation of the first target word and the target pronunciation of the second target word.
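The word-by-word merge of claims 4 and 5 might look like this sketch: first target words (those also present in the reference text) keep the reference pronunciation, while second target words fall back to the initial pronunciation. The list-based representation is an assumption.

```python
# Sketch of claims 4-5: merge reference and initial pronunciations per word.
def merge_pronunciations(target_words, initial_pron, ref_words, ref_pron):
    ref_map = dict(zip(ref_words, ref_pron))  # reference word -> pronunciation
    final = []
    for word, init in zip(target_words, initial_pron):
        # First target word: also occurs in the reference text, so take the
        # reference pronunciation; second target word: keep the initial one.
        final.append(ref_map.get(word, init))
    return final

print(merge_pronunciations(
    ["我", "去", "重庆"], ["wo3", "qu4", "zhong4 qing4"],
    ["我", "去", "重庆", "了"], ["wo3", "qu4", "chong2 qing4", "le5"]))
# -> ['wo3', 'qu4', 'chong2 qing4']: the misread polyphone 重 is corrected
```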
6. The method of claim 2, wherein the preset requirement is that the reference pronunciation sequence has the highest similarity to the initial pronunciation sequence;
the selecting, from the candidate pronunciation sequences, a candidate pronunciation sequence whose similarity with the initial pronunciation sequence meets the preset requirement as a reference pronunciation sequence comprises:
calculating the similarity between each candidate text sequence and the target text sequence, and taking the candidate pronunciation sequence corresponding to the candidate text sequence with the highest similarity to the target text sequence as the reference pronunciation sequence; wherein each candidate text sequence differs from the target text sequence;
and the adjusting the initial pronunciation sequence based on the reference pronunciation sequence to obtain a final pronunciation sequence corresponding to the target text sequence comprises:
judging whether the reference pronunciation sequence is identical to the initial pronunciation sequence;
in response to the reference pronunciation sequence being the same as the initial pronunciation sequence, taking the reference pronunciation sequence as the final pronunciation sequence;
and in response to the reference pronunciation sequence differing from the initial pronunciation sequence, acquiring the pronunciation of each target word of the target text sequence from a pronunciation table, and obtaining the final pronunciation sequence based on the pronunciation of each target word in the pronunciation table.
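A sketch of claim 6's selection step, where no candidate text equals the target text and the most similar candidate supplies the reference. The character-overlap ratio from Python's difflib stands in for the unspecified similarity measure.

```python
import difflib

# Sketch of claim 6: pick the reference pronunciation sequence from the
# candidate whose text is most similar to the target text sequence.
def pick_reference(target_words, candidates):
    target_str = "".join(target_words)
    def sim(words):
        return difflib.SequenceMatcher(None, target_str, "".join(words)).ratio()
    _, best_pron = max(candidates, key=lambda c: sim(c[0]))
    return best_pron

cands = [(["重庆", "银行", "呢"], ["chong2 qing4", "yin2 hang2", "ne5"]),
         (["崇敬", "银行"], ["chong2 jing4", "yin2 hang2"])]
print(pick_reference(["重庆", "银行"], cands))
# -> ['chong2 qing4', 'yin2 hang2', 'ne5'] from the most similar candidate
```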
7. The method of claim 6, wherein the target word has at least two pronunciations in the pronunciation table, each of the pronunciations comprising a pinyin and a tone;
and the obtaining the final pronunciation sequence based on the pronunciation of each target word in the pronunciation table comprises:
selecting the reference pronunciation corresponding to the target word from the reference pronunciation sequence;
selecting, from the at least two pronunciations, the pronunciation whose tone is the same as that of the reference pronunciation as the target pronunciation of the target word;
and obtaining the final pronunciation sequence by using the target pronunciation of the target word.
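Claim 7's tone matching could be sketched as below. The pronunciation-table contents and the convention that the trailing digit of a pinyin string encodes the tone are assumptions for illustration.

```python
# Sketch of claim 7: among a polyphone's table entries, keep the one whose
# tone agrees with the reference pronunciation.
PRONUNCIATION_TABLE = {
    "重": ["zhong4", "chong2"],  # polyphone: zhòng / chóng
    "地": ["di4", "de5"],        # polyphone: dì / de (neutral tone)
}

def tone_of(pron: str) -> str:
    return pron[-1]  # assumed encoding: trailing digit 1-5 is the tone

def pick_by_tone(word: str, reference_pron: str) -> str:
    entries = PRONUNCIATION_TABLE.get(word, [reference_pron])
    for pron in entries:
        if tone_of(pron) == tone_of(reference_pron):
            return pron
    return entries[0]  # no tone match: fall back to the table's first entry

print(pick_by_tone("重", "chong2"))  # tone 2 matches -> 'chong2'
print(pick_by_tone("地", "de5"))     # tone 5 (neutral) matches -> 'de5'
```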
8. The method of claim 2, further comprising, after the adjusting the initial pronunciation sequence based on the reference pronunciation sequence to obtain a final pronunciation sequence corresponding to the target text sequence:
judging whether each target pronunciation in the final pronunciation sequence corresponding to each target word of the target text sequence meets a preset correction condition;
and correcting the target pronunciation that meets the preset correction condition according to a preset correction rule.
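Claim 8 leaves the correction condition and rule unspecified; the sketch below assumes a hypothetical rule table keyed by (word, pronunciation) pairs purely for illustration.

```python
# Sketch of claim 8: post-check the final pronunciation sequence and rewrite
# any target pronunciation that triggers a preset correction rule.
CORRECTION_RULES = {
    # hypothetical rule: normalize particle 的 to its neutral-tone reading
    ("的", "di4"): "de5",
}

def apply_corrections(target_words, final_pron):
    corrected = []
    for word, pron in zip(target_words, final_pron):
        corrected.append(CORRECTION_RULES.get((word, pron), pron))
    return corrected

print(apply_corrections(["我", "的"], ["wo3", "di4"]))  # -> ['wo3', 'de5']
```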
9. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire each candidate text sequence obtained by performing speech recognition on a target voice, and the candidate pronunciation sequence corresponding to each candidate text sequence;
a text sequence module, configured to obtain a target text sequence corresponding to the target voice based on each candidate text sequence;
and a pronunciation sequence module, configured to determine a target pronunciation sequence by using the candidate pronunciation sequences, wherein the target pronunciation sequence comprises the pronunciation of at least one target word in the target text sequence and serves as a pronunciation reference for the target word when it occurs in speech associated with the target voice; the target pronunciation sequence is obtained by adjusting an initial pronunciation sequence corresponding to the target text sequence based on a reference pronunciation sequence, the reference pronunciation sequence being the candidate pronunciation sequence whose similarity meets a preset requirement.
10. An electronic device, comprising: a memory and a processor coupled to each other, wherein
the memory stores program instructions;
and the processor is configured to execute the program instructions stored in the memory to implement the method of any one of claims 1-8.
11. A computer-readable storage medium storing program instructions which, when executed, implement the method of any one of claims 1-8.
CN202311380902.8A 2023-10-24 2023-10-24 Speech recognition method and device, electronic equipment and storage medium Active CN117116267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311380902.8A CN117116267B (en) 2023-10-24 2023-10-24 Speech recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117116267A CN117116267A (en) 2023-11-24
CN117116267B true CN117116267B (en) 2024-02-13

Family

ID=88798697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311380902.8A Active CN117116267B (en) 2023-10-24 2023-10-24 Speech recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117116267B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
CN106935239A (en) * 2015-12-29 2017-07-07 阿里巴巴集团控股有限公司 The construction method and device of a kind of pronunciation dictionary
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information
CN109658938A (en) * 2018-12-07 2019-04-19 百度在线网络技术(北京)有限公司 The method, apparatus of voice and text matches, equipment and computer-readable medium
CN111667810A (en) * 2020-06-08 2020-09-15 北京有竹居网络技术有限公司 Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111739514A (en) * 2019-07-31 2020-10-02 北京京东尚科信息技术有限公司 Voice recognition method, device, equipment and medium
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
CN114360514A (en) * 2021-07-23 2022-04-15 上海喜马拉雅科技有限公司 Speech recognition method, apparatus, device, medium, and product
CN114783405A (en) * 2022-05-12 2022-07-22 马上消费金融股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN115132174A (en) * 2022-06-21 2022-09-30 深圳华策辉弘科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN116110370A (en) * 2023-01-17 2023-05-12 科大讯飞股份有限公司 Speech synthesis system and related equipment based on man-machine speech interaction

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030191645A1 (en) * 2002-04-05 2003-10-09 Guojun Zhou Statistical pronunciation model for text to speech
JP5207642B2 (en) * 2007-03-06 2013-06-12 ニュアンス コミュニケーションズ,インコーポレイテッド System, method and computer program for acquiring a character string to be newly recognized as a phrase
US9460705B2 (en) * 2013-11-14 2016-10-04 Google Inc. Devices and methods for weighting of local costs for unit selection text-to-speech synthesis
KR20160098910A (en) * 2015-02-11 2016-08-19 한국전자통신연구원 Expansion method of speech recognition database and apparatus thereof
US10395654B2 (en) * 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
KR20220053491A (en) * 2020-10-22 2022-04-29 삼성전자주식회사 Electronic device and controlling method of electronic device

Also Published As

Publication number Publication date
CN117116267A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US10706852B2 (en) Confidence features for automated speech recognition arbitration
US8738375B2 (en) System and method for optimizing speech recognition and natural language parameters with user feedback
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
US9626957B2 (en) Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US9984679B2 (en) System and method for optimizing speech recognition and natural language parameters with user feedback
CN111312231B (en) Audio detection method and device, electronic equipment and readable storage medium
EP3479377A1 (en) Speech recognition
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
US10152298B1 (en) Confidence estimation based on frequency
CN110164416B (en) Voice recognition method and device, equipment and storage medium thereof
CN109299276B (en) Method and device for converting text into word embedding and text classification
CN113157876A (en) Information feedback method, device, terminal and storage medium
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN117116267B (en) Speech recognition method and device, electronic equipment and storage medium
CN116757208A (en) Data processing method, device and equipment
CN111626059B (en) Information processing method and device
CN114566156A (en) Keyword speech recognition method and device
US10885914B2 (en) Speech correction system and speech correction method
CN111382322B (en) Method and device for determining similarity of character strings
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN113793591B (en) Speech synthesis method, related device, electronic equipment and storage medium
US20230230584A1 (en) System and method for simultaneously identifying intent and slots in voice assistant commands
US20230377560A1 (en) Speech tendency classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant