WO2023035525A1

WO2023035525A1 - Speech recognition error correction method and system, and apparatus and storage medium

Info

Publication number: WO2023035525A1
Application number: PCT/CN2022/071074
Authority: WO
Inventors: 庄子扬; 魏韬; 马骏; 王少军; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-09-10
Filing date: 2022-01-10
Publication date: 2023-03-16
Also published as: CN113779972B; CN113779972A

Abstract

The present application relates to artificial intelligence technology. Disclosed are a speech recognition error correction method and system, and an apparatus and a storage medium. The method comprises: performing speech recognition on speech to be subjected to detection, so as to obtain text to be subjected to detection and a corresponding pronunciation sequence to be subjected to detection; according to said pronunciation sequence, constructing an FST to be subjected to detection; according to said FST and a keyword FST, determining several words to be subjected to error correction in said text, and determining a sentence to be subjected to error correction that includes said words; if said words are present in a Chinese character confusion set, determining a replacement word corresponding to each of said words; replacing said words in said sentence with the replacement words, so as to generate a replacement sentence; and when a first logic score of said sentence is less than a second logic score of the replacement sentence, replacing said sentence in said text with the replacement sentence, thereby completing error correction. By means of the technical solution in the present application, words which may have an error can be determined according to pronunciation, thereby realizing effective error correction on speech recognition text.

Description

Speech recognition error correction method, system, device and storage medium

This application claims priority to Chinese patent application No. 202111064048.5, filed on September 10, 2021 and entitled "Speech Recognition Error Correction Method, System, Device, and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

The present application relates to artificial intelligence technology, and in particular to a speech recognition error correction method, system, device and storage medium.

Background

With the continuous development of deep learning technology, a major breakthrough has been made in the field of speech recognition using deep learning technology, and the accuracy of automatic speech recognition (ASR) is getting higher and higher. Compared with other human-computer interaction methods, the interaction based on speech recognition is simpler and conforms to people's daily habits. Therefore, speech recognition technology is gradually penetrating into smart home, digital medical care, automatic driving and other fields.

However, the inventors recognize that in practical applications speech recognition technology is still subject to significant limitations. Factors such as non-standard pronunciation by the user and heavy environmental noise affect the accuracy of speech recognition. To improve accuracy, the related art proposes text-based grammatical or syntactic error correction of the speech recognition text, but the accuracy of such schemes is low, and different vertical fields exhibit different error patterns. As a result, it is difficult for the schemes in the related art to find errors in speech recognition text effectively, which lowers the accuracy of speech recognition technology.

Summary of the invention

This application aims to solve, at least to a certain extent, one of the technical problems in the related art. To this end, the present application provides a speech recognition error correction method, system, device and storage medium, which can correct speech recognition text and improve the accuracy of speech recognition.

To achieve the above purpose, an embodiment of the present application provides a speech recognition error correction method, which includes the following steps: performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected; constructing an FST to be detected according to the pronunciation sequence to be detected; obtaining a keyword FST and a Chinese character confusion set, wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field; determining, according to the FST to be detected and the keyword FST, several words to be corrected and several sentences to be corrected in the text to be detected, wherein the sentences to be corrected contain the words to be corrected; if a word to be corrected exists in the Chinese character confusion set, determining the replacement word corresponding to each word to be corrected according to the Chinese character confusion set; replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence; calculating a first logic score of the sentence to be corrected and a second logic score of the replacement sentence; and, when the first logic score is less than the second logic score, replacing the sentence to be corrected in the text to be detected with the replacement sentence.

To achieve the above purpose, an embodiment of the present application also provides a speech recognition error correction system, which includes a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module and an eighth module. The first module is used to perform speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected. The second module is used to construct an FST to be detected according to the pronunciation sequence to be detected. The third module is used to obtain a keyword FST and a Chinese character confusion set, wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field. The fourth module is used to determine, according to the FST to be detected and the keyword FST, several words to be corrected and several sentences to be corrected in the text to be detected, wherein the sentences to be corrected contain the words to be corrected. The fifth module is used to determine, if a word to be corrected exists in the Chinese character confusion set, the replacement word corresponding to each word to be corrected according to the Chinese character confusion set. The sixth module is used to replace the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence. The seventh module is used to calculate a first logic score of the sentence to be corrected and a second logic score of the replacement sentence. The eighth module is used to replace the sentence to be corrected in the text to be detected with the replacement sentence when the first logic score is less than the second logic score.

To achieve the above purpose, an embodiment of the present application also provides a device, which comprises: at least one processor; and at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor implements a speech recognition error correction method. The speech recognition error correction method includes: performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected; constructing an FST to be detected according to the pronunciation sequence to be detected; obtaining a keyword FST and a Chinese character confusion set, wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field; determining, according to the FST to be detected and the keyword FST, several words to be corrected and several sentences to be corrected in the text to be detected, wherein the sentences to be corrected contain the words to be corrected; if a word to be corrected exists in the Chinese character confusion set, determining the replacement word corresponding to each word to be corrected according to the Chinese character confusion set; replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence; calculating a first logic score of the sentence to be corrected and a second logic score of the replacement sentence; and, when the first logic score is less than the second logic score, replacing the sentence to be corrected in the text to be detected with the replacement sentence.

To achieve the above purpose, an embodiment of the present application also provides a computer storage medium, which stores a processor-executable program; when executed by the processor, the program implements a speech recognition error correction method. The speech recognition error correction method includes: performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected; constructing an FST to be detected according to the pronunciation sequence to be detected; obtaining a keyword FST and a Chinese character confusion set, wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field; determining, according to the FST to be detected and the keyword FST, several words to be corrected and several sentences to be corrected in the text to be detected, wherein the sentences to be corrected contain the words to be corrected; if a word to be corrected exists in the Chinese character confusion set, determining the replacement word corresponding to each word to be corrected according to the Chinese character confusion set; replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence; calculating a first logic score of the sentence to be corrected and a second logic score of the replacement sentence; and, when the first logic score is less than the second logic score, replacing the sentence to be corrected in the text to be detected with the replacement sentence.

The beneficial effects of the embodiments of the present application are as follows. First, speech recognition is performed on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected; the FST to be detected is constructed according to the pronunciation sequence to be detected; according to the FST to be detected and the keyword FST, several words to be corrected in the text to be detected are determined, together with the sentences to be corrected that contain them; if a word to be corrected exists in the obtained Chinese character confusion set, the replacement word corresponding to each word to be corrected is determined, and the word to be corrected in the sentence to be corrected is replaced with the replacement word to generate a replacement sentence; the first logic score of the sentence to be corrected and the second logic score of the replacement sentence are calculated, and when the first logic score is less than the second logic score, the sentence to be corrected in the text to be detected is replaced with the replacement sentence, thereby completing the error correction of the speech recognition text. Compared with the solutions in the related art that rely on grammar or syntax, the speech recognition error correction method proposed in the embodiment of the present application determines the words to be corrected that may contain errors according to the pronunciation of the speech recognition text, provides replacement words for them according to the Chinese character confusion set of the corresponding business, and finally decides whether correction is required according to the logic scores of the corresponding sentence before and after the replacement.
It can be seen that the embodiment of the present application can find recognition errors caused by mispronunciation in the specified business field, thereby increasing the probability of finding the words to be corrected. It can also correct those words effectively and reduce the miscorrection rate of speech recognition text, thereby effectively improving the accuracy of speech recognition text, so that speech recognition technology can play a greater role in digital medical care, smart home and other fields.

Description of the drawings

The accompanying drawings are used to provide a further understanding of the technical solution of the present application, and constitute a part of the specification, and are used together with the embodiments of the present application to explain the technical solution of the present application, and do not constitute a limitation to the technical solution of the present application.

FIG. 1 is a flow chart of the steps of the speech recognition error correction method provided by the embodiment of the present application;

FIG. 2 is a schematic diagram of the FST to be detected provided by the embodiment of the present application;

FIG. 3 is a flow chart of the steps of constructing the keyword FST and constructing the Chinese character confusion set provided by the embodiment of the present application;

FIG. 4 is a flow chart of the steps of constructing the Chinese character confusion set provided by the embodiment of the present application;

FIG. 5 is a flow chart of the steps of constructing the pronunciation confusion set provided by the embodiment of the present application;

FIG. 6 is a flow chart of the steps of constructing the keyword table provided by the embodiment of the present application;

FIG. 7 is a flow chart of the steps of constructing the keyword FST provided by the embodiment of the present application;

FIG. 8 is a schematic diagram of the keyword FST provided by the embodiment of the present application;

FIG. 9 is a schematic diagram of the speech recognition error correction system provided by the embodiment of the present application;

FIG. 10 is a schematic diagram of the device provided by the embodiment of the present application.

Detailed description

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit the present application.

It should be noted that although functional modules are divided in the system schematic diagram and a logical order is shown in the flow chart, in some cases the steps shown or described may be executed with a module division different from that of the system, or in an order different from that of the flow chart. The terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

In the subsequent description, use of suffixes such as 'module', 'part' or 'unit' for denoting elements is only for facilitating the description of the present application and has no specific meaning by itself. Therefore, 'module', 'part' or 'unit' may be used in combination.

The embodiments of the present application will be further described below in conjunction with the accompanying drawings.

Referring to FIG. 1, FIG. 1 is a flow chart of the steps of the speech recognition error correction method provided by the embodiment of the present application. The method involves the field of artificial intelligence speech recognition error correction. The method includes but is not limited to steps S100-S170:

Step S100, performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected;

Specifically, the speech to be detected in this embodiment of the present application refers to a speech segment produced by a person conducting business in a vertical business field. For example, in the digital medical industry, the speech to be detected can be the recording of a case discussion meeting held by doctors, the recording of an online consultation between a patient and a doctor, or a telephone conversation between a patient and the front desk of a hospital. In the medical vertical field, business recordings contain a large number of medical terms, including but not limited to hospital names, surgery names and drug names. Because the quality of the speech to be detected is low, or the speaker has an accent, these terms may be misrecognized; for example, "aspirin" may be recognized as "asipirin". The grammar-based or syntax-based speech recognition error correction schemes in the related art cannot find and correct such errors, which lowers the accuracy of speech recognition.

Based on this, the embodiment of the present application proposes to determine the words to be corrected according to pronunciation. Therefore, in step S100, speech recognition is first performed on the speech to be detected to generate the text to be detected, which is the text sequence corresponding to the speech to be detected. A table lookup is performed for each character in the text to be detected to obtain the corresponding pinyin unit, and the pinyin units of all characters are recorded in the pronunciation sequence to be detected. The pronunciation sequence to be detected is therefore the pinyin sequence corresponding to the text to be detected.

For example, if the text to be detected is "I am very happy", the generated pronunciation sequence to be detected is "wo hen kuai le". In addition, it should be noted that in the embodiment of the present application, when a character in the text to be detected is polyphonic, its most common pronunciation is generally selected as the corresponding pronunciation; for example, the pinyin unit selected for "乐" is "le".

In this way, after the pinyin of each word in the text to be detected is determined, the pronunciation sequence to be detected can be generated.
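The table lookup described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the table `PINYIN_TABLE`, the function `to_pronunciation_sequence`, and the Chinese input string (assumed here to be the original of "I am very happy") are all hypothetical.

```python
# Toy character-to-pinyin table (hypothetical); a real system would use a
# complete lexicon. For polyphonic characters only the most common reading
# is stored, e.g. "乐" maps to "le".
PINYIN_TABLE = {
    "我": "wo",
    "很": "hen",
    "快": "kuai",
    "乐": "le",
}


def to_pronunciation_sequence(text: str) -> list[str]:
    """Look up each character and return its pinyin unit, in text order.

    Raises KeyError for characters missing from the toy table.
    """
    return [PINYIN_TABLE[ch] for ch in text]


print(" ".join(to_pronunciation_sequence("我很快乐")))  # wo hen kuai le
```

Storing a single reading per character mirrors the rule stated above for polyphonic characters; a context-sensitive lookup would be a different design.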

Step S110, constructing an FST to be detected according to the pronunciation sequence to be detected;

Specifically, an FST to be detected is constructed according to the pronunciation sequence to be detected. FST stands for finite state transducer. This structure is similar to a tree diagram and can be used to build a dictionary that expresses different states and transition paths. In the embodiment of the present application, the FST to be detected, constructed from the pronunciation sequence to be detected, can be regarded as a single path that expresses the pinyin unit of each character in the text to be detected.

For example, if the text to be detected is "the weather is fine today", the pronunciation sequence to be detected obtained in step S100 can be expressed as "jin tian tian qi qing lang", and the FST to be detected corresponding to the text can then be constructed according to the order in which the pinyin units appear in the pronunciation sequence. The construction result is shown in FIG. 2, which is a schematic diagram of the FST to be detected provided by the embodiment of the present application. As shown in FIG. 2, a root node 200 is established first; the pinyin units of the pronunciation sequence to be detected are then arranged in order to obtain six child nodes 210, "jin-tian-tian-qi-qing-lang", and the last child node points to the end point 220. In this way, a pronunciation path corresponding to the text to be detected is obtained, namely the FST to be detected shown in FIG. 2.
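The single-path structure of FIG. 2 can be sketched as a chain of states. This is an illustrative sketch under assumptions: the `Arc` class and `build_linear_fst` are hypothetical names, and the pinyin units are placed on arcs between numbered states (the usual FST convention) rather than on nodes as drawn in FIG. 2.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    src: int    # state the arc leaves
    dst: int    # state the arc enters
    label: str  # pinyin unit carried by the arc


def build_linear_fst(pinyin_units: list[str]) -> tuple[list[Arc], int, int]:
    """Return (arcs, start_state, final_state) for a single-path chain."""
    arcs = [Arc(i, i + 1, unit) for i, unit in enumerate(pinyin_units)]
    return arcs, 0, len(pinyin_units)


arcs, start, final = build_linear_fst("jin tian tian qi qing lang".split())
print("-".join(arc.label for arc in arcs))  # jin-tian-tian-qi-qing-lang
```

State 0 plays the role of root node 200 and the final state plays the role of end point 220.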

Step S120, obtaining keyword FST and Chinese character confusion set;

Specifically, the keyword FST corresponding to the business records the pinyin of the entries in the keyword table of this business. The structure of the keyword FST is similar to that of the FST to be detected, and its construction steps will be elaborated below. The Chinese character confusion set records characters or words that are easily confused in this business field; its construction process will also be elaborated below.

It should be noted that the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field, so the errors in the text to be detected can be found more accurately by using the keyword FST and the Chinese character confusion set.

Step S130, according to the FST to be detected and the keyword FST, determine a number of words to be corrected and a number of sentences to be corrected in the text to be detected; wherein, the sentences to be corrected include words to be corrected;

Specifically, since the keyword FST and the FST to be detected have similar structures, it is convenient to recombine and compare them, that is, to compare the pinyin units in the FST to be detected with the pinyin units in the keyword FST. Nodes with the same pinyin are recorded as identical nodes. If the FST to be detected and the keyword FST share one or more identical nodes, the characters in the text to be detected that correspond to these identical nodes are taken as a word to be corrected. The number of identical nodes therefore equals the number of characters in the corresponding word to be corrected.

For example, the FST to be detected shown in FIG. 2 contains a node "lang". If the keyword FST also contains a node "lang", the character at the corresponding position in the text to be detected (the character pronounced "lang") is taken as a word to be corrected. Likewise, FIG. 2 contains two consecutive child nodes "qing-lang"; if the keyword FST also contains two consecutive nodes "qing-lang", the word "sunny" (pinyin "qing lang") at the corresponding position in the text to be detected is taken as a word to be corrected.

It should be noted that since the FST to be detected is constructed according to the order of the pinyin units of the text to be detected, this order must also be followed when recombining with the keyword FST. That is to say, if the FST to be detected and the keyword FST contain multiple identical nodes, these nodes must be consecutive. For example, the FST to be detected shown in FIG. 2 contains two consecutive nodes "tian-qi", and the keyword FST contains three consecutive nodes "tian-ran-qi". Although the two FSTs share the nodes "tian" and "qi", these two nodes are not consecutive in the keyword FST, so "tian-qi" cannot be regarded as identical nodes in the sense of the embodiment of the present application, and "weather" (pinyin "tian qi") in the text to be detected cannot be taken as a word to be corrected.
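The consecutive-match rule above can be sketched with a pinyin trie standing in for the keyword FST. This is a simplified illustration under assumptions, not the recombination procedure of the application: `build_keyword_trie`, `find_keyword_spans` and the two keyword entries are hypothetical.

```python
def build_keyword_trie(keywords: list[list[str]]) -> dict:
    """Nested-dict trie over pinyin units; the key "$" marks a keyword end."""
    root: dict = {}
    for units in keywords:
        node = root
        for unit in units:
            node = node.setdefault(unit, {})
        node["$"] = True
    return root


def find_keyword_spans(sequence: list[str], trie: dict) -> list[tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of consecutive matches."""
    spans = []
    for start in range(len(sequence)):
        node = trie
        for end in range(start, len(sequence)):
            node = node.get(sequence[end])
            if node is None:  # path broken: units are not consecutive
                break
            if "$" in node:   # a complete keyword pronunciation matched
                spans.append((start, end + 1))
    return spans


trie = build_keyword_trie([["qing", "lang"], ["tian", "ran", "qi"]])
seq = "jin tian tian qi qing lang".split()
print(find_keyword_spans(seq, trie))  # [(4, 6)]
```

Here "qing-lang" is matched at positions 4 to 5, while "tian-qi" is not matched against "tian-ran-qi" because the shared units are not consecutive in the keyword.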

According to the above content, the words to be corrected in the text to be detected can be determined, and each of them occurs within a sentence. Relatively mature sentence segmentation schemes already exist in the related art; the text to be detected can be segmented using such a scheme, and each sentence containing one or more words to be corrected is determined to be a sentence to be corrected.

According to step S130, all the words to be corrected in the text to be detected can be determined, together with the sentences to be corrected that contain them. Since the words to be corrected are screened by pronunciation through recombination with the keyword FST, the discovery rate of recognition errors caused by mispronunciation in the specified business field can be effectively increased, which helps reduce the miscorrection rate and improve the accuracy of speech recognition.

Step S140, if the word to be corrected exists in the Chinese character confusion set, determine the replacement word corresponding to each word to be corrected according to the Chinese character confusion set;

Each word to be corrected determined in step S130 is matched against the Chinese character confusion set of this business field. If the current word to be corrected exists in the Chinese character confusion set, the confusion set contains the replacement word that is easily confused with that word to be corrected.

Step S150, replacing the word to be corrected in the sentence to be corrected with a replacement word to obtain a replacement sentence;

Specifically, the words to be corrected in the sentence to be corrected are replaced with the replacement words determined in step S140, while other parts of the sentence to be corrected are not changed, so as to obtain a new replacement sentence.

For example, suppose the sentence to be corrected is "The weather is fine today" and the word to be corrected is "weather". If "weather" exists in the Chinese character confusion set corresponding to this business and its replacement word is "Tianqi", then after replacing the word to be corrected with the replacement word, the generated replacement sentence is "Tianqi is fine today".

According to this step S150, the words to be corrected in all the sentences to be corrected in the text to be detected are replaced, and several replacement sentences containing the replaced words are determined.
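Steps S140 and S150 can be sketched as a dictionary lookup followed by a positional substitution. This is a minimal sketch under assumptions: the confusion set contents, the English glosses used as tokens, and the function `make_replacement_sentences` are hypothetical.

```python
# Hypothetical confusion set: word to be corrected -> candidate replacements.
CONFUSION_SET = {"weather": ["Tianqi"]}


def make_replacement_sentences(tokens: list[str], index: int) -> list[list[str]]:
    """Return one copy of tokens per candidate, with tokens[index] swapped.

    Returns an empty list when the word is not in the confusion set, in
    which case no replacement sentence is generated.
    """
    word = tokens[index]
    results = []
    for candidate in CONFUSION_SET.get(word, []):
        replaced = list(tokens)   # leave the rest of the sentence unchanged
        replaced[index] = candidate
        results.append(replaced)
    return results


print(make_replacement_sentences(["today", "weather", "sunny"], 1))
# [['today', 'Tianqi', 'sunny']]
```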

Step S160, calculating the first logic score of the sentence to be corrected and the second logic score of the replacement sentence;

Specifically, in this step S160, the first logic score of the sentence to be corrected and the second logic score of the corresponding replacement sentence need to be calculated.

In some embodiments, the language logic model of the corresponding business can be used to calculate the logic score of a sentence, for example an N-gram model. N-gram is an algorithm based on a statistical language model: a "gram" refers to a segment of units (for example words or characters), and N refers to the number of units in the segment. The model estimates the probability of the N-th word from the preceding (N-1) words; common cases are the bigram (N = 2) and the trigram (N = 3). For a whole sentence, the probability of the sentence is obtained from the probability of each word in the sentence, and the probability of each word is computed from the corpus used to train the N-gram model. The embodiment of this application does not specifically limit the training process of the language logic model, nor the way in which the language logic model calculates the logic score. The point is that a language logic model capable of calculating logic scores for sentences in a business field can be trained on a large amount of text data from that field.

Therefore, the first logic score can be determined by inputting the sentence to be corrected into the language logic model. For example, suppose the sentence to be corrected is "I love reading". After the necessary word segmentation and other processing have been performed on the sentence using related techniques such as business dictionaries, the sentence can be divided into the words "I", "love" and "reading". In some embodiments, the logic score of the sentence to be corrected can be expressed as follows:

p(I love reading) = p(I|<s>) + p(love|I) + p(reading|love) + p(</s>|reading)

Here p denotes the probability score given by the language logic model (the terms are summed and take negative values, so they behave as log-probabilities), "|" denotes conditioning on the preceding word, and <s> and </s> mark the start and end of the sentence.

The pre-trained language logic model determines the score of each term. For example, if p(I|<s>) = -0.2, p(love|I) = -0.8, p(reading|love) = -0.7 and p(</s>|reading) = -0.4, then p(I love reading) = -0.2 - 0.8 - 0.7 - 0.4 = -2.1, that is, the first logic score is -2.1.

Similarly, the replacement sentence is input into the same language logic model, and the second logic score of the replacement sentence is calculated according to the above steps.
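The score calculation can be sketched with a bigram table. This is an illustration under assumptions, not the language logic model of the application: the sentence boundary markers `<s>` and `</s>`, the fallback value `UNSEEN_SCORE` and the function `logic_score` are hypothetical, and the four table entries reuse the example values from the text as log-probability scores.

```python
# Bigram scores keyed by (previous word, current word).
BIGRAM_SCORE = {
    ("<s>", "I"): -0.2,
    ("I", "love"): -0.8,
    ("love", "reading"): -0.7,
    ("reading", "</s>"): -0.4,
}
UNSEEN_SCORE = -10.0  # hypothetical penalty for bigrams absent from the table


def logic_score(words: list[str]) -> float:
    """Sum the bigram scores over the sentence padded with boundary markers."""
    padded = ["<s>", *words, "</s>"]
    return sum(
        BIGRAM_SCORE.get(pair, UNSEEN_SCORE)
        for pair in zip(padded, padded[1:])
    )


print(round(logic_score(["I", "love", "reading"]), 1))  # -2.1
```

A sentence containing an unlikely word sequence picks up the fallback penalty and therefore scores lower, which is what lets the comparison in the next step favor the more plausible sentence.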

Step S170, when the first logic score of the sentence to be corrected is less than the second logic score of the replacement sentence, replace the sentence to be corrected in the text to be detected with the replacement sentence;

Specifically, the first logic score and the second logic score calculated in step S160 are compared. Suppose the first logic score is -2.1 and the second logic score is -2.0; the first logic score is then smaller than the second. That is to say, for this business field the language logic of the replacement sentence is more fluent and the replacement sentence is more likely to be correct, so the sentence to be corrected is replaced with the corresponding replacement sentence.

Similarly, by comparing the logic scores of all the sentences to be corrected in the text to be detected with the corresponding replacement sentences, the speech recognition error correction of the text to be detected can be completed.

It can be understood that, according to step S170, whether a sentence to be corrected is replaced with a replacement sentence depends on the logic scores of the two sentences. As explained in step S130, a sentence to be corrected is a sentence containing one or more words to be corrected. When a sentence to be corrected contains multiple words to be corrected, the same sentence may give rise to multiple replacement sentences, and different replacement sentences may obtain different second logic scores.

For example, suppose the sentence to be corrected is "the weather is sunny today" and the words to be corrected in it are "weather" and "sunny", where the replacement word for "weather" is "Tianqi" and the replacement word for "sunny" is "love man". Different embodiments can generate different replacement sentences.

For example, in some embodiments, the words to be corrected in a sentence to be corrected are permuted and combined to obtain multiple replacement sentences. Taking the above sentence "the weather is sunny today" as an example, replacing the words to be corrected with the replacement words yields three replacement sentences: the first, "Tianqi is sunny today" (only "weather" replaced); the second, "Tianqi is love man today" (both words replaced); and the third, "the weather is love man today" (only "sunny" replaced). In this embodiment, the logic scores of the sentence to be corrected and of the first, second and third replacement sentences are calculated, and the sentence with the highest logic score is selected as the final error correction result. This embodiment considers all possible replacements in the whole sentence to be corrected and calculates the logic score of every possible replacement sentence, which reduces the loss of accuracy for the whole sentence caused by inaccurate replacement of some of the words to be corrected.
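The combination strategy can be sketched with `itertools.product`. This is a minimal sketch under assumptions: `best_sentence`, `toy_score` and the score table are hypothetical, and the scores are invented placeholders for the output of a trained language logic model.

```python
from itertools import product

# Hypothetical scores standing in for a trained language logic model.
TOY_SCORES = {
    ("today", "weather", "sunny"): -2.0,
    ("today", "Tianqi", "sunny"): -5.0,
    ("today", "Tianqi", "love man"): -7.0,
    ("today", "weather", "love man"): -6.0,
}


def toy_score(tokens):
    return TOY_SCORES[tuple(tokens)]


def best_sentence(tokens, confusion_set, score):
    """Score every keep/replace combination and return the top-scoring one."""
    # Each position offers the original token first, then its replacements.
    options = [[tok, *confusion_set.get(tok, [])] for tok in tokens]
    candidates = [list(combo) for combo in product(*options)]
    # max() returns the first maximum, so the original wins any tie.
    return max(candidates, key=score)


confusion = {"weather": ["Tianqi"], "sunny": ["love man"]}
print(best_sentence(["today", "weather", "sunny"], confusion, toy_score))
# ['today', 'weather', 'sunny']
```

Listing the original token first at each position means the unmodified sentence is the first candidate, so a tie in score leaves the sentence unchanged, consistent with replacing only when the first logic score is strictly smaller. The number of candidates grows multiplicatively with the number of words to be corrected.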

For another example, in some other embodiments, the words to be corrected are replaced one by one according to their order in the sentence to be corrected. Taking the above sentence to be corrected, "the weather is sunny today", as an example, the word "weather" is replaced first according to the order, giving the replacement sentence "Tianqi is sunny today". The first logic score corresponding to the sentence to be corrected and the second logic score corresponding to the replacement sentence "Tianqi is sunny today" are calculated, and according to the comparison of the logic scores it is determined that the original sentence to be corrected, "the weather is sunny today", is more in line with language logic, so "weather" does not need to be replaced. After the word "weather" there is another word to be corrected in the sentence, namely "sunny", whose corresponding replacement word is "lover", so the next replacement sentence generated is "the weather is lover today". The logic scores of the sentence to be corrected and of this replacement sentence are likewise calculated, and it is finally determined that the word "sunny" does not need to be replaced, thereby completing the error correction of the sentence to be corrected. In this embodiment, the words to be corrected are replaced and compared one by one according to their order in the sentence to be corrected; the logic is simpler, and where one word to be corrected corresponds to multiple replacement words, the number of calculations can be reduced and the error correction efficiency improved.
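The one-by-one embodiment can be sketched in the same way. Again, `correct_sequentially` and `toy_logic_score` are hypothetical names, and the scorer is only a placeholder for the real logic score; a replacement is kept only when the first logic score (current sentence) is smaller than the second logic score (trial sentence).

```python
def correct_sequentially(tokens, replacements, logic_score):
    """Replace words to be corrected one at a time, in sentence order,
    keeping a replacement only if it raises the logic score."""
    current = list(tokens)
    for i, tok in enumerate(tokens):
        for candidate in replacements.get(tok, []):
            trial = current[:i] + [candidate] + current[i + 1:]
            if logic_score(current) < logic_score(trial):
                current = trial
    return current

# Hypothetical stand-in for the logic score: counts plausible adjacent pairs.
PLAUSIBLE_PAIRS = {("today", "weather"), ("weather", "sunny")}

def toy_logic_score(sentence):
    return sum((a, b) in PLAUSIBLE_PAIRS for a, b in zip(sentence, sentence[1:]))

unchanged = correct_sequentially(
    ["today", "weather", "sunny"],
    {"weather": ["Tianqi"], "sunny": ["lover"]},
    toy_logic_score,
)
corrected = correct_sequentially(
    ["today", "Tianqi", "lover"],
    {"Tianqi": ["weather"], "lover": ["sunny"]},
    toy_logic_score,
)
```

Here the number of scored trial sentences equals the total number of replacement words rather than the product of the options per word, which is the efficiency gain described above.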

According to the above embodiments, this application does not specifically limit how a sentence to be corrected that contains multiple words to be corrected is processed. Comparing the logic scores before and after replacement can improve the accuracy with which the words to be corrected are replaced, thereby reducing the miscorrection rate of speech recognition.

Through steps S100-S170, the embodiment of the present application provides a speech recognition error correction method: first, speech recognition is performed on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected; the FST to be detected is constructed according to the pronunciation sequence to be detected; according to the FST to be detected and the obtained keyword FST, several words to be corrected in the text to be detected are determined, and the sentence to be corrected that contains the words to be corrected is determined; if the words to be corrected exist in the Chinese character confusion set, the replacement word corresponding to each word to be corrected is determined, and the words to be corrected in the sentence to be corrected are replaced with the replacement words to generate a replacement sentence; the first logic score of the sentence to be corrected and the second logic score of the replacement sentence are calculated, and when the first logic score is smaller than the second logic score, the sentence to be corrected in the text to be detected is replaced with the replacement sentence, thereby completing the error correction of the speech recognition text.

Compared with solutions in the related art that rely on grammar or syntax, the speech recognition error correction method proposed in the embodiment of the present application determines the words to be corrected that may contain errors according to the pronunciation of the speech recognition text, provides replacement words for the words to be corrected according to the Chinese character confusion set of the corresponding business, and finally determines whether error correction is required according to the logic scores of the corresponding sentence before and after the words to be corrected are replaced. It can be seen that the embodiment of the present application can find recognition errors caused by mispronunciation in the specified business field, thereby increasing the probability of finding the words to be corrected; and, by using the Chinese character confusion set and the comparison of logic scores, it can effectively correct these recognition errors and reduce the miscorrection rate of speech recognition text, thereby effectively improving the accuracy of speech recognition text, so that speech recognition technology can play a greater role in digital medical, smart home and other fields.

In some embodiments, the speech recognition error correction method proposed by the embodiment of the present application also includes the steps of constructing a keyword FST and constructing a Chinese character confusion set. Referring to FIG. 3, FIG. 3 is a flow chart of the steps of constructing the keyword FST and the Chinese character confusion set provided by an embodiment of the present application. The method includes but is not limited to steps S300-S380:

Step S300, acquiring the training voice, the training voice and the voice to be detected belong to the same vertical field;

Specifically, multiple training voices are obtained. As explained above, the training voices and the voice to be detected belong to the same vertical field, and voices from that field serve as the training voices.

Step S310, performing speech recognition on the training speech to obtain speech recognition text;

Specifically, the speech recognition technology in the related art is used to perform speech recognition on the training speech to obtain a speech recognition text, which is a text sequence corresponding to the training speech.

Step S320, according to the speech recognition text, determine the corresponding first pronunciation sequence;

Specifically, this step S320 can refer to step S100 in FIG. 1 , that is, look up each word in the speech recognition text to determine its corresponding unique pinyin, and record these pinyin in sequence as the first pronunciation sequence.
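The table lookup described here can be sketched as follows. The table contents and the function name `pronunciation_sequence` are assumptions made for illustration; the three Chinese characters are only a hypothetical excerpt of a full character-to-pinyin table, chosen to match the "fen xi" and "fen qi" examples used later in this description.

```python
# Hypothetical excerpt of a character-to-pinyin table (one pinyin per character).
PINYIN_TABLE = {"分": "fen", "析": "xi", "期": "qi"}

def pronunciation_sequence(text, table=PINYIN_TABLE):
    """Look up each character and record the pinyin units in order.
    Raises KeyError for a character that is missing from the table."""
    return [table[ch] for ch in text]

first_sequence = pronunciation_sequence("分析")
second_sequence = pronunciation_sequence("分期")
```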

Step S330, performing manual recognition on the training speech to obtain the manual recognition text;

Specifically, the training speech that has undergone speech recognition in step S310 is recognized manually, that is, a person listens to the training speech, converts what is heard into text, and records it as the manually recognized text. Taking China as an example, although most people use Chinese and can listen, speak, read and write Mandarin, local dialects are still preserved in many places, so a considerable number of people speak Mandarin with an accent that is not standard, and the training speech therefore contains many non-standard speech fragments. Without learning, speech recognition cannot determine which words these accented fragments correspond to in the training speech, whereas people can recognize the training speech more accurately based on social life experience and context. Therefore, in this step, the training speech needs to be recognized manually to generate the manually recognized text. The manually recognized text is also a character sequence corresponding to the training speech, and it is basically consistent with the speech recognition text in the number of words and the distribution of words, so the speech recognition text can be compared with the manually recognized text.

Step S340, according to the manually recognized text, determine the corresponding second pronunciation sequence;

Specifically, this step S340 can refer to step S100 in FIG. 1 , that is, perform table lookup for each word in the manually recognized text, determine its corresponding unique pinyin, and record these pinyin in sequence as the second pronunciation sequence.

Step S350, determine the Chinese character confusion set according to the speech recognition text and the manual recognition text;

Specifically, as mentioned above, the number of words and the distribution of words in the speech recognition text are basically the same as those in the manually recognized text, so the two texts can be compared to generate the Chinese character confusion set. For the specific process of generating the Chinese character confusion set, refer to FIG. 4.

Referring to Fig. 4, Fig. 4 is the flow chart of the steps of constructing the Chinese character confusion set provided by the embodiment of the present application, the method includes but not limited to steps S351-S354:

Step S351, comparing the first word in the speech recognition text with the second word in the corresponding position in the manual recognition text;

Specifically, as mentioned above, the number of words and the distribution of words in the speech recognition text and the manually recognized text are basically the same, so a character or word in the speech recognition text can be taken as the first word, a character or word in the manually recognized text can be taken as the second word, and the first word and the second word at corresponding positions are compared. All the first words are put in one-to-one correspondence with all the second words to complete the comparison between the speech recognition text and the manually recognized text.

It can be understood that the first word and the second word should correspond in position and have the same number of characters; that is to say, a single character is compared with a single character, a multi-character word is compared with a multi-character word, and the two compared words contain the same number of characters.

It should be noted that, in some embodiments, step S351 can perform word segmentation on the speech recognition text and the manually recognized text and then compare them. For example, if the speech recognition text is "I am walking" and the manually recognized text is "Wo am walking", then by corresponding position "I" is compared with "Wo", and "walking" is compared with "walking".

In some other embodiments, step S351 can also compare the speech recognition text and the manually recognized text character by character. For example, if the speech recognition text is "I am walking" and the manually recognized text is "Wo am walking", the comparison is made between "I" and "Wo", between "walk" and "walk", and between "road" and "road". When a difference between the speech recognition text and the manually recognized text is found at the position of "I" and "Wo", it is processed in the next step S352.

Step S352, if there is a difference between the current first word and the current second word, use the current first word and the current second word as the replacement word, and store the replacement word in the first candidate area;

Specifically, when there is a difference between the first word and the second word, both the first word and the second word are used as replacement words and stored in a candidate area. For example, if the speech recognition text is "I am walking" and the manually recognized text is "Wo am walking", then "I" and "Wo" are compared at the corresponding position. It is found that "I" and "Wo" correspond in position and have the same number of characters but differ, so the first word "I" and the second word "Wo" are stored in the same first candidate area. Similarly, the first word and the second word at the next position are compared, and if there is a difference, they are stored in another first candidate area.

As mentioned in step S351, the speech recognition text and the manually recognized text may be segmented before comparison, in which case step S352 can directly determine whether the replacement word is a single character or a multi-character word. If step S351 compares the speech recognition text and the manually recognized text character by character, then in step S352 whether the replacement word is a single character or a multi-character word can be determined from the language logic before and after the first word or the second word, using the related art. The embodiment of the present application does not specifically limit the number of characters in a replacement word. What this step intends to illustrate is that the characters or words that can be replaced can be determined according to the differences between the speech recognition text and the manually recognized text.

Step S353, when the comparison between the first word and the second word is completed, several first candidate areas with the same word are merged into the same second candidate area;

Specifically, after the first words in the speech recognition text have been compared with the second words in the manually recognized text, there are as many first candidate areas as there are differences. For example, if the speech recognition text is "today Tianqi lover" and the manually recognized text is "the weather is sunny today", steps S351-S352 determine that the replacement words in one first candidate area are "Tianqi" and "weather", and the replacement words in another first candidate area are "sunny" and "lover". It can be understood that the training speech may be relatively long, the same word may appear multiple times, and speech recognition may produce different recognition results for the same word. For example, if the next sentence of the speech recognition text is "the weather will be clear tomorrow" and the next sentence of the manually recognized text is "the weather will be sunny tomorrow", then for these two sentences the replacement words in a third first candidate area are "clear" and "sunny". Since error correction needs to correct as many misrecognitions as possible on the basis of the manually recognized text, several first candidate areas containing the same word can be merged into the same second candidate area. For the example given in this step S353, the speech recognition text is "today Tianqi lover, the weather will be clear tomorrow" and the manually recognized text is "the weather is sunny today, the weather will be sunny tomorrow". After the first candidate areas are merged, two second candidate areas can be determined: the replacement words in one second candidate area are "Tianqi" and "weather", and the replacement words in the other second candidate area are "sunny", "lover" and "clear". That is to say, there may be more than two replacement words in one second candidate area.

As mentioned above, step S140 in FIG. 1 is: if the words to be corrected exist in the Chinese character confusion set corresponding to the business, determine the replacement word corresponding to each word to be corrected. That is, for a word to be corrected, the words that can replace it are determined in the corresponding second candidate area. Following the above example, if the word to be corrected is one of the three similar-sounding words in the same second candidate area ("sunny", "lover" and "clear"), the corresponding replacement words are the other two. The word to be corrected is replaced at its position by each of the replacement words in turn, so as to determine the word most in line with language logic as the final error correction result.

Step S354, determining a confusion set of Chinese characters, which includes a plurality of second candidate regions.

Specifically, when all the first candidate areas containing the same word have been merged, the construction of the Chinese character confusion set is complete. The Chinese character confusion set contains several second candidate areas, each second candidate area contains two or more replacement words, and the replacement words are the first words and the second words.

Through steps S351-S354, the embodiment of the present application provides a method for constructing a Chinese character confusion set: the speech recognition text is compared with the manually recognized text, the replacement words are determined according to the differences, and the first candidate areas containing the same replacement word are merged, so that the different misrecognition results for the same word are discovered to the greatest possible extent.
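Steps S351-S354 can be illustrated by the following Python sketch. It assumes that both texts have already been segmented into aligned word lists of equal length; the function names are hypothetical, and the English glosses stand in for the Chinese words of the example in step S353.

```python
def first_candidate_areas(asr_words, manual_words):
    """S351-S352: compare words at corresponding positions and store each
    differing pair in its own first candidate area."""
    return [{a, m} for a, m in zip(asr_words, manual_words) if a != m]

def merge_areas(areas):
    """S353: merge first candidate areas that share a word into the same
    second candidate area."""
    merged = []
    for area in areas:
        area = set(area)
        for other in [m for m in merged if m & area]:
            area |= other
            merged.remove(other)
        merged.append(area)
    return merged

def replacement_words(word, confusion_set):
    """Look up the replacement words for a word to be corrected."""
    for area in confusion_set:
        if word in area:
            return area - {word}
    return set()

asr = ["today", "Tianqi", "lover", "tomorrow", "weather", "clear"]
manual = ["today", "weather", "sunny", "tomorrow", "weather", "sunny"]
confusion_set = merge_areas(first_candidate_areas(asr, manual))
```

For this input, three first candidate areas are produced and merged into two second candidate areas, matching the example in step S353.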

Through steps S351-S354, step S350 has been described clearly, and step S360 will be described below.

Step S360, determine the pronunciation confusion set according to the first pronunciation sequence and the second pronunciation sequence;

Specifically, since the speech recognition text and the manually recognized text have basically the same number of words and distribution of words, and each character corresponds to one pinyin unit, the first pronunciation sequence and the second pronunciation sequence can also be compared to generate a pronunciation confusion set. For the specific process of generating the pronunciation confusion set, refer to FIG. 5.

Referring to FIG. 5, FIG. 5 is a flow chart of steps for constructing a pronunciation confusion set provided by an embodiment of the present application. The method includes but is not limited to steps S361-S364:

Step S361, comparing the first pinyin unit in the first pronunciation sequence with the second pinyin unit in the corresponding position in the second pronunciation sequence;

Specifically, a pinyin unit is the pinyin of one character. Comparing the first pinyin unit in the first pronunciation sequence with the second pinyin unit at the corresponding position in the second pronunciation sequence actually compares the pinyin unit of each character in the speech recognition text with the pinyin unit of the corresponding character in the manually recognized text.

Step S362, if there is a difference between the current first pinyin unit and the current second pinyin unit, storing the current first pinyin unit and the current second pinyin unit into the first confusion area;

Specifically, suppose there is a difference between the current first pinyin unit and the current second pinyin unit. For example, the first pronunciation sequence is "fen xi" (the corresponding text is "analysis") and the second pronunciation sequence is "fen qi" (the corresponding text is "staging"). After comparison, "xi" and "qi" correspond in position but differ, so "xi" and "qi" are stored in a first confusion area.

Step S363, when the comparison between the first pinyin unit and the second pinyin unit is completed, a number of first confusion areas with the same pinyin are merged into the same second confusion area;

Specifically, step S363 can refer to step S353: several first confusion areas containing the same pinyin are merged into the same second confusion area. For example, if the first pinyin unit in one first confusion area is "xi" and the second pinyin unit is "qi", while the first pinyin unit in another first confusion area is "ji" and the second pinyin unit is "qi", then both first confusion areas contain the same pinyin unit "qi". The two first confusion areas are therefore merged into the same second confusion area, and this new second confusion area includes three pinyin units, namely "xi", "qi" and "ji".

By merging confusion areas, as many similar-sounding pronunciations as possible are collected in the same second confusion area, which helps the speech recognition error correction method in the embodiment of the present application to discover more errors caused by similar pronunciation, thereby improving the accuracy of error correction.

Step S364, determine the pronunciation confusion set, the pronunciation confusion set includes several second confusion areas;

Specifically, when all confusion areas containing the same pinyin unit have been merged, the construction of the pronunciation confusion set is complete. The pronunciation confusion set includes several second confusion areas, and each second confusion area contains two or more pinyin units.

Through steps S361-S364, the embodiment of the present application provides a method for constructing a pronunciation confusion set. The pronunciation confusion set is determined from the differences between the first pronunciation sequence and the second pronunciation sequence, and similar-sounding pronunciations are gathered as far as possible by merging confusion areas, which helps to reduce the impact of the user's non-standard pronunciation on speech recognition error correction and to improve the accuracy of speech recognition error correction.
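Steps S361-S364 can be sketched with a small union-find structure, which is one possible way (not stated in this application) to merge first confusion areas that share a pinyin unit. The function name is hypothetical, and the second half of the sample sequences ("fen ji" against "fen qi") is invented solely to reproduce the "ji"/"qi" confusion area mentioned in step S363.

```python
def pronunciation_confusion_set(first_seq, second_seq):
    """Compare pinyin units position by position (S361), record each differing
    pair (S362) and merge pairs that share a pinyin unit (S363)."""
    parent = {}

    def find(unit):
        parent.setdefault(unit, unit)
        while parent[unit] != unit:
            parent[unit] = parent[parent[unit]]  # path compression
            unit = parent[unit]
        return unit

    for a, b in zip(first_seq, second_seq):
        if a != b:
            parent[find(a)] = find(b)

    # S364: each group of connected pinyin units is one second confusion area.
    areas = {}
    for unit in list(parent):
        areas.setdefault(find(unit), set()).add(unit)
    return list(areas.values())

first_sequence = ["fen", "xi", "fen", "ji"]    # from the speech recognition text
second_sequence = ["fen", "qi", "fen", "qi"]   # from the manually recognized text
confusion_areas = pronunciation_confusion_set(first_sequence, second_sequence)
```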

Through steps S361-S364, step S360 has been described, and step S370 will be described below.

Step S370, determine the keyword list according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence;

Specifically, this step S370 needs to determine the keyword table corresponding to the service, and the keyword table is used to characterize words that are easily recognized incorrectly in speech recognition of the service. Refer to Figure 6 for the construction of the keyword table.

Referring to FIG. 6, FIG. 6 is a flow chart of steps for constructing a keyword table provided by an embodiment of the present application. The method includes but is not limited to steps S371-S372:

Step S371, determine the key pinyin according to the pronunciation confusion set;

Specifically, since the pronunciation confusion set is constructed according to the differences between the first pronunciation sequence and the second pronunciation sequence, the pronunciation confusion set can in turn be used to determine which parts of the first pronunciation sequence and the second pronunciation sequence differ, that is, to determine the differing pinyin units. Whether a differing pinyin unit corresponds to a standalone single character or to a character within a word is determined according to the related art. If it is a single character, the pinyin unit itself is used as key pinyin. Key pinyin includes several first key pinyin in the first pronunciation sequence and several second key pinyin in the second pronunciation sequence; therefore, depending on whether the pinyin unit is located in the first pronunciation sequence or the second pronunciation sequence, it is determined to be first key pinyin or second key pinyin. If it is a character within a word, the multiple pinyin units of that word, including this pinyin unit, are together used as the first key pinyin or the second key pinyin.

For example, "xi" and "qi" are contained in a confusion area of the pronunciation confusion set, corresponding to the first pronunciation sequence, "xi" is actually a word in a word "fen-xi", so "fen-xi " as the first key pinyin, and correspondingly take "fen-qi" in the second pronunciation sequence as the second key pinyin.

Step S372: Use the words corresponding to the first key pinyin in the speech recognition text and the words corresponding to the second key pinyin in the manual recognition text as keywords, and store the keywords in the keyword table.

Specifically, the first key pinyin is associated with the speech recognition text, and the second key pinyin is associated with the manually recognized text, to obtain the keywords. In the above example, the first key pinyin "fen-xi" corresponds to the keyword "analysis" in the speech recognition text, and the second key pinyin "fen-qi" corresponds to the keyword "staging" in the manually recognized text, so both "analysis" and "staging" are stored in the keyword table.

Through steps S371-S372, the embodiment of the present application provides a method for constructing a keyword table, which mainly stores words with differences between the first pronunciation sequence and the second pronunciation sequence into the keyword table.
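A minimal sketch of steps S371-S372 follows. It assumes that both texts are already segmented into aligned words, each paired with its pinyin units, and that the pronunciation confusion set is given as a list of sets; the function name, the data layout and the sample word "we" are hypothetical.

```python
def build_keyword_table(asr_words, manual_words, pronunciation_confusion_set):
    """Store, as keywords, the aligned words whose pinyin differs in units
    that belong to the pronunciation confusion set."""
    confusable = set().union(*pronunciation_confusion_set)
    table = {}
    for (word1, pinyin1), (word2, pinyin2) in zip(asr_words, manual_words):
        differs = any(
            a != b and a in confusable and b in confusable
            for a, b in zip(pinyin1, pinyin2)
        )
        if differs:
            table[word1] = pinyin1   # keyword with its first key pinyin
            table[word2] = pinyin2   # keyword with its second key pinyin
    return table

asr_words = [("we", ("wo", "men")), ("analysis", ("fen", "xi"))]
manual_words = [("we", ("wo", "men")), ("staging", ("fen", "qi"))]
keyword_table = build_keyword_table(asr_words, manual_words, [{"xi", "qi", "ji"}])
```

The whole word's pinyin ("fen", "xi") is kept rather than the single differing unit, which mirrors the case in step S371 where the differing pinyin unit belongs to a character within a word.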

Step S370 has been described through steps S371-S372, and step S380 will be described below.

Step S380, determine the keyword FST according to the keyword table;

Specifically, the keyword FST can be constructed through the words in the keyword table and their corresponding pinyin, and the construction process of the keyword FST can refer to FIG. 7 .

Referring to FIG. 7, FIG. 7 is a flow chart of steps for constructing a keyword FST provided by an embodiment of the present application. The method includes but is not limited to steps S381-S385:

Step S381, constructing the root node in the keyword FST;

Specifically, refer to FIG. 8 , which is a schematic diagram of a keyword FST provided in an embodiment of the present application. In this step S381, a root node is constructed as the starting point of the keyword FST, as shown in FIG. 8 , the root node is denoted by a label 800 .

Step S382, under the root node, construct the first child node according to the first pinyin unit in the key pinyin;

Specifically, any one of the key pinyin corresponding to the keyword table is selected, the first pinyin unit in the current key pinyin is determined, and the first child node under the root node is constructed according to that pinyin unit. That is to say, the first child node is in fact constructed according to the pinyin of the first character of each keyword in the keyword table.

It is understandable that, since the child nodes are constructed according to the key pinyin, different keywords may share the same first child node. For example, for "analysis" and "staging", the first pinyin unit of the corresponding key pinyin is "fen" in both cases, so these two keywords share one first child node.

As shown in Figure 8, for example keywords include "staging", "analysis", "isolation" and "firm", then according to the first pinyin unit in the key pinyin, three first child nodes 810 can be determined, "fen", "ge" and "shi" respectively.

Step S383, under the first sub-node, construct several second sub-nodes according to the second pinyin unit in the key pinyin and the order of the pinyin units in the key pinyin;

Specifically, after the first child node is determined, all the remaining pinyin units in the key pinyin other than the first pinyin unit are taken as second pinyin units, and the second child nodes are built downward in turn according to the second pinyin units and their order. It can be understood that when there are multiple remaining pinyin units, there are also multiple corresponding second child nodes.

As shown in Figure 8, for example, the keywords include "staging", "analysis", "isolation" and "firm". Under the three first child nodes 810, five second child nodes 820 can be constructed according to the remaining second pinyin units of each keyword, namely "qi", "xi", "li", "wu" and "suo". It is understandable that, since "firm" has three characters, two second pinyin units remain after the first pinyin unit is removed. According to the order of the pinyin units, two second child nodes are built downward, that is, the second child node corresponding to "suo" is built under the second child node corresponding to "wu".

From the above description, the second sub-node is constructed from all pinyin units except the first pinyin unit of the key pinyin, so for the same first sub-node, there may be multiple second sub-nodes on the same pronunciation path.

It can be understood that when the keyword is a single character, there is no second child node on the pronunciation path.

Step S384, under the second sub-node, construct several third sub-nodes according to the key pinyin and keyword table;

Specifically, as mentioned in step S383, the second child nodes already correspond to all the pinyin units that remain after the first pinyin unit is removed from the key pinyin, so a third child node can be added under the second child node at the end of the path. The third child node represents the keyword corresponding to the current key pinyin, and the keyword can be found in the keyword table.

As shown in Fig. 8, there are four nodes 830 in the third column, which are "staging", "analysis", "isolation" and "firm".

Step S385, adding an arc returning to the root node for each first child node and second child node;

Specifically, as shown in FIG. 8, an arc 940 returning to the root node is added to each first child node and each second child node. After the arcs are added, a closed search loop is formed between the first child nodes and the root node and between the second child nodes and the root node. For the clarity of FIG. 8, only two arcs are drawn in FIG. 8; in fact, all first child nodes and all second child nodes have arcs returning to the root node. As mentioned in step S120, the FST to be detected needs to be reorganized with the keyword FST to find the same nodes. As shown in FIG. 8, for example, if the node "fen" exists in both the FST to be detected and the keyword FST, the two FSTs go on to compare the child nodes of the next layer. If the two consecutive nodes in the FST to be detected are "fen-qi" while the pronunciation path entered in the keyword FST is "fen-xi", the results are inconsistent. With the arc added, however, the keyword FST returns to the root node when it finds that the second child node "xi" cannot be matched and compares again; the keyword FST then enters the pronunciation path "fen-qi", so that the two consecutive matching nodes "fen-qi" in the FST to be detected and the keyword FST are obtained. Forming a closed search loop ensures, as far as possible, that in the reorganization of step S120 all nodes in the keyword FST that are the same as nodes in the FST to be detected can be found, thereby improving the discovery rate of words to be corrected.

Through steps S381-S385, the embodiment of the present application provides a method for constructing a keyword FST, in which multiple child nodes are constructed according to the order of the pinyin units and the keyword table, and the discovery rate of words to be corrected is improved through the closed search loop.
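The structure built in steps S381-S385 can be approximated by a prefix tree over pinyin units, as in the sketch below. This is a simplification under stated assumptions and not an actual FST: the class `Node` and the functions `build_keyword_tree` and `find_keywords` are hypothetical, the FST reorganization of step S120 is not reproduced, and the arc returning to the root node is imitated by restarting the search from the root whenever a pinyin unit cannot be matched.

```python
class Node:
    def __init__(self):
        self.children = {}   # pinyin unit -> child node (first/second child nodes)
        self.keywords = []   # keywords ending at this node (third child nodes)

def build_keyword_tree(keyword_table):
    """S381-S384: root node, then one node per pinyin unit in order,
    then the keyword attached at the end of its pronunciation path."""
    root = Node()
    for keyword, pinyin in keyword_table.items():
        node = root
        for unit in pinyin:
            node = node.children.setdefault(unit, Node())
        node.keywords.append(keyword)
    return root

def find_keywords(root, pinyin_seq):
    """Scan a pronunciation sequence; on a mismatch, return to the root
    (imitating the arc of S385) and restart from the next position."""
    hits = []
    for start in range(len(pinyin_seq)):
        node = root
        for end in range(start, len(pinyin_seq)):
            node = node.children.get(pinyin_seq[end])
            if node is None:
                break
            if node.keywords:
                hits.append((start, end + 1, list(node.keywords)))
    return hits

keyword_table = {
    "staging": ("fen", "qi"),
    "analysis": ("fen", "xi"),
    "isolation": ("ge", "li"),
    "firm": ("shi", "wu", "suo"),
}
root = build_keyword_tree(keyword_table)
hits = find_keywords(root, ["jin", "tian", "fen", "qi", "shi", "wu", "suo"])
```

With the four keywords of FIG. 8, the root has the three first child nodes "fen", "ge" and "shi", and "staging" and "analysis" share the first child node "fen". Each hit gives the start and end positions of a matched pronunciation path together with its keyword.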

Through steps S381-S385, step S380 has been explained.

Through steps S300-S380, the embodiment of the present application provides a method for constructing the keyword FST and the Chinese character confusion set. The keyword FST and the Chinese character confusion set can be applied to the method steps shown in Figure 1 to complete the speech recognition error correction method proposed in the embodiment of the present application.

Through the combination of one or more embodiments, the embodiment of the present application provides a method that first performs speech recognition on the speech to be detected to generate the text to be detected and the corresponding pronunciation sequence to be detected; determines the FST to be detected according to the pronunciation sequence to be detected; then, according to the FST to be detected and the keyword FST of the corresponding business, determines several words to be corrected in the text to be detected and the sentences to be corrected that contain the words to be corrected; if the words to be corrected exist in the Chinese character confusion set, determines the replacement word corresponding to each word to be corrected, and replaces the words to be corrected in the sentence to be corrected with the replacement words to generate a replacement sentence; and, when the first logic score of the sentence to be corrected is smaller than the second logic score of the replacement sentence, replaces the sentence to be corrected in the text to be detected with the replacement sentence, thereby completing the error correction of the speech recognition text. In addition, the embodiment of the present application provides a specific construction method for the keyword FST and the Chinese character confusion set.

Compared with solutions in the related art that rely on grammar or syntax, the speech recognition error correction method proposed in the embodiment of the present application determines the words to be corrected that may contain errors according to the pronunciation of the speech recognition text, provides replacement words for the words to be corrected according to the Chinese character confusion set of the corresponding business, and finally determines whether error correction is required according to the logic scores of the corresponding sentence before and after the words to be corrected are replaced. It can be seen that the embodiment of the present application can find recognition errors caused by mispronunciation in the specified business field, thereby increasing the probability of finding the words to be corrected; and, by using the Chinese character confusion set and the comparison of logic scores, it can effectively correct these recognition errors and reduce the miscorrection rate of speech recognition text, thereby effectively improving the accuracy of speech recognition text, so that speech recognition technology can play a greater role in digital medical, smart home and other fields.

Referring to FIG. 9, FIG. 9 is a schematic diagram of a speech recognition error correction system provided by an embodiment of the present application. The system 900 includes but is not limited to a first module 910, a second module 920, a third module 930, a fourth module 940, a fifth module 950, a sixth module 960, a seventh module 970 and an eighth module 980. The first module is used to perform speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected. The second module is used to construct the FST to be detected according to the pronunciation sequence to be detected. The third module is used to obtain the keyword FST and the Chinese character confusion set, wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field. The fourth module is used to determine, according to the FST to be detected and the keyword FST, several words to be corrected and several sentences to be corrected in the text to be detected, wherein the sentences to be corrected contain the words to be corrected. The fifth module is used to determine, if the words to be corrected exist in the Chinese character confusion set, the replacement word corresponding to each word to be corrected according to the Chinese character confusion set. The sixth module is used to replace the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence. The seventh module is used to calculate the first logic score of the sentence to be corrected and the second logic score of the replacement sentence. The eighth module is used to replace the sentence to be corrected in the text to be detected with the replacement sentence when the first logic score is smaller than the second logic score.
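As an illustration of the Chinese character confusion set obtained by the third module, the sketch below follows the construction recited in this application: words at corresponding positions in the speech recognition text and the manual recognition text are compared, differing pairs are stored, and groups that share a word are merged. It assumes the two texts are already segmented into equally long, position-aligned word lists; that assumption, the function name and the sample words are hypothetical.

```python
from typing import Dict, List, Set


def build_confusion_set(asr_words: List[str], manual_words: List[str]) -> Dict[str, Set[str]]:
    """Map each confusable word to the other words in its merged candidate group."""
    groups: List[Set[str]] = []
    for first, second in zip(asr_words, manual_words):
        if first == second:
            continue  # no recognition difference at this position
        pair = {first, second}  # corresponds to a first candidate area
        # Merge with every existing group sharing a word (second candidate area).
        for group in [g for g in groups if g & pair]:
            pair |= group
            groups.remove(group)
        groups.append(pair)
    return {word: group - {word} for group in groups for word in group}


if __name__ == "__main__":
    asr = ["宝险", "里赔", "宝险"]
    manual = ["保险", "理赔", "保鲜"]
    print(build_confusion_set(asr, manual))
```

Returning a word-to-alternatives mapping is a design choice for convenient lookup by the fifth module; the merged groups themselves carry the same information.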

Referring to FIG. 10, FIG. 10 is a schematic diagram of a device provided by an embodiment of the present application. The device 1000 includes at least one processor 1010 and at least one memory 1020 for storing at least one program; one processor and one memory are taken as an example in FIG. 10.

The processor and the memory may be connected through a bus or in other ways; connection through a bus is taken as an example in FIG. 10.

As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, which remote memory may be connected to the device via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Another embodiment of the present application also provides an apparatus, which can be used to execute the speech recognition error correction method in any of the above embodiments, for example, the method steps in FIG. 1 described above.

The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The embodiment of the present application also discloses a computer storage medium storing a processor-executable program, wherein the processor-executable program, when executed by the processor, implements the speech recognition error correction method proposed by the present application. The computer-readable storage medium may be nonvolatile or volatile.

Those skilled in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware or an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The above is a specific description of the preferred implementations of the present application, but the present application is not limited to these implementations. Those skilled in the art can make various equivalent modifications or replacements without departing from the spirit of the present application, and all such equivalent modifications or replacements fall within the scope defined by the claims of the present application.

Claims

A speech recognition error correction method, wherein the method includes:

performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected;

According to the pronunciation sequence to be detected, construct the FST to be detected;

Obtaining a keyword FST and a Chinese character confusion set; wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field;

According to the FST to be detected and the keyword FST, determining several words to be corrected and several sentences to be corrected in the text to be detected; wherein the sentences to be corrected include the words to be corrected;

If the word to be corrected exists in the confusion set of Chinese characters, determine the replacement word corresponding to each word to be corrected according to the confusion set of Chinese characters;

replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence;

calculating a first logical score of the sentence to be corrected and a second logical score of the replacement sentence;

When the first logic score is smaller than the second logic score, the sentence to be corrected in the text to be detected is replaced with the replacement sentence.
The error correction method according to claim 1, wherein, according to the FST to be detected and the keyword FST, determining several words to be corrected in the text to be detected includes:

Composing the FST to be detected with the keyword FST;

If the FST to be detected and the keyword FST share several identical nodes, determining the words in the text to be detected that correspond to the identical nodes as the words to be corrected;

Wherein the number of identical nodes is the same as the number of characters in the word to be corrected.
The error correction method according to claim 1, wherein said acquisition of keyword FST and Chinese character confusion set comprises:

Obtaining training speech, wherein the training speech and the speech to be detected belong to the same vertical field;

Performing speech recognition on the training speech to obtain speech recognition text;

Determining a corresponding first pronunciation sequence according to the speech recognition text;

Performing manual recognition on the training speech to obtain manual recognition text;

Determining a corresponding second pronunciation sequence according to the manual recognition text;

Constructing the Chinese character confusion set according to the speech recognition text and the manual recognition text;

Constructing a pronunciation confusion set according to the first pronunciation sequence and the second pronunciation sequence;

Constructing a keyword table according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence;

Constructing the keyword FST according to the keyword table.
The error correction method according to claim 3, wherein said constructing said Chinese character confusion set according to said speech recognition text and said manual recognition text comprises:

Comparing a first word in the speech recognition text with a second word at the corresponding position in the manual recognition text;

If the current first word differs from the current second word, using the current first word and the current second word as replacement words, and storing the replacement words in a first candidate area;

When the comparison of the first words and the second words is completed, merging several first candidate areas that contain a same word into a same second candidate area;

Constructing the Chinese character confusion set including several second candidate areas.
The error correction method according to claim 3, wherein said constructing said pronunciation confusion set according to said first pronunciation sequence and said second pronunciation sequence comprises:

comparing the first pinyin unit in the first pronunciation sequence with the second pinyin unit in the corresponding position in the second pronunciation sequence;

If there is a difference between the current first pinyin unit and the current second pinyin unit, storing the current first pinyin unit and the current second pinyin unit into the first confusion area;

When the comparison between the first pinyin unit and the second pinyin unit is completed, several first confusion areas with the same pinyin are merged into the same second confusion area;

Constructing the pronunciation confusion set including several second confusion areas.
The error correction method according to claim 3, wherein said constructing a keyword table according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence comprises:

Determining the key pinyin according to the pronunciation confusion set; wherein the key pinyin includes several first key pinyins in the first pronunciation sequence and several second key pinyins in the second pronunciation sequence;

Using the words corresponding to the first key pinyin in the speech recognition text and the words corresponding to the second key pinyin in the manual recognition text as keywords, and storing the keywords in the keyword table.
The error correction method according to claim 6, wherein said constructing said keyword FST according to said keyword table comprises:

Construct the root node in the keyword FST;

Under the root node, constructing a first child node according to the first pinyin unit in the key pinyin;

Under the first child node, constructing several second child nodes according to the second pinyin units in the key pinyin and the order of the pinyin units in the key pinyin; wherein the second pinyin units are all pinyin units in the key pinyin other than the first pinyin unit;

Under the second child nodes, constructing several third child nodes according to the key pinyin and the keyword table; wherein the third child nodes represent the keywords corresponding to the key pinyin;

Adding, to each of the first child node and the second child nodes, an arc returning to the root node, to obtain the keyword FST.
A speech recognition error correction system, wherein the system includes:

The first module is used to perform speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected;

The second module is used to construct the FST to be detected according to the pronunciation sequence to be detected;

The third module is used to obtain the keyword FST and the confusion set of Chinese characters; wherein, the keyword FST, the confusion set of Chinese characters and the FST to be detected belong to the same vertical field;

The fourth module is used to determine several words to be corrected and several sentences to be corrected in the text to be detected according to the FST to be detected and the keyword FST; wherein the sentences to be corrected include the words to be corrected;

The fifth module is used to determine the replacement word corresponding to each word to be corrected according to the Chinese character confusion set if the word to be corrected exists in the Chinese character confusion set;

The sixth module is used to replace the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence;

The seventh module is used to calculate the first logic score of the sentence to be corrected and the second logic score of the replacement sentence;

An eighth module, configured to replace the sentence to be corrected in the text to be detected with the replacement sentence when the first logic score is smaller than the second logic score.
A device, comprising:

at least one processor;

at least one memory for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor is caused to implement a speech recognition error correction method;

Wherein, the speech recognition error correction method includes:

performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected;

According to the pronunciation sequence to be detected, construct the FST to be detected;

Obtaining a keyword FST and a Chinese character confusion set; wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field;

According to the FST to be detected and the keyword FST, determining several words to be corrected and several sentences to be corrected in the text to be detected; wherein the sentences to be corrected include the words to be corrected;

If the word to be corrected exists in the confusion set of Chinese characters, determine the replacement word corresponding to each word to be corrected according to the confusion set of Chinese characters;

replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence;

calculating a first logical score of the sentence to be corrected and a second logical score of the replacement sentence;

When the first logic score is less than the second logic score, the sentence to be corrected in the text to be detected is replaced with the replacement sentence.
The device according to claim 9, wherein, according to the FST to be detected and the keyword FST, determining several words to be corrected in the text to be detected includes:

Composing the FST to be detected with the keyword FST;

If the FST to be detected and the keyword FST share several identical nodes, determining the words in the text to be detected that correspond to the identical nodes as the words to be corrected;

Wherein the number of identical nodes is the same as the number of characters in the word to be corrected.
The device according to claim 9, wherein said obtaining the keyword FST and the confusion set of Chinese characters comprises:

Obtaining training speech, wherein the training speech and the speech to be detected belong to the same vertical field;

Performing speech recognition on the training speech to obtain speech recognition text;

Determining a corresponding first pronunciation sequence according to the speech recognition text;

Performing manual recognition on the training speech to obtain manual recognition text;

Determining a corresponding second pronunciation sequence according to the manual recognition text;

Constructing the Chinese character confusion set according to the speech recognition text and the manual recognition text;

Constructing a pronunciation confusion set according to the first pronunciation sequence and the second pronunciation sequence;

Constructing a keyword table according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence;

Constructing the keyword FST according to the keyword table.
The device according to claim 11, wherein said constructing said Chinese character confusion set according to said speech recognition text and said manual recognition text comprises:

Comparing a first word in the speech recognition text with a second word at the corresponding position in the manual recognition text;

If the current first word differs from the current second word, using the current first word and the current second word as replacement words, and storing the replacement words in a first candidate area;

When the comparison of the first words and the second words is completed, merging several first candidate areas that contain a same word into a same second candidate area;

Constructing the Chinese character confusion set including several second candidate areas.
The device according to claim 11, wherein said constructing said pronunciation confusion set according to said first pronunciation sequence and said second pronunciation sequence comprises:

comparing the first pinyin unit in the first pronunciation sequence with the second pinyin unit in the corresponding position in the second pronunciation sequence;

If there is a difference between the current first pinyin unit and the current second pinyin unit, storing the current first pinyin unit and the current second pinyin unit into the first confusion area;

When the comparison between the first pinyin unit and the second pinyin unit is completed, a plurality of the first confusion areas with the same pinyin are merged into the same second confusion area;

Constructing the pronunciation confusion set including several second confusion areas.
The device according to claim 11, wherein said constructing a keyword table according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence comprises:

Determining the key pinyin according to the pronunciation confusion set; wherein the key pinyin includes several first key pinyins in the first pronunciation sequence and several second key pinyins in the second pronunciation sequence;

Using the words corresponding to the first key pinyin in the speech recognition text and the words corresponding to the second key pinyin in the manual recognition text as keywords, and storing the keywords in the keyword table.
The device according to claim 14, wherein said constructing said keyword FST according to said keyword table comprises:

Construct the root node in the keyword FST;

Under the root node, constructing a first child node according to the first pinyin unit in the key pinyin;

Under the first child node, constructing several second child nodes according to the second pinyin units in the key pinyin and the order of the pinyin units in the key pinyin; wherein the second pinyin units are all pinyin units in the key pinyin other than the first pinyin unit;

Under the second child nodes, constructing several third child nodes according to the key pinyin and the keyword table; wherein the third child nodes represent the keywords corresponding to the key pinyin;

Adding, to each of the first child node and the second child nodes, an arc returning to the root node, to obtain the keyword FST.
A computer storage medium storing a processor-executable program, wherein the processor-executable program implements a speech recognition error correction method when executed by a processor;

Wherein, the speech recognition error correction method includes:

performing speech recognition on the speech to be detected to obtain the text to be detected and the corresponding pronunciation sequence to be detected;

According to the pronunciation sequence to be detected, construct the FST to be detected;

Obtaining a keyword FST and a Chinese character confusion set; wherein the keyword FST, the Chinese character confusion set and the FST to be detected belong to the same vertical field;

According to the FST to be detected and the keyword FST, determining several words to be corrected and several sentences to be corrected in the text to be detected; wherein the sentences to be corrected include the words to be corrected;

If the word to be corrected exists in the confusion set of Chinese characters, determine the replacement word corresponding to each word to be corrected according to the confusion set of Chinese characters;

replacing the word to be corrected in the sentence to be corrected with the replacement word to obtain a replacement sentence;

calculating a first logical score of the sentence to be corrected and a second logical score of the replacement sentence;

When the first logic score is smaller than the second logic score, the sentence to be corrected in the text to be detected is replaced with the replacement sentence.
The computer storage medium according to claim 16, wherein, according to the FST to be detected and the keyword FST, determining several words to be corrected in the text to be detected includes:

Composing the FST to be detected with the keyword FST;

If the FST to be detected and the keyword FST share several identical nodes, determining the words in the text to be detected that correspond to the identical nodes as the words to be corrected;

Wherein the number of identical nodes is the same as the number of characters in the word to be corrected.
The computer storage medium according to claim 16, wherein said obtaining the keyword FST and the confusion set of Chinese characters comprises:

Obtaining training speech, wherein the training speech and the speech to be detected belong to the same vertical field;

Performing speech recognition on the training speech to obtain speech recognition text;

Determining a corresponding first pronunciation sequence according to the speech recognition text;

Performing manual recognition on the training speech to obtain manual recognition text;

Determining a corresponding second pronunciation sequence according to the manual recognition text;

Constructing the Chinese character confusion set according to the speech recognition text and the manual recognition text;

Constructing a pronunciation confusion set according to the first pronunciation sequence and the second pronunciation sequence;

Constructing a keyword table according to the pronunciation confusion set, the speech recognition text, the manual recognition text, the first pronunciation sequence and the second pronunciation sequence;

Constructing the keyword FST according to the keyword table.
The computer storage medium according to claim 18, wherein said constructing said Chinese character confusion set according to said speech recognition text and said manual recognition text comprises:

Comparing a first word in the speech recognition text with a second word at the corresponding position in the manual recognition text;

If the current first word differs from the current second word, using the current first word and the current second word as replacement words, and storing the replacement words in a first candidate area;

When the comparison of the first words and the second words is completed, merging several first candidate areas that contain a same word into a same second candidate area;

Constructing the Chinese character confusion set including several second candidate areas.
The computer storage medium according to claim 18, wherein said constructing said pronunciation confusion set according to said first pronunciation sequence and said second pronunciation sequence comprises:

comparing the first pinyin unit in the first pronunciation sequence with the second pinyin unit in the corresponding position in the second pronunciation sequence;

If there is a difference between the current first pinyin unit and the current second pinyin unit, storing the current first pinyin unit and the current second pinyin unit into the first confusion area;

When the comparison between the first pinyin unit and the second pinyin unit is completed, a plurality of the first confusion areas with the same pinyin are merged into the same second confusion area;

Constructing the pronunciation confusion set including several second confusion areas.