CN116070621A - Error correction method and device for voice recognition result, electronic equipment and storage medium - Google Patents
- Publication number
- CN116070621A (application number CN202310081941.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- target
- recognition result
- combined
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Character Discrimination (AREA)
Abstract
The invention provides an error correction method and device for a voice recognition result, an electronic device, and a storage medium. The method comprises: acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; and correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene. The method, device, electronic device, and storage medium can improve the accuracy of the voice recognition result.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and apparatus for correcting errors in speech recognition results, an electronic device, and a storage medium.
Background
Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), aims at converting lexical content in human speech into computer readable inputs, such as keys, binary codes or character sequences.
However, existing speech recognition technologies cannot accurately recognize the user's speech in some special scenarios, for example when the speech contains homophones or near-homophones. Specifically, when the user says "Yueyue, come home and eat", where "Yueyue" is a name written with the character for "month", the speech recognition engine may output a homophone written with different characters (rendered in translation as "happy" or "Yue Yue"). It can be seen that the accuracy of existing speech recognition results is still low.
Disclosure of Invention
The invention provides an error correction method, an error correction device, electronic equipment and a storage medium for a voice recognition result, which are used for solving the defect of low accuracy of the voice recognition result in the prior art and achieving the purpose of improving the accuracy of the voice recognition result.
The invention provides an error correction method of a voice recognition result, which comprises the following steps:
acquiring an initial recognition result corresponding to the voice to be recognized;
determining a target scene corresponding to the voice to be recognized;
correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene.
According to the error correction method of the voice recognition result provided by the invention, the first target word stock comprises target words and target combined words determined based on the first text, and each target combined word comprises a target word and the word before and/or after that target word;
correcting the initial recognition result based on the first target word stock corresponding to the target scene to obtain the target recognition result comprises:
performing word segmentation on the initial recognition result to obtain at least one word segment;
for each word segment, obtaining a first combined word corresponding to the word segment, wherein the first combined word comprises the word segment and the previous word segment and/or the next word segment;
and correcting the word segment based on the target words, the target combined words, and the first combined word to obtain the target recognition result.
According to the error correction method of the voice recognition result provided by the invention, the first combined word comprises the word segment and the word segments before and after it, and the target combined word comprises the target word and the words before and after it;
correcting the word segment based on the target words, the target combined words, and the first combined word comprises:
searching the first target word stock for first target combined words having the same length as the first combined word;
searching, among the first target combined words, for second target combined words whose previous word is identical to the previous word segment in the first combined word and whose next word is identical to the next word segment in the first combined word;
if a second target combined word is found, searching, among the second target combined words, for third target combined words whose target word has the same pronunciation as the word segment of the first combined word;
and if a third target combined word is found, correcting the word segment based on the third target combined word.
According to the error correction method of the voice recognition result provided by the invention, correcting the word segment based on the third target combined word comprises:
if there is exactly one third target combined word, replacing the word segment with the target word in that third target combined word;
if there are at least two third target combined words, replacing the word segment with the target word of the third target combined word having the highest word frequency, wherein the word frequency represents how often the third target combined word occurs in the target scene;
and if the word frequencies of the third target combined words are the same, replacing the word segment with the target word of the third target combined word having the largest timestamp, wherein the timestamp represents the time at which the third target combined word occurred in the target scene.
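The combined-word lookup and tie-breaking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `CombinedEntry` record, the `correct_with_combined` function, and the `pron` callback (which maps a word to a pronunciation key) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CombinedEntry:
    """Hypothetical record for one target combined word in the word stock."""
    prev: str       # word before the target word
    target: str     # the target word itself
    nxt: str        # word after the target word
    freq: int       # word frequency in the target scene
    timestamp: int  # time of occurrence (unit assumed, e.g. epoch milliseconds)


def correct_with_combined(
    prev: str,
    seg: str,
    nxt: str,
    entries: Iterable[CombinedEntry],
    pron: Callable[[str], str],
) -> Optional[str]:
    """Return the replacement for `seg`, or None if correction fails."""
    length = len(prev) + len(seg) + len(nxt)
    # First target combined words: same total length as the first combined word.
    first = [e for e in entries
             if len(e.prev) + len(e.target) + len(e.nxt) == length]
    # Second target combined words: identical previous and next words.
    second = [e for e in first if e.prev == prev and e.nxt == nxt]
    # Third target combined words: target word pronounced like the word segment.
    third = [e for e in second if pron(e.target) == pron(seg)]
    if not third:
        return None
    # Highest word frequency wins; equal frequencies fall back to the
    # largest timestamp.
    return max(third, key=lambda e: (e.freq, e.timestamp)).target
```

Returning `None` signals that correction via the combined word failed, so a caller can fall back to a single-word lookup.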
According to the error correction method of the voice recognition result provided by the invention, the method further comprises the following steps:
if correcting the word segment based on the first combined word fails, searching the first target word stock for first target words having the same length and the same pronunciation as the word segment;
if exactly one first target word exists, replacing the word segment with that first target word;
if at least two first target words exist, replacing the word segment with the first target word having the highest word frequency, wherein the word frequency represents how often the first target word occurs in the target scene;
and if the word frequencies of the first target words are the same, replacing the word segment with the first target word having the largest timestamp, wherein the timestamp represents the time at which the first target word occurred in the target scene.
According to the error correction method of the voice recognition result provided by the invention, the method further comprises the following steps:
if no first target word exists in the first target word stock, searching for the first target word in a second target word stock, wherein the second target word stock is determined based on at least one second text in a general scene.
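The single-word fallback, first against the scene-specific word stock and then against the general one, might look like the sketch below. The `WordStock` layout (word mapped to a frequency and timestamp pair), the function names, and the `pron` callback are assumptions made for illustration only.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical layout: word -> (word frequency, timestamp)
WordStock = Dict[str, Tuple[int, int]]


def lookup_word(
    seg: str, stock: WordStock, pron: Callable[[str], str]
) -> Optional[str]:
    """Find a word with the same length and pronunciation as `seg`."""
    candidates = [w for w in stock
                  if len(w) == len(seg) and pron(w) == pron(seg)]
    if not candidates:
        return None
    # Tuples compare element by element: highest frequency first,
    # then the largest timestamp when frequencies are equal.
    return max(candidates, key=lambda w: stock[w])


def correct_word(
    seg: str,
    scene_stock: WordStock,
    general_stock: WordStock,
    pron: Callable[[str], str],
) -> Optional[str]:
    """Try the scene-specific word stock, then the general word stock."""
    found = lookup_word(seg, scene_stock, pron)
    if found is None:
        found = lookup_word(seg, general_stock, pron)
    return found
```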
According to the error correction method of the voice recognition result provided by the invention, the method further comprises the following steps:
if no first target word exists in the second target word stock, searching, among the second target combined words, for fourth target combined words whose target word is a fuzzy-sound match for the word segment of the first combined word;
if exactly one fourth target combined word is found, replacing the word segment with the target word in that fourth target combined word;
and if at least two fourth target combined words are found, replacing the word segment with the target word of the fourth target combined word whose word frequency is greater than a preset value and whose timestamp is less than a preset time.
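The text does not specify how a fuzzy-sound match is computed. One plausible reading, sketched below, normalizes toneless pinyin syllables so that commonly confused initials and finals share a key. The specific pairs (zh/z, ch/c, sh/s, l/n, ang/an, eng/en, ing/in) and the names `fuzzy_key` and `is_fuzzy_match` are assumptions, not taken from the source.

```python
from typing import Sequence

# Assumed fuzzy-sound pairs (common in pinyin input methods); the source
# does not list which pairs it uses.
_FUZZY_INITIALS = (("zh", "z"), ("ch", "c"), ("sh", "s"), ("l", "n"))
_FUZZY_FINALS = (("ang", "an"), ("eng", "en"), ("ing", "in"))


def fuzzy_key(syllable: str) -> str:
    """Map a toneless pinyin syllable to a key shared by its fuzzy variants."""
    for initial, merged in _FUZZY_INITIALS:
        if syllable.startswith(initial):
            syllable = merged + syllable[len(initial):]
            break
    for final, merged in _FUZZY_FINALS:
        if syllable.endswith(final):
            syllable = syllable[:-len(final)] + merged
            break
    return syllable


def is_fuzzy_match(a: Sequence[str], b: Sequence[str]) -> bool:
    """Two words match if they have equally many syllables with equal keys."""
    return len(a) == len(b) and all(
        fuzzy_key(x) == fuzzy_key(y) for x, y in zip(a, b))
```

Each word is passed as a sequence of syllables, so an exact pronunciation match is simply the special case in which no normalization rule fires.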
According to the error correction method of the voice recognition result provided by the invention, the method further comprises the following steps:
acquiring at least one first text in the target scene;
performing word splitting processing on each first text to obtain at least two target words;
combining the target word and the previous word and/or the next word of the target word aiming at each target word to obtain a target combined word;
and determining the first target word stock based on the target word and the target combined word.
According to the error correction method of the voice recognition result provided by the invention, the method further comprises the following steps:
upon receiving a modification instruction, input by a user, for modifying a target word in the target recognition result, determining whether the modified word has the same pronunciation as the target word or is a fuzzy-sound match for it;
and if so, reducing the word frequency of the second target word in the first target word stock that is identical to the target word, and the word frequency of each fifth target combined word that comprises the second target word.
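The feedback step can be illustrated with the following sketch. The amount by which the word frequency is reduced is not stated in the text, so the `step` parameter is an assumption, as are the function name, the dictionary layouts, and the `sounds_alike` callback; frequencies are floored at 0.

```python
from typing import Callable, Dict, Tuple


def penalize_rejected_word(
    word_freq: Dict[str, int],
    combined_freq: Dict[Tuple[str, ...], int],
    target: str,
    modified: str,
    sounds_alike: Callable[[str, str], bool],
    step: int = 1,  # assumed decrement; the source gives no value
) -> None:
    """Lower frequencies in place when the user replaces `target`
    with a word that sounds the same (or is a fuzzy-sound match)."""
    if modified == target or not sounds_alike(target, modified):
        return
    if target in word_freq:
        word_freq[target] = max(0, word_freq[target] - step)
    # Also penalize every combined word that contains the rejected word.
    for combo in combined_freq:
        if target in combo:
            combined_freq[combo] = max(0, combined_freq[combo] - step)
```

Lowering the frequency makes the rejected word less likely to win the frequency comparison the next time the same pronunciation is corrected.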
The invention also provides an error correction device of the voice recognition result, comprising:
the acquisition module is used for acquiring an initial recognition result corresponding to the voice to be recognized;
the determining module is used for determining a target scene corresponding to the voice to be recognized;
the error correction module is used for correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the error correction method of the speech recognition result as any one of the above when executing the program.
The invention also provides an electronic device comprising a microphone, a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the microphone is used for collecting voice to be recognized; the processor is used for acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of error correction of speech recognition results as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of error correction of speech recognition results as described in any of the above.
According to the error correction method and device for a voice recognition result, the electronic device, and the storage medium provided by the invention, after the initial recognition result corresponding to the voice to be recognized is acquired and the target scene corresponding to the voice to be recognized is determined, the initial recognition result is corrected using the first target word stock corresponding to the target scene to obtain the target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene. Because the same voice to be recognized may call for different recognition results in different scenes, the target scene is determined first and the correction is based on the pre-constructed first target word stock for that scene. Since this word stock is built from text in the target scene, it is more scene-specific, that is, its vocabulary more accurately reflects the target scene. Correcting the initial recognition result against it is therefore more accurate, which in turn improves the accuracy of the target recognition result.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for correcting errors in speech recognition results according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for constructing a first target word stock according to an embodiment of the present invention;
FIG. 3 is a second flow chart of a method for correcting errors in speech recognition results according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an error correction device for speech recognition results according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided by the present invention;
FIG. 6 is a second schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the progress of data processing technology and the rapid popularization of mobile internet, voice recognition technology is widely applied to various fields of society, such as intelligent home appliances, unmanned driving, artificial intelligent robots, etc.
At present, most voice recognition products on the market adopt online voice recognition. With online voice recognition, real-time network queries against cloud big data yield word stocks with richer semantic information, so that the machine can understand users' different phrasings, which broadens the application range of voice recognition products. However, limited by the complexity of Chinese characters, existing voice recognition cannot recognize all user speech completely and accurately. The prior art therefore proposes a solution in which the user manually uploads personalized words, that is, the user maintains a word stock personally; after the speech recognition result is obtained, the most probable target word is looked up in this word stock to correct the speech recognition result.
However, in the above solution the user manually maintains only a single word stock, which cannot solve the problem of inaccurate speech recognition results across all application scenarios, because the same word stock is used for error correction in the speech recognition tasks of every scenario. The same voice to be recognized may call for different recognition results in different scenes: for example, the final recognition result should be one written form of a name (rendered in translation as "Yue Yue") in an office scene, and a homophone written with different characters (rendered as "month and month") in a daily life scene. Because the above solution produces the same corrected result in both cases, the accuracy of its corrected speech recognition results is still not high. In addition, the personalized word stock is manually uploaded and maintained by the user, so maintenance is inefficient and the user's private information may be exposed.
Based on the above problems, the embodiment of the invention provides an error correction method for a voice recognition result. Because voice recognition results differ across application scenarios, a corresponding target word stock is pre-built for each application scenario. After an initial recognition result corresponding to the voice to be recognized is obtained and a target scene corresponding to the voice to be recognized is determined, the initial recognition result is corrected using the first target word stock corresponding to the target scene to obtain a target recognition result. The first target word stock is determined based on at least one first text in the target scene, so it is more scene-specific: it comprises a plurality of words from the target scene and therefore provides more accurate error correction information for each word in the initial recognition result. As a result, the error correction of the initial recognition result is more accurate, which in turn improves the accuracy of the target recognition result.
The following describes an error correction method for a speech recognition result according to an embodiment of the present invention with reference to fig. 1 to 3. The method may be applied to any speech recognition scenario. The method may be executed by an error correction device for the speech recognition result, such as a mobile phone, a computer, or any other electronic device capable of speech recognition.
Fig. 1 is a flow chart of a method for correcting errors of a speech recognition result according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101: Obtain an initial recognition result corresponding to the voice to be recognized.
The voice to be recognized may be speech in different languages, such as Chinese or English.
Specifically, after the voice to be recognized is collected through the microphone, it can be recognized either by an online voice recognition platform or by the electronic device itself to obtain the initial recognition result.
Step 102: Determine a target scene corresponding to the voice to be recognized.
It should be appreciated that the text corresponding to the voice to be recognized may differ across usage scenarios. For example, a person's name in the voice to be recognized may be written one way in a writing scenario and, with the same pronunciation, with different characters in a daily life scenario (both forms are rendered in translation as "Wu Yueyue"). Therefore, the target scene corresponding to the voice to be recognized can be determined according to the attributes of the APP in the electronic device from which the voice to be recognized originates.
For example, different scenes can be distinguished according to the attributes of the APPs installed in the electronic device: the target scene for voice to be recognized obtained from a social chat APP is defined as a daily life scene; the target scene for voice to be recognized obtained from a writing APP is defined as a writing scene; and the target scene for voice to be recognized obtained from a conference APP is defined as a conference scene.
By analogy, the target scene corresponding to the voice to be recognized, which is acquired from all APP in the electronic equipment, can be obtained.
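The APP-to-scene mapping in the examples above amounts to a lookup table. A minimal sketch follows; the category strings, the `Scene` enum, and the fallback to a general scene for unmapped APPs are assumptions made for illustration.

```python
from enum import Enum


class Scene(Enum):
    DAILY_LIFE = "daily life"
    WRITING = "writing"
    CONFERENCE = "conference"
    GENERAL = "general"  # assumed fallback for unmapped APPs


# Hypothetical APP categories; a real device would derive these from
# the attributes of the installed APPs.
_SCENE_BY_APP_CATEGORY = {
    "social_chat": Scene.DAILY_LIFE,
    "writing": Scene.WRITING,
    "conference": Scene.CONFERENCE,
}


def determine_scene(app_category: str) -> Scene:
    """Return the target scene for voice captured from the given APP category."""
    return _SCENE_BY_APP_CATEGORY.get(app_category, Scene.GENERAL)
```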
Step 103: Correct the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result.
Taking the target scene as a daily life scene as an example, the first target word stock may include a person name, daily necessities, network terms, and the like.
Specifically, after a first target word stock corresponding to a target scene is determined, word segmentation processing can be performed on an initial recognition result to obtain each word in the initial recognition result; and then, searching out words which are the same as or similar to the pronunciation of the words from the first target word stock, so as to correct the error of each word in the initial recognition result, thereby obtaining the target recognition result.
In addition, before the initial recognition result is corrected based on the first target word stock corresponding to the target scene, that word stock needs to be constructed. The first target word stock is determined based on at least one first text in the target scene.
According to the error correction method for the voice recognition result, after the initial recognition result corresponding to the voice to be recognized is acquired and the target scene corresponding to the voice to be recognized is determined, the initial recognition result is corrected using the first target word stock corresponding to the target scene to obtain the target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene. Because the same voice to be recognized may call for different recognition results in different scenes, the target scene is determined first and the correction is based on the pre-constructed first target word stock for that scene. Since this word stock is built from text in the target scene, it is more scene-specific, that is, its vocabulary more accurately reflects the target scene. Correcting the initial recognition result against it is therefore more accurate, which in turn improves the accuracy of the target recognition result.
Further, fig. 2 is a flow chart of a method for constructing a first target word stock according to an embodiment of the present invention. As shown in fig. 2, the method includes:
Step 201: Acquire at least one first text in the target scene.
The first text is finalized text input by a user in the target scene.
Specifically, the first text may be obtained from the text that the user inputs and submits in the target scene each time. The first text is the content that the user finally submits and no longer modifies, for example, text that the user enters by any input method and finally sends; alternatively, when the user is detected to exit the page and hide the keyboard after entering text, the entered text is determined to be the first text.
Step 202: Perform word segmentation on each first text to obtain at least two target words.
Word segmentation is required when the first text contains two or more target words.
Specifically, for each acquired first text, the text length is calculated. If the text length is equal to 1, no word segmentation is performed; if the text length is greater than 1, word segmentation is performed on the first text by a BERT (Bidirectional Encoder Representations from Transformers) model to obtain the segmentation result for the first text, that is, the target words in the first text. For example, when the first text entered by the user is "call Wu Yueyue" (word by word in the original order: "to", "Wu Yueyue", "make", "telephone"), the target words are "to", "Wu Yueyue", "make", and "telephone".
Step 203: For each target word, combine the target word with its previous word and/or next word to obtain target combined words.
Specifically, after each target word in the first text is obtained, the target word is spliced in order with its previous word and/or next word to obtain the target combined words corresponding to the target word, namely "previous word + target word", "target word + next word", and "previous word + target word + next word". For example, the target word "Wu Yueyue" in the first text "call Wu Yueyue" corresponds to the target combined words "to Wu Yueyue", "Wu Yueyue make", and "to Wu Yueyue make".
Further, when the target word is the first or last word in a sentence and therefore has only a next word or only a previous word, only "target word + next word" or "previous word + target word", respectively, is used as its target combined word. Continuing with the first text "call Wu Yueyue", the target word "telephone" is the last word in the sentence, so its target combined word is "make telephone". In addition, to improve the scene pertinence of the first target word stock, combined words may be formed only for target words longer than one character, that is, target words with poor scene pertinence are excluded.
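The construction of target combined words in step 203 can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, combined words are represented as tuples of words rather than spliced strings, and the optional exclusion of single-character target words is omitted.

```python
from typing import List, Tuple


def build_combined_words(
    words: List[str],
) -> List[Tuple[str, List[Tuple[str, ...]]]]:
    """For each target word, list its target combined words in order."""
    result = []
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        nxt = words[i + 1] if i + 1 < len(words) else None
        combos: List[Tuple[str, ...]] = []
        if prev is not None:
            combos.append((prev, word))        # previous word + target word
        if nxt is not None:
            combos.append((word, nxt))         # target word + next word
        if prev is not None and nxt is not None:
            combos.append((prev, word, nxt))   # previous + target + next
        result.append((word, combos))
    return result
```

For a segmented sentence such as `["to", "Wu Yueyue", "make", "telephone"]` (English glosses used as placeholders), the last word receives only the "previous word + target word" combination, and a word in the middle receives all three.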
Step 204: a first target word stock is determined based on the target word and the target combined word.
Specifically, after each target word in the first text and the target combination word corresponding to each target word are obtained, all the target word and the target combination word corresponding to the target word are imported into the first target word stock to obtain the latest first target word stock, namely the first target word stock can be updated in real time.
Further, the obtained word frequency and time stamp information of each target word in the first text and the target combination word corresponding to the target word also need to be imported into the first target word stock. Wherein, the minimum value of word frequency is 0, and the maximum value is M. For example, when a text in an electronic device is queried in real time, updating word frequency and timestamp information of each target word in a first text generated by the electronic device each time and a target combination word corresponding to each target word, namely adding 1 to the existing target word and the word frequency of the target combination word corresponding to the target word, and setting 0 to the absent target word and the word frequency of the target combination word corresponding to the target word. In addition, the time stamps of both are updated to the current latest time.
By way of example, Table 1 shows the format in which the target words "Wu Yueyue" and "telephone" in the first text "call Wu Yueyue", together with their target combined words, are imported into the first target word stock:
TABLE 1
User word | Word frequency | Timestamp |
Wu Yueyue | 0 | 1671613975908 |
to Wu Yueyue | 0 | 1671613975908 |
Wu Yueyue make | 0 | 1671613975908 |
to Wu Yueyue make | 0 | 1671613975908 |
telephone | 0 | 1671613975908 |
make telephone | 0 | 1671613975908 |
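The update rule for the first target word stock (new entries start at word frequency 0, existing entries are incremented up to the maximum M, and the timestamp is always refreshed) can be sketched as follows. The names `Entry` and `update_stock`, the in-memory dictionary, and the default cap of 1000 standing in for M are hypothetical choices for illustration.

```python
import time
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Entry:
    frequency: int      # number of repeat occurrences, between 0 and M
    timestamp_ms: int   # time of the latest occurrence, in milliseconds


def update_stock(stock: Dict[str, Entry], words: Iterable[str],
                 now_ms: Optional[int] = None, max_frequency: int = 1000) -> None:
    """Import target words and target combined words into the word stock."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    for word in words:
        entry = stock.get(word)
        if entry is None:
            # A word seen for the first time starts at frequency 0.
            stock[word] = Entry(frequency=0, timestamp_ms=now_ms)
        else:
            # A known word is incremented, capped at the assumed maximum M.
            entry.frequency = min(entry.frequency + 1, max_frequency)
            entry.timestamp_ms = now_ms


stock: Dict[str, Entry] = {}
update_stock(stock, ["Wu Yueyue", "to Wu Yueyue"], now_ms=1671613975908)
```

After this first import both entries have word frequency 0 and timestamp 1671613975908, matching the rows of Table 1.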
Similarly, by the method for constructing the first target word stock, the target word stock corresponding to other scenes can be constructed.
The first target word stock includes not only the target words obtained by word segmentation of the first text in the target scene, but also the target combined words obtained by combining each target word with its previous word and/or its next word. The constructed first target word stock is therefore strongly scene-specific, that is, it contains words that reflect the target scene more precisely. As a result, when the initial recognition result of the voice to be recognized in the target scene is corrected based on the first target word stock, the correction is more accurate, and the accuracy of the target recognition result is improved.
In the above embodiment, the initial recognition result may be corrected based on the first target word stock corresponding to the target scene to obtain the target recognition result as follows: performing word segmentation on the initial recognition result to obtain at least one segmented word; for each segmented word, obtaining a first combined word corresponding to the segmented word, where the first combined word includes the segmented word and its previous segmented word and/or its next segmented word; and correcting the segmented word based on the target word, the target combined word, and the first combined word to obtain the target recognition result.
The first target word library comprises target words and target combination words which are determined based on the first text, wherein the target combination words comprise target words and the previous words and/or the next words of the target words.
Specifically, after the initial recognition result is obtained, it can be corrected using the word segmentation information that the initial recognition result itself carries. In some cases, however, the initial recognition result contains no word segmentation information, so word segmentation must be performed on it to obtain each segmented word, and the first combined words corresponding to each segmented word are then obtained by combining the segmented word with its previous and/or next segmented word. For example, segmenting the initial recognition result "call Wu Yueyue" yields the segmented words "to", "Wu Yueyue", "make", and "telephone". The first combined word corresponding to the segmented word "to" is "to Wu Yueyue"; the first combined words corresponding to "Wu Yueyue" are "to Wu Yueyue", "Wu Yueyue make", and "to Wu Yueyue make"; the first combined words corresponding to "make" are "Wu Yueyue make", "make telephone", and "Wu Yueyue make telephone"; and the only first combined word corresponding to "telephone" is "make telephone".
Of course, to improve the accuracy of correcting the initial recognition result, only segmented words longer than one character may be combined and corrected, because a segmented word of length 1 tends to be less scene-specific. In the above example, therefore, only the segmented words "Wu Yueyue" and "telephone" need to be combined and corrected.
On this basis, after the first combined words corresponding to each segmented word are obtained, the target words and target combined words that correspond to a first combined word can be looked up in the first target word stock, and each segmented word in the initial recognition result is corrected based on the word frequency and timestamp of those target words and target combined words, thereby obtaining the target recognition result. The target combined word carries the context of the target word in the first text, and the first combined word carries the context of the segmented word in the initial recognition result. Correcting the segmented word based on the target word, the target combined word, and the first combined word is therefore more targeted, so each segmented word in the initial recognition result is corrected more accurately, which improves the correction accuracy of the initial recognition result and, in turn, the accuracy of the target recognition result.
Next, a specific procedure of correcting the segmentation word based on the target word, the target combination word, and the first combination word will be described in detail.
Specifically, fig. 3 is a second flowchart of a method for correcting errors in a speech recognition result according to an embodiment of the present invention, as shown in fig. 3, where the method includes:
Step 301: Search the first target word stock for first target combined words having the same length as the first combined word.
The first combined word comprises a word segmentation, and a word segmentation before and a word segmentation after the word segmentation; the first target combination word includes a target word, and a preceding word and a following word of the target word.
Specifically, according to the number of characters contained in the first combined word, the first target combined word with the same number of characters can be found out from the first target word stock.
Alternatively, when the first combined word includes two or more languages, the first target combined word may be searched according to the number of characters in each language and the total number of characters.
Step 302: Among the first target combined words, search for second target combined words whose previous word is the same as the previous word in the first combined word and whose next word is the same as the next word in the first combined word.
The second target combined word includes a target word, and the previous word and the next word of the target word.
Specifically, when there is only one first target combined word, it may be used directly as the third target combined word, and step 304 is performed. When there is more than one first target combined word, the first target combined words are searched for second target combined words whose previous word is the same as the previous word of the first combined word and whose next word is the same as the next word of the first combined word.
Optionally, when the number of the first target combined words is less than the preset value, the first target combined words are all used as the second target combined words, and step 303 is performed.
Step 303: When second target combined words are found, search among them for third target combined words whose target word has the same pronunciation as the segmented word of the first combined word.
The third target combined word includes a target word, and the previous word and the next word of the target word.
Specifically, when there is only one second target combined word, it may be used directly as the third target combined word, and step 304 is performed. When there is more than one second target combined word, the second target combined words are searched for third target combined words whose target word has the same pronunciation as the segmented word of the first combined word.
Step 304: When a third target combined word is found, correct the segmented word based on the third target combined word.
Further, the segmented word may be corrected based on the third target combined word as follows: when there is one third target combined word, the segmented word is replaced with the target word in that third target combined word; when there are at least two third target combined words, the segmented word is replaced with the target word of the third target combined word that has the largest word frequency, where the word frequency represents the number of times the third target combined word has occurred in the target scene; and when the word frequencies of the third target combined words are the same, the segmented word is replaced with the target word of the third target combined word that has the largest timestamp, where the timestamp represents the time at which the third target combined word occurred in the target scene.
Specifically, when the third target combined word is obtained and there is only one, the target word in it replaces the segmented word of the first combined word. Taking the initial recognition result "call Wu Yueyue" as an example, when the only third target combined word found is "to Wu Yueyue make", its target word "Wu Yueyue" directly replaces the identically pronounced segmented word "Wu Yueyue" in the first combined word "to Wu Yueyue make", which yields the corrected initial recognition result, that is, the target recognition result.
Further, when there is more than one third target combined word, their word frequencies and/or timestamps need to be compared to determine the optimal third target combined word for replacing the segmented word in the first combined word. Specifically, the word frequencies are compared first, and the third target combined word with the largest word frequency is taken as the optimal one. If the word frequencies are the same, the timestamps are compared, and the third target combined word with the largest timestamp is taken as the optimal one. The target word of the optimal third target combined word then replaces the segmented word in the first combined word, yielding the corrected initial recognition result, that is, the target recognition result.
Specifically, taking the segmented word "Wu Yueyue" in the first combined word "to Wu Yueyue make" as an example, assume that two third target combined words are found, both pronounced "to Wu Yueyue make" but written differently. The word frequency and timestamp of the first are "2" and "2022-12-01-10-36", respectively; those of the second are "2" and "2018-03-15-22-11", respectively. Because the two word frequencies are the same, the timestamps must be compared. The timestamp "2022-12-01-10-36" of the first is clearly greater than the timestamp "2018-03-15-22-11" of the second, that is, the first is the more recent third target combined word and has the greater reference value. The first is therefore taken as the optimal third target combined word, and its target word "Wu Yueyue" replaces the segmented word "Wu Yueyue" in the first combined word, yielding the target recognition result "call Wu Yueyue".
In this embodiment, the segmented word in the initial recognition result is corrected as follows: when there is one third target combined word, its target word replaces the segmented word; when there are at least two, the target word of the third target combined word with the largest word frequency replaces the segmented word; and when the word frequencies are the same, the target word of the third target combined word with the largest timestamp replaces the segmented word. The third target combined word selected in this way matches the first combined word more closely, because it is the one that occurs most often and/or most recently in the target scene. The segmented word is therefore corrected more accurately, and the accuracy of the target recognition result is higher.
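The selection rule of step 304 (largest word frequency first, largest timestamp as the tie-breaker) can be sketched as follows. The names `Candidate` and `pick_best` and the labels "variant A" and "variant B", which stand in for two differently written homophones, are hypothetical. The sketch assumes zero-padded timestamp strings of the form used in the example above, which compare correctly as plain strings.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    target_word: str   # word that would replace the segmented word
    frequency: int     # occurrences in the target scene
    timestamp: str     # zero-padded "YYYY-MM-DD-hh-mm" string


def pick_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Choose by largest word frequency, then by largest (latest) timestamp."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.frequency, c.timestamp))


best = pick_best([
    Candidate("Wu Yueyue (variant A)", 2, "2022-12-01-10-36"),
    Candidate("Wu Yueyue (variant B)", 2, "2018-03-15-22-11"),
])
```

With the two candidates of the example, the word frequencies tie at 2, so the later timestamp decides and variant A is selected.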
The following description takes one segmented word of the initial recognition result "call Wu Yueyue", namely "Wu Yueyue", as an example.
The first combined word "to Wu Yueyue make" corresponding to the segmented word "Wu Yueyue" may be obtained by combining "Wu Yueyue" with its previous and next segmented words in the initial recognition result "call Wu Yueyue".
Further, after the first combined word "to Wu Yueyue make" corresponding to the segmented word "Wu Yueyue" is obtained, the segmented word "Wu Yueyue" may be corrected as follows:
First, first target combined words having the same length as the first combined word "to Wu Yueyue make" are searched for in the first target word stock, for example: "to Wu Yueyue make", "make Wu Yueyue money", "to Li Yueyue make", and so on.
Then, among the first target combined words, second target combined words are found whose previous word is the same as the previous word "to" in the first combined word "to Wu Yueyue make" and whose next word is the same as the next word "make" in the first combined word. For example, the first target combined words "to Wu Yueyue make" and "to Li Yueyue make" both include "to" and "make".
Finally, among the second target combined words, third target combined words are found whose target word has the same pronunciation as the segmented word "Wu Yueyue" in the first combined word "to Wu Yueyue make". For example, the target word "Wu Yueyue" in the second target combined word "to Wu Yueyue make" has the same pronunciation as the segmented word "Wu Yueyue". Thus, when the third target combined word "to Wu Yueyue make" is found, its target word "Wu Yueyue" is taken as the correct form of the segmented word "Wu Yueyue" in the first combined word and replaces it, that is, the segmented word "Wu Yueyue" is corrected.
In this embodiment, the first target word stock is first searched for first target combined words, which match the first combined word only in length and therefore form the widest candidate range. The range is then narrowed to the second target combined words, which also match in the previous and next words, and finally to the third target combined words, which also match in pronunciation. In this way the target combined word that best matches the segmented word in the first combined word is found quickly, the segmented word is corrected more accurately based on it, and the accuracy of the target recognition result is higher. In addition, this way of searching for combined words takes into account the semantic information contained in the initial recognition result, which further improves the accuracy of error correction.
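The three-stage narrowing of steps 301 to 303 can be sketched as follows. A combined word is modelled as a (previous word, target word, next word) tuple, and pronunciation is supplied by a caller-provided `reading` function. The placeholder words such as `WuYY_a` and `WuYY_b` (two differently written words with the same pronunciation) and the `READINGS` table are hypothetical. The shortcuts described above for the case of a single candidate are omitted for brevity.

```python
from typing import Callable, Iterable, List, Tuple

Combined = Tuple[str, str, str]  # (previous word, target word, next word)


def total_length(combined: Combined) -> int:
    """Number of characters in the combined word."""
    return sum(len(part) for part in combined)


def find_third_candidates(first: Combined, stock: Iterable[Combined],
                          reading: Callable[[str], str]) -> List[Combined]:
    # Step 301: keep stock entries with the same length as the first combined word.
    same_length = [c for c in stock if total_length(c) == total_length(first)]
    # Step 302: keep entries whose previous and next words are identical.
    same_neighbours = [c for c in same_length
                       if c[0] == first[0] and c[2] == first[2]]
    # Step 303: keep entries whose target word is pronounced like the segmented word.
    return [c for c in same_neighbours if reading(c[1]) == reading(first[1])]


# Hypothetical pronunciation table for the placeholder words.
READINGS = {"WuYY_a": "wu yue yue", "WuYY_b": "wu yue yue", "LiYY_a": "li yue yue"}


def reading(word: str) -> str:
    return READINGS.get(word, word)


stock = [
    ("to", "WuYY_b", "make"),       # survives all three stages
    ("to", "LiYY_a", "make"),       # dropped in step 303 (different pronunciation)
    ("by", "WuYY_b", "make"),       # dropped in step 302 (different previous word)
    ("to", "WuYY_b", "telephone"),  # dropped in step 301 (different length)
]
third = find_third_candidates(("to", "WuYY_a", "make"), stock, reading)
```

Each stage filters the output of the previous one, so the candidate set can only shrink from step to step.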
In a possible implementation, on the basis of the foregoing embodiment, when error correction of the segmented word corresponding to the first combined word fails, the following error correction manner may be adopted: searching the first target word stock for first target words having the same length and the same pronunciation as the segmented word; if one first target word exists, replacing the segmented word with it; if at least two first target words exist, replacing the segmented word with the first target word that has the largest word frequency, where the word frequency represents the number of times the first target word has occurred in the target scene; and when the word frequencies of the first target words are the same, replacing the segmented word with the first target word that has the largest timestamp, where the timestamp represents the time at which the first target word occurred in the target scene.
The above process can be understood as follows: after the search based on the first combined word of the initial recognition result fails, the search is performed on the segmented word alone, that is, the qualifiers of the segmented word are removed and the search continues over a wider range. Here, the qualifiers are the previous word and the next word of the segmented word.
Specifically, when only one first target word with the same length and pronunciation as the segmented word is found, it directly replaces the segmented word. When several first target words are found, their word frequencies are compared first, and the first target word with the largest word frequency replaces the segmented word; if the word frequencies are the same, their timestamps are compared, and the first target word with the largest timestamp replaces the segmented word.
For example, taking the segmented word "Wu Yueyue" in the first combined word "call Wu Yueyue" as an example, assume that three first target words are found, all pronounced "Wu Yueyue" but written differently. The word frequency and timestamp of the first are "2" and "2022-12-01-10-36", respectively; those of the second are "2" and "2018-03-15-22-11", respectively; and those of the third are "0" and "2011-05-26-17-49", respectively.
Because several first target words are found, their word frequencies are compared first. The first and second share the largest word frequency, 2, so their timestamps must be compared further. The timestamp "2022-12-01-10-36" of the first is the largest, so the first replaces the segmented word "Wu Yueyue" in the first combined word, yielding the target recognition result "call Wu Yueyue".
In this embodiment, when error correction of the segmented word corresponding to the first combined word fails, the first target word stock is searched for first target words having the same length and pronunciation as the segmented word. When one first target word exists, it replaces the segmented word; when at least two exist, the one with the largest word frequency replaces the segmented word; and when the word frequencies are the same, the one with the largest timestamp replaces the segmented word. The first target word selected in this way matches the segmented word more closely, because it is the one that occurs most often and/or most recently in the target scene, so the target recognition result obtained by correcting the initial recognition result of the voice to be recognized based on it is more accurate.
Further, in the case that the first target word is not present in the first target word stock, the first target word can be searched in a second target word stock, where the second target word stock is determined based on at least one second text in the general scene.
The second text is text data acquired under the general scene. For example: the second text may be address book information stored in the user equipment, or user word information such as pinyin user words stored in the voice input method, and the source of the second text is not limited at all.
Specifically, the general scene may be understood as a scene that cannot be identified as a particular target scene. For example, when a call is being made and it cannot be accurately determined whether the user is chatting casually or working in an office, the scene is called a general scene, and the scene corresponding to the acquired voice to be recognized is accordingly treated as a general scene. In addition, the construction of the second target word stock and the search method based on it are the same as those of the first target word stock in the foregoing embodiment, and are not described in detail here.
Further, under the condition of permission of a user, the acquired second text can be split and combined and then imported into a second target word stock for error correction of word segmentation in the initial recognition result. Illustratively, the imported format of the second text is shown in table 2 below:
TABLE 2
User words |
Wu Yueyue |
To Wu Yueyue |
In this embodiment, when the first target word does not exist in the first target word stock, it is searched for in the second target word stock, and the initial recognition result of the voice to be recognized is corrected accordingly. Because the second target word stock is determined based on the second text in the general scene, it contains words that reflect the general scene more precisely. In this way, when correction of the initial recognition result based on the first target word stock of the target scene fails, the second target word stock of the general scene serves as a further correction source, so a more accurate first target word can be obtained, which improves the correction accuracy of the initial recognition result, that is, the accuracy of the target recognition result.
In the foregoing embodiment, when error correction of the segmented word based on the target word, the target combined word, and the first combined word fails, one possible implementation is to correct the segmented word based on the target word, the target combined word, and the first combined word of a target word stock in which each combined word consists of the target word and the previous word of the target word. Here, the target word stock is the second target word stock corresponding to the general scene.
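The fallback described above, in which the segmented word alone is looked up first in the first target word stock and then in the second target word stock, can be sketched as follows. The names `Entry`, `find_homophones`, and `correct_word`, the placeholder words, and the assumption that entries of the second stock also carry a word frequency and timestamp are hypothetical.

```python
from typing import Callable, List, NamedTuple, Sequence


class Entry(NamedTuple):
    word: str
    frequency: int
    timestamp: str  # zero-padded "YYYY-MM-DD-hh-mm", compared as a string


def find_homophones(word: str, stock: Sequence[Entry],
                    reading: Callable[[str], str]) -> List[Entry]:
    """Entries with the same length and the same pronunciation as `word`."""
    return [e for e in stock
            if len(e.word) == len(word) and reading(e.word) == reading(word)]


def correct_word(word: str, first_stock: Sequence[Entry],
                 second_stock: Sequence[Entry],
                 reading: Callable[[str], str]) -> str:
    # Try the scene-specific stock first, then the general-scene stock.
    for stock in (first_stock, second_stock):
        matches = find_homophones(word, stock, reading)
        if matches:
            best = max(matches, key=lambda e: (e.frequency, e.timestamp))
            return best.word
    return word  # no candidate found: leave the segmented word unchanged


# Hypothetical placeholder words that all share one pronunciation.
READINGS = {w: "wu yue yue" for w in ("WuYY_x", "WuYY_a", "WuYY_b", "WuYY_c")}


def reading(word: str) -> str:
    return READINGS.get(word, word)


first_stock = [
    Entry("WuYY_a", 2, "2022-12-01-10-36"),
    Entry("WuYY_b", 2, "2018-03-15-22-11"),
    Entry("WuYY_c", 0, "2011-05-26-17-49"),
]
corrected = correct_word("WuYY_x", first_stock, [], reading)
```

With the three entries of the earlier example in the first stock, the two entries with word frequency 2 tie, and the later timestamp selects `WuYY_a`.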
Specifically, the specific implementation manner of the above steps is the same as that in the foregoing embodiment, and will not be described in detail herein.
In this embodiment, when error correction based on the target word, the target combined word, and the first combined word fails, that is, when combination error correction based on the first target word stock of the target scene fails, combination error correction based on the second target word stock of the general scene is adopted. Because the second target word stock contains words that reflect the general scene more precisely, using it with combined words formed from the target word and its previous word yields a more accurate correction of the segmented word, which improves the correction accuracy of the initial recognition result and makes the target recognition result more accurate.
Further, when the above error correction fails, one possible implementation is to correct the segmented word based on the target word, the target combined word, and the first combined word of a target word stock in which each combined word consists of the target word and the next word of the target word. Here, the target word stock is the second target word stock corresponding to the general scene.
In this embodiment, when combination error correction based on the target word and its previous word fails, combination error correction is performed based on the target word and its next word. Thus, when one combination fails, another combination can be used, which is equivalent to obtaining words that reflect the general scene more precisely. When the segmented word is corrected based on the target word, the target combined word, and the first combined word, the correction accuracy of the initial recognition result is therefore higher, and the accuracy of the target recognition result is further improved.
Still further, when the above error correction still fails, one possible implementation is to correct the segmented word based on the target word, the target combined word, and the first combined word of a target word stock in which each combined word consists of the target word, its previous word, and its next word. Here, the target word stock is the second target word stock corresponding to the general scene.
Specifically, in this embodiment, when combination error correction based on the target word and its next word also fails, combination error correction is performed based on the target word, its previous word, and its next word. Thus, when one combination fails, another combination can be used, which is essentially equivalent to obtaining words that reflect the general scene more precisely. Once a more accurate target combined word is obtained through this combination, correcting the segmented word based on the target word, the target combined word, and the first combined word yields a higher correction accuracy for the initial recognition result and further improves the accuracy of the target recognition result.
It should be understood that under different error correction conditions, the above three error correction modes may be freely combined to achieve the purpose of fast error correction of the initial recognition result of the speech to be recognized, which is not particularly limited.
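One possible ordering of the three combination modes can be sketched as follows: "previous word + target word" is tried first, then "target word + next word", then "previous word + target word + next word", and the first successful correction is returned. The names `combination_modes` and `correct_with_modes`, and the caller-supplied `lookup` function standing in for the search of the second target word stock, are hypothetical. The order shown is only one of the combinations permitted above.

```python
from typing import Callable, List, Optional, Tuple

# (previous word or None, segmented word, next word or None)
Mode = Tuple[Optional[str], str, Optional[str]]


def combination_modes(prev_word: Optional[str], word: str,
                      next_word: Optional[str]) -> List[Mode]:
    """List the combinations to try, in the order described in the text."""
    modes: List[Mode] = []
    if prev_word is not None:
        modes.append((prev_word, word, None))
    if next_word is not None:
        modes.append((None, word, next_word))
    if prev_word is not None and next_word is not None:
        modes.append((prev_word, word, next_word))
    return modes


def correct_with_modes(prev_word: Optional[str], word: str, next_word: Optional[str],
                       lookup: Callable[[Mode], Optional[str]]) -> Optional[str]:
    """Return the first replacement found, or None when every mode fails."""
    for mode in combination_modes(prev_word, word, next_word):
        replacement = lookup(mode)
        if replacement is not None:
            return replacement
    return None


# Hypothetical lookup table standing in for the second target word stock.
table = {(None, "WuYY_x", "make"): "WuYY_a"}
fixed = correct_with_modes("to", "WuYY_x", "make", table.get)
```

Here the "previous word + target word" lookup finds nothing, so the "target word + next word" lookup supplies the replacement.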
Further, in any of the above embodiments, in the case that the first target word does not exist in the second target word stock, a fourth target word that matches the word segmentation ambiguous tone of the first target word and the target word in the second target word may be searched; under the condition that the fourth target combination words are found and the number of the fourth target combination words is one, replacing word segmentation by the target words in the fourth target combination words; and under the condition that fourth target combined words are found and the number of the fourth target combined words is at least two, replacing the word segmentation by the target words of the fourth target combined words with word frequencies larger than a preset value and time stamps smaller than a preset time in the at least two fourth target combined words.
Specifically, when the first target word is not found in the second target word stock corresponding to the general scene, the word segmentation pronunciation of the target word and the word segmentation pronunciation of the first combined word are considered to be possibly different, that is, the target word and the word segmentation pronunciation of the first combined word are likely to be fuzzy sounds. Therefore, in the second target combined word, the previous word is the same as the previous word in the first combined word, and the next word is the same as the next word in the first combined word, which is searched from the first target word stock, a fourth target combined word matched with the word segmentation ambiguous sound of the first combined word is searched, and the word segmentation of the first combined word is replaced by the target word in the fourth target combined word under the condition that the number of the fourth target combined word is found to be one. Further, if the number of the searched fourth target combination words is more than one, the target words of the fourth target combination words with word frequency larger than a preset value and time stamp smaller than a preset time are adopted to replace the word segmentation, so that error correction of the word segmentation is completed.
Taking the first combined word "pay Wang Huahua" as an example, when the fourth target combined word found to be a fuzzy-sound match for the first combined word is "pay Wang Fafa", the "Huahua" in the first combined word is replaced with the "Fafa" in the fourth target combined word, completing the error correction of the word segment. On this basis, when the fourth target combined words also include other entries, such as "pay Wang Haha", the more accurate fourth target combined word is selected by looking for the one whose word frequency is greater than a threshold M and whose timestamp is less than T. In short, when only one fourth target combined word is found, it is used for replacement directly as described above; otherwise, the best fourth target combined word is first determined by the criterion that the word frequency is greater than a preset value and the timestamp is less than a preset time, and the word segment is then replaced in the same way to obtain the final result for that word segment. Similarly, each word segment in the initial recognition result is corrected in this way to obtain the target recognition result.
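The fuzzy-sound selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `FUZZY_PAIRS` table, the `Candidate` type, the function names, and the romanized syllables are all hypothetical, and the behaviour when several candidates pass the thresholds (highest word frequency wins) or none does (no replacement) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical table of syllable pairs treated as fuzzy sounds of each other.
FUZZY_PAIRS = {frozenset(("hua", "fa")), frozenset(("hua", "ha"))}


@dataclass(frozen=True)
class Candidate:
    """Target word taken from a fourth target combined word (illustrative)."""
    target_word: str
    syllables: Tuple[str, ...]  # pronunciation of target_word
    freq: int                   # word frequency in the word stock
    timestamp: float            # occurrence time recorded for the entry


def is_fuzzy_match(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if a and b differ only in syllables listed in FUZZY_PAIRS."""
    if len(a) != len(b) or tuple(a) == tuple(b):
        return False
    return all(x == y or frozenset((x, y)) in FUZZY_PAIRS for x, y in zip(a, b))


def pick_fuzzy_replacement(segment_syllables: Sequence[str],
                           candidates: Sequence[Candidate],
                           min_freq: int,
                           max_time: float) -> Optional[str]:
    """Return the replacement word for the segment, or None if there is none."""
    matches = [c for c in candidates
               if is_fuzzy_match(segment_syllables, c.syllables)]
    if not matches:
        return None
    if len(matches) == 1:
        # Exactly one fuzzy match: replace directly.
        return matches[0].target_word
    # At least two matches: keep freq > preset value and timestamp < preset time.
    kept = [c for c in matches if c.freq > min_freq and c.timestamp < max_time]
    if not kept:
        return None  # assumption: leave the segment unchanged
    # Assumption: if several entries remain, prefer the most frequent one.
    return max(kept, key=lambda c: c.freq).target_word
```

With candidates "Wang Fafa" (frequency 5) and "Wang Haha" (frequency 1) and a frequency threshold of 2, the sketch selects "Wang Fafa" for the segment pronounced "wang hua hua".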
In this embodiment, in the case that the first target word does not exist in the second target word stock, a fourth target combined word whose target word is a fuzzy-sound match for the word segment of the first combined word is searched for among the second target combined words. When a fourth target combined word is found and there is only one, the word segment is replaced with its target word; when at least two are found, the word segment is replaced with the target word of the fourth target combined word whose word frequency is greater than a preset value and whose timestamp is less than a preset time. This enlarges the error correction range for word segments in the initial recognition result: the search for a replacement target word is performed not only among words with the same pronunciation but also among words that are fuzzy-sound matches. The accuracy of error correction of the word segments in the initial recognition result is therefore further improved. In addition, when more than one fourth target combined word is found, selecting the one whose word frequency is greater than a preset value and whose timestamp is less than a preset time improves the validity of the chosen entry, that is, a more accurate or more relevant fourth target combined word is selected. The word segments in the initial recognition result can thus be corrected more accurately, and the accuracy of the target recognition result is higher.
Further, in the case that the target word in the target recognition result is derived from the first target word stock and the user modifies the target recognition result, it can be known that the word frequency related to the target word in the first target word stock is not accurate enough and needs to be modified, so, in order to improve the error correction accuracy of the initial recognition result, the word frequency in the first target word stock can be modified by: under the condition that a modification instruction for modifying the target word in the target recognition result is received, determining whether the pronunciation or fuzzy sound of the modified word is the same as that of the target word; and under the condition that the modified word is the same as the pronunciation or the fuzzy sound of the target word, reducing the word frequency of a second target word which is the same as the target word in the first target word bank and the word frequency of a fifth target combined word comprising the second target word.
Specifically, after a modification instruction input by a user for modifying a target word in the target recognition result is received, it is judged whether the modified word has the same pronunciation as, or is a fuzzy sound of, the target word; if so, the subsequent word frequency modification operation is carried out; otherwise, no processing is required. In other words, when the word frequency of the second target word identical to the target word in the first target word stock and the word frequency of the fifth target combined word including the second target word are inaccurate, the word frequencies of the second target word and the fifth target combined word are modified.
The following description takes as an example the case in which the target recognition result is "call Wu Yueyue" and the electronic device detects that the user changes it to the homophonous "call Wu Yueyue" (the two names are pronounced the same but written differently). That is, the user modifies the target word "Wu Yueyue" to the homophone "Wu Yueyue".
Specifically, the modified word "Wu Yueyue" has the same pronunciation as the target word "Wu Yueyue". Thus, the word frequency of the second target word "Wu Yueyue", which is the same as the target word "Wu Yueyue" in the first target word stock, and the word frequencies of the fifth target combined words that include the second target word "Wu Yueyue", such as "given Wu Yueyue", "Wu Yueyue beat", and "given Wu Yueyue beat", need to be modified, for example by halving them.
In one possible manner, the word frequencies of the second target word and of the fifth target combined word including the second target word may be modified as follows: of the two word frequencies, one that is greater than the average word frequency of the first target word stock is set to that average word frequency, and one that is less than the average word frequency of the first target word stock is reduced to half of its original value. In particular, when the second target word or the fifth target combined word including the second target word has a word frequency of 1, its word frequency is reduced to 0 and the word is also deleted from the first target word stock.
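The frequency reduction rule above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the word stock is modelled as a plain dictionary from entry to word frequency, the function name `reduce_frequencies` is hypothetical, the average is computed once before any entry is changed and rounded down, and a frequency exactly equal to the average is halved (the text leaves these points open).

```python
from typing import Dict, Iterable


def reduce_frequencies(word_stock: Dict[str, int], entries: Iterable[str]) -> None:
    """Lower the word frequency of the given entries in place.

    entries holds the second target word and the fifth target combined
    words that include it.
    """
    if not word_stock:
        return
    # Assumption: average over the whole word stock, taken before any update.
    average = sum(word_stock.values()) / len(word_stock)
    for entry in entries:
        freq = word_stock.get(entry)
        if freq is None:
            continue  # entry is not in this word stock
        if freq > average:
            new_freq = int(average)   # clamp to the (floored) average
        else:
            new_freq = freq // 2      # halve; a frequency of 1 becomes 0
        if new_freq <= 0:
            del word_stock[entry]     # frequency 0: remove from the word stock
        else:
            word_stock[entry] = new_freq
```

For a word stock `{"A": 10, "B": 1, "C": 2, "D": 3}` the average is 4, so reducing "A", "B" and "C" leaves `{"A": 4, "C": 1, "D": 3}`: "A" is clamped to the average, "B" is deleted, and "C" is halved.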
Further, when the target word in the target recognition result is derived from the second target word stock, the word frequency of the third target word identical to the target word in the second target word stock and the word frequency of the sixth target combined word including the third target word need to be modified.
Specifically, the specific implementation manner of word frequency modification in this embodiment is the same as that in the foregoing embodiment, and will not be described here again.
In this embodiment, when a modification instruction input by the user for modifying the target word in the target recognition result is received, it is determined whether the modified word has the same pronunciation as, or is a fuzzy sound of, the target word. If so, the word frequency of the second target word identical to the target word in the first target word stock and the word frequency of the fifth target combined word including the second target word are reduced. These word frequencies thereby become more accurate, which improves the error correction accuracy for subsequent initial recognition results in cases where the target word was inaccurate because of an incorrect word frequency of the second target word or of the fifth target combined word including it.
In addition to the above-described embodiment, when correcting the initial recognition result of the speech to be recognized, correction may be performed according to different correction manners in the above-described embodiment, such as combination word correction, word segmentation correction, and fuzzy tone correction, and different correction sources, such as first target word stock correction, and second target word stock correction, which are arranged in any combination. The error correction sequence and mode are not particularly limited herein.
The specific implementation manner of the combined error correction method may refer to the description of the corresponding steps in the foregoing embodiments, which is not repeated here.
The following describes an error correction device for a speech recognition result according to an embodiment of the present invention, and the error correction device for a speech recognition result described below and the error correction method for a speech recognition result described above may be referred to correspondingly with each other.
Fig. 4 is a schematic structural diagram of the error correction device for a speech recognition result according to the embodiment of the present invention. As shown in Fig. 4, the device includes:
an obtaining module 401, configured to obtain an initial recognition result corresponding to the voice to be recognized;
a determining module 402, configured to determine a target scene corresponding to a voice to be recognized;
the error correction module 403 is configured to correct the initial recognition result based on the first target word stock corresponding to the target scene, so as to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
According to the error correction device for the voice recognition result provided by the embodiment of the invention, after the initial recognition result corresponding to the voice to be recognized, which is obtained through the obtaining module 401, and the target scene corresponding to the voice to be recognized, which is determined through the determining module 402, the initial recognition result is corrected through the error correction module 403 according to the first target word stock corresponding to the target scene, so as to obtain the target recognition result, wherein the first target word stock is determined based on at least one first text in the target scene. Since different speech recognition results may appear for the same speech to be recognized under different scenes, the initial recognition result may be corrected by determining the target scene corresponding to the speech to be recognized and based on the first target word stock corresponding to the pre-constructed target scene, where the first target word stock is determined based on at least one first text under the target scene, so that the first target word stock has stronger scene pertinence, that is, the first target word stock includes a plurality of vocabularies under the more accurate target scene. Therefore, when the initial recognition result of the voice to be recognized in the target scene is corrected based on the first target word stock, the correction accuracy of the initial recognition result is higher, and the accuracy of the target recognition result can be further improved.
Optionally, the first target word library includes a target word and a target combined word determined based on the first text, where the target combined word includes the target word and a word preceding and/or a word following the target word;
the error correction module 403 is specifically configured to:
correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result, wherein the method comprises the following steps:
carrying out word splitting treatment on the initial recognition result to obtain at least one word;
aiming at each word segment, a first combined word corresponding to the word segment is obtained, wherein the first combined word comprises the word segment, and the previous word segment and/or the next word segment of the word segment;
and correcting the segmentation word based on the target word, the target combination word and the first combination word to obtain a target recognition result.
Optionally, the first combined word includes a word segment, and a word segment before and a word segment after the word segment, and the target combined word includes a target word, and a word segment before and a word segment after the target word;
the error correction module 403 is specifically configured to:
searching a first target combined word with the same length as the first combined word in a first target word library;
searching, from the first target combined word, for a second target combined word whose previous word is the same as the previous word segment in the first combined word and whose next word is the same as the next word segment in the first combined word;
under the condition that the second target combined word is found, searching a third target combined word with the same pronunciation as the word segmentation of the first combined word in the second target combined word;
and correcting the segmentation word based on the third target combination word under the condition that the third target combination word is found.
Optionally, the error correction module 403 is specifically configured to:
under the condition that the number of the third target combination words is one, replacing the word segmentation by the target words in the third target combination words;
under the condition that the number of the third target combination words is at least two, replacing word segmentation by target words of the third target combination words with the maximum word frequency in the at least two third target combination words, wherein the word frequency is used for representing the occurrence times of the third target combination words in a target scene;
and under the condition that the word frequencies corresponding to the third target combination words are the same, replacing the word segmentation by the target word of the third target combination word with the largest timestamp in at least two third target combination words, wherein the timestamp is used for representing the occurrence time of the third target combination word under the target scene.
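The combined-word lookup and tie-breaking handled by the error correction module 403 can be sketched as follows. This is an illustrative sketch only: the `CombinedWord` type, the `correct_segment` function, and the pronunciation keys are hypothetical, English homophones stand in for same-pronunciation words, and the length check is assumed to be satisfied because every entry is a three-part combined word.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CombinedWord:
    """Three-part entry of the first target word stock (illustrative)."""
    prev: str         # word before the target word
    target: str       # the target word itself
    next: str         # word after the target word
    pron: str         # pronunciation key of the target word
    freq: int         # number of occurrences in the target scene
    timestamp: float  # occurrence time in the target scene


def correct_segment(prev: str, nxt: str, segment_pron: str,
                    word_stock: Sequence[CombinedWord]) -> Optional[str]:
    """Return the replacement for a word segment, or None if correction fails."""
    # Second target combined words: same previous and next word.
    second = [c for c in word_stock if c.prev == prev and c.next == nxt]
    # Third target combined words: target word pronounced like the segment.
    third = [c for c in second if c.pron == segment_pron]
    if not third:
        return None
    # Largest word frequency wins; equal frequencies fall back to the
    # largest timestamp. A single candidate is returned unchanged.
    best = max(third, key=lambda c: (c.freq, c.timestamp))
    return best.target
```

Using the tuple `(freq, timestamp)` as the sort key covers all three cases in one expression: one candidate, several candidates with different word frequencies, and several candidates with equal word frequencies.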
Optionally, the error correction module 403 further includes:
the searching unit is used for searching a first target word with the same length and the same pronunciation as the word segmentation in the first target word bank under the condition that the error correction of the word segmentation corresponding to the first combined word fails;
the replacing unit is used for replacing the word segmentation by the first target word under the condition that one first target word exists, and replacing the word segmentation by the first target word with the largest word frequency in the at least two first target words under the condition that at least two first target words exist, wherein the word frequency is used for representing the occurrence times of the first target words under the target scene.
The replacing unit is further configured to, in the case that the word frequencies corresponding to the first target words are the same, replace the word segment with the first target word having the largest timestamp among the at least two first target words, where the timestamp is used to represent the occurrence time of the first target word in the target scene.
Optionally, the searching unit is further configured to search, in a case where the first target word does not exist in the first target word bank, the first target word in a second target word bank, where the second target word bank is determined based on at least one second text in the general scene.
Optionally, the searching unit is further configured to, in the case that the first target word does not exist in the second target word bank, search among the second target combined words for a fourth target combined word whose target word is a fuzzy-sound match for the word segment of the first combined word;
the replacing unit is further configured to replace the word segment with a target word of the fourth target combination word when the fourth target combination word is found and the number of the fourth target combination words is one, and replace the word segment with a target word of the fourth target combination word with a word frequency greater than a preset value and a time stamp less than a preset time in the at least two fourth target combination words when the fourth target combination word is found and the number of the fourth target combination words is at least two.
Optionally, the obtaining module 401 is further configured to obtain at least one first text in the target scene;
wherein the device further includes:
the word splitting module is used for carrying out word splitting processing on each first text to obtain at least two target words;
the combination module is used for combining the target word and the previous word and/or the next word of the target word aiming at each target word to obtain a target combination word;
The determining module 402 is further configured to determine the first target word stock based on the target word and the target combined word.
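The word stock construction performed by these modules can be sketched as follows. This is a simplified sketch under stated assumptions: whitespace splitting stands in for a real word segmenter, only the three-part variant (previous word, target word, next word) of the combined word is built, timestamps are omitted, and the function name `build_word_stock` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def build_word_stock(texts: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count target words and three-part target combined words in the texts."""
    word_freq: Counter = Counter()      # target word -> word frequency
    combined_freq: Counter = Counter()  # (prev, target, next) -> word frequency
    for text in texts:
        # Assumption: whitespace splitting replaces the word splitting module.
        words = text.split()
        word_freq.update(words)
        # Combine each inner word with its previous and next word.
        for i in range(1, len(words) - 1):
            combined_freq[(words[i - 1], words[i], words[i + 1])] += 1
    return word_freq, combined_freq
```

For the texts "call Wang Fafa now" and "call Wang Fafa", the combined word ("call", "Wang", "Fafa") is counted twice and ("Wang", "Fafa", "now") once.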
Optionally, the determining module 402 is further configured to determine, when a modification instruction input by the user for modifying the target word in the target recognition result is received, whether the modified word has the same pronunciation as, or is a fuzzy sound of, the target word;
wherein the device further includes:
a reducing module, configured to reduce the word frequency of a second target word identical to the target word in the first target word bank and the word frequency of a fifth target combined word comprising the second target word, in the case that the modified word has the same pronunciation as, or is a fuzzy sound of, the target word.
The apparatus of this embodiment may be used to execute the method of any of the foregoing embodiments of the error correction method for a speech recognition result. Its specific implementation process and technical effects are similar to those of the method embodiments; for details, refer to the detailed description of the method embodiments, which is not repeated here.
Fig. 5 illustrates a first schematic diagram of the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 501, a communication interface 502, a memory 503, and a communication bus 504, wherein the processor 501, the communication interface 502, and the memory 503 communicate with each other via the communication bus 504. The processor 501 may invoke logic instructions in the memory 503 to perform the error correction method for a speech recognition result provided by the methods described above, the method including: acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target word stock is determined based on at least one first text in the target scene.
Further, the logic instructions in the memory 503 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Fig. 6 illustrates a second physical schematic diagram of an electronic device, as shown in fig. 6, where the electronic device may include: processor 601, communication interface (Communications Interface) 602, memory 603 and communication bus 604, and further comprises microphone 605, wherein processor 601, communication interface 602, memory 603, microphone 605 complete communication with each other via communication bus 604. The microphone 605 is used for collecting the voice to be recognized, and the processor 601 may call the logic instructions in the memory 603 to execute the error correction method of the voice recognition result provided by the above methods, including: acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
Further, the logic instructions in the memory 603 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute a method for correcting errors of speech recognition results provided by the above methods, where the method includes: acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of correcting errors in speech recognition results provided by the above methods, comprising: acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (13)
1. An error correction method for a speech recognition result, comprising:
acquiring an initial recognition result corresponding to the voice to be recognized;
determining a target scene corresponding to the voice to be recognized;
correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target thesaurus is determined based on at least one first text in the target scene.
2. The method according to claim 1, wherein the first target word library includes a target word and a target combined word determined based on the first text, the target combined word including the target word and a word preceding and/or following the target word;
the error correction is performed on the initial recognition result based on the first target word stock corresponding to the target scene to obtain a target recognition result, including:
performing word splitting processing on the initial recognition result to obtain at least one word;
for each word segment, a first combination word corresponding to the word segment is obtained, wherein the first combination word comprises the word segment and a previous word segment and/or a next word segment of the word segment;
and correcting the word segmentation based on the target word, the target combined word and the first combined word to obtain the target recognition result.
3. The method according to claim 2, wherein the first combined word includes the word segment and a word segment preceding and a word segment following the word segment, and the target combined word includes the target word and a word segment preceding and a word segment following the target word;
the error correction of the word segment based on the target word, the target combined word and the first combined word includes:
searching a first target combined word with the same length as the first combined word in the first target word bank;
searching a second target combined word, wherein a previous word is identical to the previous word in the first combined word, and a next word is identical to the next word in the first combined word, from the first target combined word;
under the condition that the second target combined word is found, searching a third target combined word with the same pronunciation as the word segmentation of the first combined word in the second target combined word;
and correcting the word segmentation based on the third target combined word under the condition that the third target combined word is found.
4. The method for correcting errors in speech recognition results according to claim 3, wherein the correcting errors in the segmented words based on the third target combined word comprises:
replacing the word segment with a target word in the third target combined word under the condition that the number of the third target combined words is one;
under the condition that the number of the third target combination words is at least two, replacing the word segmentation by the target word of the third target combination word with the largest word frequency in the at least two third target combination words, wherein the word frequency is used for representing the occurrence frequency of the third target combination word in the target scene;
and under the condition that the word frequencies corresponding to the third target combination words are the same, replacing the word segmentation by the target word of the third target combination word with the largest timestamp in at least two third target combination words, wherein the timestamp is used for representing the occurrence time of the third target combination word under the target scene.
5. A method of correcting errors in speech recognition results according to claim 3, characterized in that the method further comprises:
under the condition that error correction of the word segmentation corresponding to the first combined word fails, searching the first target word bank for a first target word which has the same length and the same pronunciation as the word segmentation;
if one first target word exists, replacing the word segmentation by the first target word;
if at least two first target words exist, replacing the word segmentation with a first target word with the largest word frequency in the at least two first target words, wherein the word frequency is used for representing the occurrence frequency of the first target words in the target scene;
and under the condition that the word frequencies corresponding to the first target words are the same, replacing the word segmentation by a first target word with the largest timestamp in at least two first target words, wherein the timestamp is used for representing the appearance time of the first target word under the target scene.
6. The method for correcting errors in a speech recognition result of claim 5, further comprising:
and searching the first target word in a second target word bank under the condition that the first target word is not present in the first target word bank, wherein the second target word bank is determined based on at least one second text in a general scene.
7. The method for correcting errors in a speech recognition result of claim 6, further comprising:
searching, among the second target combined words, for a fourth target combined word whose target word is a fuzzy-sound match for the word segmentation of the first combined word, under the condition that the first target word does not exist in the second target word bank;
under the condition that the fourth target combination words are found and the number of the fourth target combination words is one, replacing the segmentation words by target words in the fourth target combination words;
and under the condition that the fourth target combination words are found and the number of the fourth target combination words is at least two, replacing the word segmentation by target words of the fourth target combination words with word frequency larger than a preset value and time stamp smaller than a preset time in the at least two fourth target combination words.
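The fuzzy-sound match and threshold filter of claim 7 can be illustrated as below. The claims do not list which sounds count as fuzzy, so the retroflex/flat initial pairs (zh/z, ch/c, sh/s) used here are an assumption, as are the `CombinedEntry` record and both function names. The claim also does not say what happens when several candidates pass the thresholds, so this sketch simply returns all of them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CombinedEntry:
    """Hypothetical record for the target word of a combined word."""
    target_word: str
    syllables: Tuple[str, ...]  # toneless pinyin syllables (assumed)
    freq: int
    timestamp: float


# Assumed fuzzy-sound pairs: retroflex initial -> flat initial.
_FUZZY_INITIALS = (("zh", "z"), ("ch", "c"), ("sh", "s"))


def fuzzy_key(syllables: Sequence[str]) -> Tuple[str, ...]:
    """Collapse each syllable onto its flat-initial form."""
    out = []
    for s in syllables:
        for retroflex, flat in _FUZZY_INITIALS:
            if s.startswith(retroflex):
                s = flat + s[len(retroflex):]
                break
        out.append(s)
    return tuple(out)


def select_fuzzy(segment_syllables: Sequence[str], bank: List[CombinedEntry],
                 min_freq: int, max_time: float) -> List[str]:
    """Return replacement candidates for a segment under the claim 7 rules."""
    key = fuzzy_key(segment_syllables)
    matches = [e for e in bank if fuzzy_key(e.syllables) == key]
    if len(matches) == 1:
        # A single match is used directly, without thresholds.
        return [matches[0].target_word]
    # Two or more matches: frequency above the preset value and
    # timestamp below the preset time.
    return [e.target_word for e in matches
            if e.freq > min_freq and e.timestamp < max_time]


bank = [
    CombinedEntry("市场", ("shi", "chang"), freq=10, timestamp=50.0),
    CombinedEntry("四场", ("si", "chang"), freq=1, timestamp=50.0),
    CombinedEntry("师长", ("shi", "zhang"), freq=8, timestamp=50.0),
]
chosen = select_fuzzy(("si", "chang"), bank, min_freq=3, max_time=100.0)
```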
8. The method for error correction of speech recognition results according to any of claims 1-7, further comprising:
acquiring at least one first text in the target scene;
performing word splitting processing on each first text to obtain at least two target words;
combining the target word and the previous word and/or the next word of the target word aiming at each target word to obtain a target combined word;
and determining the first target word stock based on the target word and the target combined word.
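As a rough sketch of the word-stock construction in claim 8, the snippet below counts target words and adjacent-word combinations. Whitespace splitting stands in for the word splitting step purely for illustration (Chinese text would need a real word segmenter), and `build_word_stock` and its return layout are assumed names. Only two-word combinations are built here; a three-word form (previous word, target word, next word) would also be consistent with the "and/or" wording.

```python
from collections import Counter
from typing import Iterable, Tuple


def build_word_stock(texts: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (target word counts, combined word counts) for the scene texts."""
    words: Counter = Counter()
    combined: Counter = Counter()
    for text in texts:
        tokens = text.split()  # stand-in for the word splitting step
        if len(tokens) < 2:
            continue           # the claim expects at least two target words
        words.update(tokens)
        # Each adjacent pair covers both "previous word + target word"
        # and "target word + next word".
        combined.update(zip(tokens, tokens[1:]))
    return words, combined


words, combined = build_word_stock(["open the door", "close the door"])
```

The counts double as the word frequencies that the replacement rules in the earlier claims rely on.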
9. The method for error correction of speech recognition results according to any of claims 4-7, further comprising:
under the condition that a modification instruction input by a user for modifying the target word in the target recognition result is received, determining whether the modified word has the same pronunciation or the same fuzzy sound as the target word;
and under the condition that the modified word has the same pronunciation or the same fuzzy sound as the target word, reducing, in the first target word bank, the word frequency of a second target word identical to the target word and the word frequency of a fifth target combined word comprising the second target word.
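The feedback step of claim 9 could be sketched as follows. The decrement size, the floor at zero, the dictionary layout, and the `same_sound` predicate passed in by the caller are all assumptions; the claim only states that the frequencies are reduced.

```python
from typing import Callable, Dict, Tuple


def penalize_on_user_edit(target_word: str, modified_word: str,
                          word_freq: Dict[str, int],
                          combined_freq: Dict[Tuple[str, ...], int],
                          same_sound: Callable[[str, str], bool],
                          step: int = 1) -> bool:
    """Lower frequencies tied to `target_word` after a same-sound user edit.

    Returns True if the frequencies were reduced.
    """
    if not same_sound(target_word, modified_word):
        return False
    if target_word in word_freq:
        word_freq[target_word] = max(0, word_freq[target_word] - step)
    # Only values change here, so iterating over the keys is safe.
    for combo in combined_freq:
        if target_word in combo:
            combined_freq[combo] = max(0, combined_freq[combo] - step)
    return True


# Toy pronunciation table standing in for a real pronunciation lookup.
pron = {"权利": "quan li", "权力": "quan li", "全部": "quan bu"}


def same_sound(a: str, b: str) -> bool:
    return a in pron and b in pron and pron[a] == pron[b]


word_freq = {"权力": 3}
combined_freq = {("权力", "结构"): 2, ("组织", "结构"): 4}
changed = penalize_on_user_edit("权力", "权利", word_freq, combined_freq,
                                same_sound)
```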
10. An error correction device for a speech recognition result, comprising:
the acquisition module is used for acquiring an initial recognition result corresponding to the voice to be recognized;
the determining module is used for determining a target scene corresponding to the voice to be recognized;
the error correction module is used for correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target word stock is determined based on at least one first text in the target scene.
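The three modules of claim 10 can be pictured as a small pipeline. This is a structural sketch under stated assumptions: the class name, the callable signatures, and the stub functions (a fake recognizer that just decodes bytes, a fixed scene, and a plain string substitution) are hypothetical and stand in for real components.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ErrorCorrectionDevice:
    """Hypothetical wiring of the three claimed modules."""
    acquire: Callable[[bytes], str]               # acquisition module
    determine_scene: Callable[[bytes], str]       # determining module
    correct: Callable[[str, Dict[str, str]], str] # error correction module
    scene_banks: Dict[str, Dict[str, str]]        # scene -> word stock

    def run(self, speech: bytes) -> str:
        initial = self.acquire(speech)            # initial recognition result
        scene = self.determine_scene(speech)      # target scene
        bank = self.scene_banks.get(scene, {})    # first target word stock
        return self.correct(initial, bank)        # target recognition result


def replace_known_errors(text: str, bank: Dict[str, str]) -> str:
    # Stub corrector: substitute each known misrecognition with its fix.
    for wrong, right in bank.items():
        text = text.replace(wrong, right)
    return text


device = ErrorCorrectionDevice(
    acquire=lambda speech: speech.decode("utf-8"),  # stub recognizer
    determine_scene=lambda speech: "meeting",       # stub scene detector
    correct=replace_known_errors,
    scene_banks={"meeting": {"全力": "权力"}},
)
result = device.run("全力结构".encode("utf-8"))
```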
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of error correction of speech recognition results according to any one of claims 1 to 9 when executing the program.
12. An electronic device comprising a microphone, a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the microphone is used for capturing speech to be recognized; the processor is used for acquiring an initial recognition result corresponding to the voice to be recognized; determining a target scene corresponding to the voice to be recognized; correcting the initial recognition result based on a first target word stock corresponding to the target scene to obtain a target recognition result; the first target word stock is determined based on at least one first text in the target scene.
13. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements a method of error correction of speech recognition results according to any of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081941.1A CN116070621A (en) | 2023-01-16 | 2023-01-16 | Error correction method and device for voice recognition result, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081941.1A CN116070621A (en) | 2023-01-16 | 2023-01-16 | Error correction method and device for voice recognition result, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116070621A (en) | 2023-05-05 |
Family
ID=86174591
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310081941.1A Pending CN116070621A (en) | 2023-01-16 | 2023-01-16 | Error correction method and device for voice recognition result, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116070621A (en) |
- 2023-01-16: CN application CN202310081941.1A (publication CN116070621A), status: active, Pending
Similar Documents
Publication | Title | Publication Date |
---|---|---|
KR102222317B1 (en) | Speech recognition method, electronic device, and computer storage medium | |
US12125473B2 (en) | Speech recognition method, apparatus, and device, and storage medium | |
CN108305643B (en) | Method and device for determining emotion information | |
JP6334815B2 (en) | Learning apparatus, method, program, and spoken dialogue system | |
CN110493019B (en) | Automatic generation method, device, equipment and storage medium of conference summary | |
CN106570180B (en) | Voice search method and device based on artificial intelligence | |
CN110276071B (en) | Text matching method and device, computer equipment and storage medium | |
US20140172419A1 (en) | System and method for generating personalized tag recommendations for tagging audio content | |
CN111883137B (en) | Text processing method and device based on voice recognition | |
CN111177359A (en) | Multi-turn dialogue method and device | |
CN111445903B (en) | Enterprise name recognition method and device | |
CN105190614A (en) | Search results using intonation nuances | |
JP2022540784A (en) | Derivation of Multiple Semantic Representations for Utterances in Natural Language Understanding Frameworks | |
CN111161739A (en) | Speech recognition method and related product | |
CN112579733B (en) | Rule matching method, rule matching device, storage medium and electronic equipment | |
CN110008471A (en) | A kind of intelligent semantic matching process based on phonetic conversion | |
CN114550718A (en) | Hot word speech recognition method, device, equipment and computer readable storage medium | |
CN112397053B (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN111128175B (en) | Spoken language dialogue management method and system | |
CN112562659A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN107886940B (en) | Voice translation processing method and device | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN116070621A (en) | Error correction method and device for voice recognition result, electronic equipment and storage medium | |
CN117292688A (en) | Control method based on intelligent voice mouse and intelligent voice mouse | |
CN117332062A (en) | Data processing method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||