CN111144100A - Question text recognition method and device, electronic equipment and storage medium - Google Patents

Question text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN111144100A (granted as CN111144100B)
Application number: CN201911344917.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 赵忠信
Assignee (original and current): Wuba Co Ltd
Legal status: Granted, active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium. A text to be recognized is divided according to a preset segmentation rule to obtain a plurality of first sub-texts. The occurrence probability of each character is then calculated from the context data corresponding to that character in its first sub-text, and a final confusion value is calculated for each first sub-text from the occurrence probabilities of its characters. Finally, the problem text in the text to be recognized is determined by comparing each final confusion value with a preset confusion threshold. Because the method determines the occurrence probability of each character from both its preceding and following context, the confusion value of the text is calculated more accurately, which in turn improves the accuracy of problem text recognition.

Description

Question text recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a problem text recognition method and apparatus, an electronic device, and a storage medium.
Background
Text errors arise from many factors, such as human input mistakes, data system faults, and network instability; for example, a text may contain wrongly written characters, missing characters, extra characters, or garbled codes. Such errors reduce text quality and can convey ambiguous or even wrong information to the user. The problem text therefore needs to be located accurately so that the person who entered the text can be prompted to correct it.
Generally, problem text can be determined by collecting a large amount of low-quality text data, summarizing its regularities at the character level, and organizing them into unified language rules that represent text data sharing a specific pattern; the text to be recognized is then checked against these unified language rules, and text that violates them is flagged as problem text. Alternatively, a traditional language model can be used to predict the occurrence probability of each character in the text to be recognized, from which the confusion value of each sentence is calculated, and the problem text is determined according to that confusion value.
However, the unified-language-rule method is highly limited: it can only verify text to be recognized that matches a specific pattern. The traditional-language-model method, because of the way such models are built, can calculate the occurrence probability of each character only from the character's preceding context, discarding the influence of the following context on that probability. The determined problem text is therefore based on the preceding context alone, which seriously affects the accuracy of problem text recognition.
Disclosure of Invention
The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium, so as to improve the accuracy of problem text recognition.
In a first aspect, the present application provides a question text recognition method, including:
dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
calculating the occurrence probability of each character by combining the context data corresponding to each character in each first sub-text;
calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
In a possible implementation manner of the first aspect of the embodiments of the present invention, the dividing the text to be recognized according to the preset segmentation rule to obtain the plurality of first sub-texts includes:
acquiring the text to be recognized;
preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
and dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the preset segmentation rule is segmentation according to punctuation marks, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts includes:
determining punctuation marks in the text to be recognized;
and dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the preset segmentation rule is divided according to a first preset character string length, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts includes:
determining a target starting character and a target ending character in the text to be recognized by combining the length of the first preset character string, wherein the target starting character is a first character in a matched character string, the target ending character is an ending character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, and all the matched character strings form the first sub-text;
and dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the calculating, by combining context data corresponding to each character in each of the first sub-texts, an occurrence probability of each character includes:
calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
and sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text includes:
dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
In a second aspect, the present application provides a problem text recognition apparatus, the apparatus comprising:
the first dividing module is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
the probability calculation module is used for calculating the occurrence probability of each character by combining context data corresponding to each character in each first sub-text;
the final confusion value calculation module is used for calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and the problem text determining module is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
the text acquisition module is used for acquiring the text to be recognized;
the preprocessing module is used for preprocessing the text to be recognized to obtain a normalized text, and the normalized text is a text with a preset text format;
and the first obtaining module is used for dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
the punctuation mark determining module is used for determining punctuation marks in the text to be recognized;
and the second obtaining module is used for dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
a target character determining module, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is an ending character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string of a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text;
and the third obtaining module is used for dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the probability calculating module includes:
the context data calculation module is used for calculating context data corresponding to each character in the first sub-text by utilizing a bidirectional probability language model;
and the appearance probability calculation module is used for sequentially shielding each character and calculating the appearance probability of each character by combining the context data of the shielded character.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the final confusion value calculating module includes:
the second dividing module is used for dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
the sub-confusion value calculating module is used for calculating the sub-confusion value of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and the maximum value determining module is used for determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor, and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the question text recognition method via execution of the executable instructions.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the question text recognition method.
The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium. A text to be recognized is divided according to a preset segmentation rule to obtain a plurality of first sub-texts. The occurrence probability of each character is then calculated from the context data corresponding to that character in its first sub-text, and a final confusion value is calculated for each first sub-text from the occurrence probabilities of its characters. Finally, the problem text in the text to be recognized is determined by comparing each final confusion value with a preset confusion threshold. Because the method determines the occurrence probability of each character from both its preceding and following context, the confusion value of the text is calculated more accurately, which in turn improves the accuracy of problem text recognition.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a question text recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of a text preprocessing method according to an embodiment of the present application;
fig. 3 is a flowchart of a text partitioning method according to an embodiment of the present application;
fig. 4 is a flowchart of another text partitioning method provided in the embodiment of the present application;
fig. 5 is a flowchart of a method for calculating a character occurrence probability according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for calculating a final confusion value of a first sub-text according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a first problem text recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a second embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a third embodiment of a problem text recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a fourth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a fifth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a sixth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a question text recognition method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
and S1, dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts.
The text to be recognized provided in the application includes text in various forms, such as electronic documents, pictures, and tables. Because the text to be recognized can come in many formats, its format may not conform to the input format expected by a computing machine or computing model, which easily leads to calculation errors. The text to be recognized therefore needs to be preprocessed before it is input into the computing machine or computing model.
Specifically, as shown in fig. 2, a flowchart of a text preprocessing method provided in an embodiment of the present application is shown, where the method includes:
s101, acquiring the text to be recognized;
s102, preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
s103, dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
The text to be recognized may be only part of a larger text; in that case it needs to be acquired accurately, so that missed detections or duplicate detections do not occur.
Preprocessing mainly turns the text to be recognized into a normalized text with a preset text format. For example, HTML tag characters contained in the text, such as the rich-text tags <br/> and <div>, are removed; these tag characters can be removed by regular-expression matching. If English characters appear in the text, they can all be unified into lower case (or all into upper case) so that character formats are consistent. For English words and proper numbers appearing in the text, such as "word" or "2019", the whole word or number needs to be treated as one segmented unit; treating each such unit as a single character prevents the minimal semantic unit from being wrongly split during the later division of the text to be recognized, which would cause semantic-analysis errors and low problem text recognition accuracy.
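The normalization steps described above (tag removal by regular-expression matching and case unification) can be sketched as follows; the exact tag pattern and the choice of lower case are assumptions for illustration:

```python
import re

def normalize_text(text: str) -> str:
    """Produce a normalized text with a preset format, as described above.

    The tag pattern and casing policy are illustrative assumptions; the
    description only specifies removing rich-text tags such as <br/> and
    <div> by regular-expression matching and unifying English case.
    """
    # Remove HTML/rich-text tags such as <br/>, <div>, </div>.
    text = re.sub(r"<[^>]+>", "", text)
    # Unify English characters into a single case (lower case chosen here).
    text = text.lower()
    # Collapse redundant whitespace left behind by tag removal.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("Hello <br/> WORLD <div>2019</div>"))  # hello world 2019
```

Whole English words and numbers such as "word" or "2019" would additionally be kept as single segmentation units by the tokenizer, which this sketch leaves out.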
Further, for a text to be recognized that contains a large number of characters, inputting all of it into a computing machine or computing model at once seriously increases the computational load. Moreover, if the problem text is only a single character, recognizing a large number of characters at the same time makes the problem inconspicuous and reduces recognition accuracy. The text to be recognized therefore needs to be divided into first sub-texts before problem text recognition is performed.
Specifically, as shown in fig. 3, a flowchart of a text division method provided in an embodiment of the present application is shown, where the method includes:
s111, determining punctuation marks in the text to be recognized;
and S112, dividing the text to be recognized by taking the punctuations as nodes to obtain a plurality of first sub-texts.
In one embodiment, segmentation by punctuation can be set as the preset segmentation rule. For example, if the text to be recognized is "I am hungry, I want to eat hail and pie.", the punctuation marks "," and "." are determined, and dividing the text to be recognized at these punctuation nodes yields the first sub-texts "I am hungry" and "I want to eat hail and pie".
It should be noted that the punctuation marks used in the embodiments of the present application may be ordinary text punctuation marks, such as ",", etc., or may be some specific symbols, such as "¥ #%" etc.
The text division method provided by the embodiment of the application can effectively ensure the integrity of the first sub-text, and further ensure the semantic integrity of each character in the first sub-text, so that the calculation accuracy of the occurrence probability of each character is ensured.
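Punctuation-node division (S111 and S112) can be sketched as follows; the description allows ordinary punctuation marks or special symbols as nodes, so the particular set used here is an assumption:

```python
import re

# Punctuation marks used as segmentation nodes; the set is configurable and
# may also include special symbols such as "¥", "#", or "%".
SPLIT_PUNCT = r"[,.!?;，。！？；]"

def split_by_punctuation(text: str) -> list[str]:
    """Divide the text to be recognized at punctuation nodes into first sub-texts."""
    parts = re.split(SPLIT_PUNCT, text)
    return [p.strip() for p in parts if p.strip()]

print(split_by_punctuation("I am hungry, I want to eat hail and pie."))
# ['I am hungry', 'I want to eat hail and pie']
```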
However, this division method cannot be used if the text contains no punctuation. Moreover, if the number of characters between two adjacent punctuation marks varies greatly, the resulting first sub-texts also differ greatly in length, and if a first sub-text still contains a large number of characters the division loses its effect. In such cases another text division method is required.
Specifically, as shown in fig. 4, a flowchart of another text partitioning method provided in the embodiment of the present application is shown, where the method includes:
s121, determining a target initial character and a target final character in the text to be recognized by combining the length of the first preset character string, wherein the target initial character is a first character in a matched character string, the target final character is a last character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, all the matched character strings form the first sub-text, and the length of the first preset character string is larger than the length of the second preset character string;
and S122, dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
In one embodiment, segmentation by a first preset character-string length can be set as the preset segmentation rule; that is, the text to be recognized is divided into first sub-texts of equal character length. Since the first preset character-string length is not necessarily equal to the character length of each sentence in the text to be recognized, a first sub-text can easily fail to be a complete sentence. For example, dividing the text to be recognized "I am hungry, I want to eat hail and pie." with a first preset character-string length of 7 can yield the first sub-texts "I am hungry, I want to eat" and "hail and pie.". As can be seen, the complete statement "I want to eat hail and pie." is split across two sub-texts. Such a division can affect the expression of semantics and thus the accuracy of the subsequent calculation of each character's occurrence probability.
To reduce the influence of text division on the consistency of semantic expression as much as possible, an overlapping part can exist between two adjacent first sub-texts. For example, with the first preset character-string length still set to 7, the divided first sub-texts can instead be "I am hungry, I want to eat hail" and "eat hail and pie.". Here "eat hail" is the overlapping string: although "hail" is not matched with its correct context in the first sub-text "I am hungry, I want to eat hail", it is matched with the correct context in "eat hail and pie.". The consistency of semantic expression is thus better protected.
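The overlapping-window division can be sketched as follows; the concrete window and overlap lengths are illustrative assumptions:

```python
def split_overlapping(text: str, window: int = 7, overlap: int = 2) -> list[str]:
    """Divide text into first sub-texts of a preset length with overlap.

    window  : first preset character-string length
    overlap : length of the overlapping string shared by adjacent sub-texts
    """
    assert 0 <= overlap < window, "overlap must be shorter than the window"
    step = window - overlap
    subs = []
    i = 0
    while i < len(text):
        subs.append(text[i:i + window])
        if i + window >= len(text):
            break  # last window reached the end of the text
        i += step
    return subs

print(split_overlapping("abcdefghijkl", window=7, overlap=2))
# ['abcdefg', 'fghijkl']  (adjacent sub-texts share the overlap "fg")
```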
And S2, calculating the probability of occurrence of each character by combining the context data corresponding to each character in each first sub-text.
According to the following formula,

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P\big(w_i \mid \mathrm{context}_{bi}(w_i)\big)\right)$$

where $\mathrm{PPL}$ represents the confusion value, $W$ represents the first sub-text, $w_i$ represents the $i$-th character in the first sub-text, $N$ represents the character length of the first sub-text, and $\mathrm{context}_{bi}(w_i)$ represents the bidirectional context data of the occluded character.
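The per-character occurrence probabilities feed this confusion (perplexity) computation. A minimal sketch, assuming the standard pseudo-perplexity form, i.e. the exponential of the negative mean log-probability:

```python
import math

def confusion_value(char_probs: list[float]) -> float:
    """Confusion (perplexity) of a sub-text from the occurrence probability
    of each of its N characters: exp(-(1/N) * sum(log p_i)).

    The small floor keeps a zero probability (an impossible character) from
    producing an infinite value; the floor is an implementation assumption,
    not part of the patent text.
    """
    eps = 1e-12
    n = len(char_probs)
    log_sum = sum(math.log(max(p, eps)) for p in char_probs)
    return math.exp(-log_sum / n)

# A sub-text whose characters are all highly probable has low confusion;
# a single improbable character raises the value sharply.
print(confusion_value([0.9, 0.9, 0.9]))
print(confusion_value([0.9, 0.001, 0.9]))
```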
It can be seen that the confusion value of the first sub-text is related to the occurrence probability of each character in the first sub-text, and specifically, as shown in fig. 5, there is provided a flowchart of a method for calculating the occurrence probability of a character according to an embodiment of the present application, where the method includes:
s201, calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
s202, sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
Context data for each character in a first sub-text can be calculated with a bidirectional probabilistic language model, such as a Masked Language Model. For example, in the first sub-text "I am hungry, I want to eat hail", the context data of the character "hail" is calculated in the bidirectional probabilistic language model from what precedes it ("I am hungry, I want to eat"); in the first sub-text "eat hail and pie.", the context data of the same character "hail" is calculated from "eat" before it and "and pie." after it.
Each character is then masked in turn so that its occurrence probability can be calculated; masking in the embodiments of the application amounts to operations such as marking or hiding the character. For example, to calculate the occurrence probability of "hail", the character "hail" is masked in the first sub-text "I am hungry, I want to eat hail", leaving "I am hungry, I want to eat"; the occurrence probability of "hail" can then be calculated with the bidirectional probabilistic language model from the context data computed above, for example 0. Similarly, to calculate the occurrence probability of "hungry", the character "hungry" is masked in the same first sub-text, and its occurrence probability is calculated from its context data, for example 0.8.
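Steps S201 and S202 amount to a mask-and-score loop. The sketch below shows that loop with a toy scoring callable standing in for a trained bidirectional probabilistic language model (such as a Masked Language Model); the toy probabilities are assumptions for illustration only:

```python
def occurrence_probabilities(chars, bi_model):
    """Mask each character in turn and score it from its bidirectional context.

    `bi_model(ch, left, right)` must return P(ch | left context, right context);
    here it is any callable, standing in for a trained Masked Language Model.
    """
    probs = []
    for i, ch in enumerate(chars):
        left, right = chars[:i], chars[i + 1:]  # context of the masked character
        probs.append(bi_model(ch, left, right))
    return probs

# Toy stand-in model: characters appearing in a tiny "corpus" get a high
# probability, unseen ones a low one (an assumption for illustration only).
def toy_model(ch, left, right):
    corpus = "i am hungry"
    return 0.9 if ch in corpus else 0.01

print(occurrence_probabilities(list("i am"), toy_model))  # [0.9, 0.9, 0.9, 0.9]
```

In practice `bi_model` would be the bidirectional probabilistic language model itself, scoring the masked position from both sides at once.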
The character occurrence probability calculation method provided by the embodiment of the application can accurately obtain the occurrence probability of each character, so that the confusion value of the first sub-text can be accurately calculated.
And S3, calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text.
After calculating the probability of occurrence of each character in the first sub-text, the final confusion value for each first sub-text can be calculated by the formula mentioned above.
However, as can be seen from the above formula, calculating the confusion value of each first sub-text requires a large number of computations, which not only takes a long time but also puts heavy computational pressure on the machine. Moreover, if the character length of a first sub-text is too long, the averaging in the formula has a smoothing effect: when the problem text is short, a local problem inside the first sub-text cannot be detected accurately. To solve these problems, a moving-average (MovingAverage) method can be adopted: each first sub-text is further split into several texts of smaller character-string length, and the calculation is performed on those.
Specifically, as shown in fig. 6, there is provided a flowchart of a method for calculating a final confusion value of a first sub-text according to an embodiment of the present application, where the method includes:
S301, dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
S302, calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
S303, determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
Each first sub-text is then further divided according to the second preset character string length. For example, given the first sub-text "hail and pie" and a second preset character string length of 3, the division yields the second sub-texts "hail" and "pie", and the sub-confusion degree of each second sub-text is calculated with the bidirectional probabilistic language model.
Finally, the maximum of these sub-confusion values is taken as (approximately equal to) the final confusion value of the first sub-text, which strikes a balance between accuracy and performance.
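Steps S301 to S303 can be sketched as follows. Here `perplexity` is the standard character-level formula (the exponential of the mean negative log probability), which is assumed to correspond to the formula referenced above; the non-overlapping split, window length, and probability values are illustrative assumptions (a sliding window is an equally plausible reading of the MovingAverage idea).

```python
import math

def sub_texts(first_sub_text, window_len):
    """S301: divide a first sub-text into second sub-texts of the second
    preset character string length (non-overlapping in this sketch)."""
    return [first_sub_text[i:i + window_len]
            for i in range(0, len(first_sub_text), window_len)]

def perplexity(char_probs, eps=1e-12):
    """Character-level perplexity: exp of the mean negative log probability."""
    return math.exp(-sum(math.log(max(p, eps)) for p in char_probs)
                    / len(char_probs))

def final_confusion_value(char_probs, window_len):
    """S302-S303: compute the sub-confusion of each window of character
    probabilities and keep the maximum as the final confusion value."""
    windows = [char_probs[i:i + window_len]
               for i in range(0, len(char_probs), window_len)]
    return max(perplexity(w) for w in windows)

# Six characters: three likely ones followed by three very unlikely ones.
probs = [0.8, 0.8, 0.8, 0.001, 0.001, 0.001]
```

With a window length of 3, the faulty window scores a perplexity of 1000, while the perplexity averaged over the whole sub-text is only about 35 — exactly the smoothing effect that taking the per-window maximum avoids.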
S4, determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is larger than a preset confusion threshold value.
After the final confusion value of each first sub-text is obtained through the above calculation, the problem text in the text to be recognized can be determined accurately, so that the person who entered the text can revise and update it.
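A minimal sketch of the S4 thresholding step; the function name, sample sub-texts, and the threshold value are illustrative assumptions, not values given in the embodiment.

```python
def find_problem_texts(first_sub_texts, final_values, threshold):
    """S4: keep every first sub-text whose final confusion value is
    greater than the preset confusion threshold."""
    return [t for t, v in zip(first_sub_texts, final_values)
            if v > threshold]

# Illustrative values only: one fluent sub-text, one anomalous one.
flagged = find_problem_texts(["I am hungry", "I want to eat hail"],
                             [12.0, 480.0],
                             threshold=100.0)
```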
Fig. 7 is a schematic structural diagram of a first embodiment of a problem text recognition apparatus according to an embodiment of the present application, where the apparatus includes: the first dividing module 1 is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts; a probability calculation module 2, configured to calculate an occurrence probability of each character in combination with context data corresponding to each character in each first sub-text; a final confusion value calculating module 3, configured to calculate a final confusion value of each of the first sub-texts by combining an occurrence probability of each character in each of the first sub-texts; and the problem text determining module 4 is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
Fig. 8 is a schematic structural diagram of a second embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: the text acquisition module 11 is configured to acquire the text to be recognized; the preprocessing module 12 is configured to preprocess the text to be recognized to obtain a normalized text, where the normalized text is a text with a preset text format; the first obtaining module 13 is configured to divide the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
Fig. 9 is a schematic structural diagram of a third embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: a punctuation mark determining module 14, configured to determine punctuation marks in the text to be recognized; and a second obtaining module 15, configured to divide the text to be recognized by using the punctuation marks as nodes to obtain a plurality of first sub-texts.
Fig. 10 is a schematic structural diagram of a fourth embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: a target character determining module 16, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is a last character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string with a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text; a third obtaining module 17, configured to divide the text to be recognized by using the adjacent target start character and target end character as nodes, so as to obtain multiple first sub-texts.
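The overlapping division performed by the target character determining module and the third obtaining module can be sketched as follows; the function name and the parameter values are illustrative assumptions. Each matching string has the first preset length, and adjacent strings share an overlapping string of a preset length so that no problem spanning a boundary is missed.

```python
def split_with_overlap(text, window_len, overlap_len):
    """Divide the text to be recognized into matching character strings of
    the first preset length, with an overlapping string of a preset length
    between every two adjacent matching strings."""
    assert 0 <= overlap_len < window_len
    step = window_len - overlap_len  # stride between target starting characters
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + window_len])
        if start + window_len >= len(text):  # target ending character reached
            break
    return pieces
```

For example, with a first preset string length of 4 and an overlap of 2, "abcdefgh" is divided into "abcd", "cdef", and "efgh".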
Fig. 11 is a schematic structural diagram of a fifth embodiment of a problem text recognition apparatus provided in the embodiment of the present application, where the probability calculation module 2 includes: a context data calculation module 21, configured to calculate context data corresponding to each character in the first sub-text by using a bi-directional probabilistic language model; and the occurrence probability calculation module 22 is used for sequentially blocking each character and calculating the occurrence probability of each character by combining the context data of the blocked character.
Fig. 12 is a schematic structural diagram of a sixth embodiment of a problem text recognition apparatus according to an embodiment of the present application, where the final confusion value calculation module 3 includes: the second dividing module 31 is configured to divide each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text; a sub-confusion value calculating module 32, configured to calculate a sub-confusion of each of the second sub-texts by combining an occurrence probability of each character in each of the second sub-texts; a maximum value determining module 33, configured to determine a final confusion value of each of the first sub-texts, where the final confusion value is a maximum value of all the sub-confusion values corresponding to each of the first sub-texts.
Fig. 13 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention. The electronic device includes: a memory 101 and a processor 102;
a memory 101 for storing a computer program;
a processor 102 for executing a computer program stored in a memory to implement the question text recognition method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 101 may be separate or integrated with the processor 102.
When the memory 101 is a device independent of the processor 102, the electronic apparatus may further include:
a bus 103 for connecting the memory 101 and the processor 102.
The electronic device provided by the embodiment of the present invention may be configured to execute the problem text recognition method of any of the above embodiments; its implementation and technical effects are similar to those of the method embodiments and are not repeated here.
An embodiment of the present invention further provides a readable storage medium in which a computer program is stored; when at least one processor of an electronic device executes the computer program, the electronic device performs the question text recognition method according to any one of the above embodiments.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (14)

1. A question text recognition method, the method comprising:
dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
calculating the occurrence probability of each character by combining the context data corresponding to each character in each first sub-text;
calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
2. The method according to claim 1, wherein the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
acquiring the text to be recognized;
preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
and dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
3. The method according to claim 1, wherein the preset segmentation rule is segmentation according to punctuation marks, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
determining punctuation marks in the text to be recognized;
and dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
4. The method according to claim 1, wherein the preset segmentation rule is division according to a first preset character string length, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
determining a target starting character and a target ending character in the text to be recognized by combining the length of the first preset character string, wherein the target starting character is a first character in a matched character string, the target ending character is an ending character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, and all the matched character strings form the first sub-text;
and dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub texts.
5. The method according to claim 1, wherein said calculating a probability of occurrence for each of said characters in combination with context data corresponding to each of said characters in each of said first sub-texts comprises:
calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
and sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
6. The method of claim 1, wherein said calculating a final confusion value for each of said first sub-texts in combination with a probability of occurrence of each character in each of said first sub-texts comprises:
dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
7. An apparatus for question text recognition, the apparatus comprising:
the first dividing module is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
the probability calculation module is used for calculating the occurrence probability of each character by combining context data corresponding to each character in each first sub-text;
the final confusion value calculation module is used for calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and the problem text determining module is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
8. The apparatus of claim 7, wherein the first partitioning module comprises:
the text acquisition module is used for acquiring the text to be identified;
the preprocessing module is used for preprocessing the text to be recognized to obtain a normalized text, and the normalized text is a text with a preset text format;
and the first obtaining module is used for dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
9. The apparatus of claim 7, wherein the first partitioning module comprises:
the punctuation mark determining module is used for determining punctuation marks in the text to be recognized;
and the second obtaining module is used for dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
10. The apparatus of claim 7, wherein the first partitioning module comprises:
a target character determining module, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is an ending character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string of a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text;
and the third obtaining module is used for dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
11. The apparatus of claim 7, wherein the probability computation module comprises:
the context data calculation module is used for calculating context data corresponding to each character in the first sub-text by utilizing a bidirectional probability language model;
and the appearance probability calculation module is used for sequentially shielding each character and calculating the appearance probability of each character by combining the context data of the shielded character.
12. The apparatus of claim 7, wherein the final confusion value calculating module comprises:
the second dividing module is used for dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
the sub-confusion value calculating module is used for calculating the sub-confusion of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and the maximum value determining module is used for determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
13. An electronic device, characterized in that the electronic device comprises:
a processor, and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the question text recognition method of any one of claims 1-6 via execution of the executable instructions.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the question text recognition method according to any one of claims 1 to 6.
CN201911344917.2A 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium Active CN111144100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344917.2A CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911344917.2A CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111144100A true CN111144100A (en) 2020-05-12
CN111144100B CN111144100B (en) 2023-08-18

Family

ID=70519589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344917.2A Active CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111144100B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783458A (en) * 2020-08-20 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for detecting overlapping character errors
CN111881293A (en) * 2020-07-24 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112966509A (en) * 2021-04-16 2021-06-15 重庆度小满优扬科技有限公司 Text quality evaluation method and device, storage medium and computer equipment
CN113609864A (en) * 2021-08-06 2021-11-05 珠海市鸿瑞信息技术股份有限公司 Text semantic recognition processing system and method based on industrial control system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105095826A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Character recognition method and character recognition device
CN105589845A (en) * 2015-12-18 2016-05-18 北京奇虎科技有限公司 Junk text recognizing method, device and system
US20170323008A1 (en) * 2016-05-09 2017-11-09 Fujitsu Limited Computer-implemented method, search processing device, and non-transitory computer-readable storage medium
CN107861941A (en) * 2017-10-10 2018-03-30 武汉斗鱼网络科技有限公司 User's pet name authentic assessment method, storage medium, electronic equipment and system
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095826A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Character recognition method and character recognition device
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105589845A (en) * 2015-12-18 2016-05-18 北京奇虎科技有限公司 Junk text recognizing method, device and system
US20170323008A1 (en) * 2016-05-09 2017-11-09 Fujitsu Limited Computer-implemented method, search processing device, and non-transitory computer-readable storage medium
CN107861941A (en) * 2017-10-10 2018-03-30 武汉斗鱼网络科技有限公司 User's pet name authentic assessment method, storage medium, electronic equipment and system
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881293A (en) * 2020-07-24 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN111881293B (en) * 2020-07-24 2023-11-07 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN111783458A (en) * 2020-08-20 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for detecting overlapping character errors
CN111783458B (en) * 2020-08-20 2024-05-03 支付宝(杭州)信息技术有限公司 Method and device for detecting character overlapping errors
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112528980B (en) * 2020-12-16 2022-02-15 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112966509A (en) * 2021-04-16 2021-06-15 重庆度小满优扬科技有限公司 Text quality evaluation method and device, storage medium and computer equipment
CN113609864A (en) * 2021-08-06 2021-11-05 珠海市鸿瑞信息技术股份有限公司 Text semantic recognition processing system and method based on industrial control system

Also Published As

Publication number Publication date
CN111144100B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN111144100B (en) Question text recognition method and device, electronic equipment and storage medium
US10599627B2 (en) Automatically converting spreadsheet tables to relational tables
EP3819785A1 (en) Feature word determining method, apparatus, and server
US20140350913A1 (en) Translation device and method
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN110413961B (en) Method and device for text scoring based on classification model and computer equipment
US11790174B2 (en) Entity recognition method and apparatus
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
EP4322009A1 (en) Test case generation method, apparatus and device
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
WO2020259280A1 (en) Log management method and apparatus, network device and readable storage medium
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
US20220414463A1 (en) Automated troubleshooter
CN111177375A (en) Electronic document classification method and device
CN112395866B (en) Customs clearance sheet data matching method and device
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN111492364A (en) Data labeling method and device and storage medium
CN109614494B (en) Text classification method and related device
CN113468315B (en) Vulnerability vendor name matching method
WO2021056740A1 (en) Language model construction method and system, computer device and readable storage medium
US20210342521A1 (en) Learning device, extraction device, and learning method
CN112182235A (en) Method and device for constructing knowledge graph, computer equipment and storage medium
CN113934842A (en) Text clustering method and device and readable storage medium
CN112101019A (en) Requirement template conformance checking optimization method based on part-of-speech tagging and chunk analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant