CN111144100A - Question text recognition method and device, electronic equipment and storage medium - Google Patents

Question text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN111144100A (granted as CN111144100B)
Application number: CN201911344917.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 赵忠信
Assignee (original and current): Wuba Co Ltd
Legal status: Granted, active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium. A text to be recognized is divided according to a preset segmentation rule to obtain a plurality of first sub-texts. The occurrence probability of each character is then calculated from the context data corresponding to that character in its first sub-text, and a final confusion value is calculated for each first sub-text from the occurrence probabilities of its characters. Finally, the problem text in the text to be recognized is determined by comparing each final confusion value with a preset confusion threshold. Because the method determines the occurrence probability of each character from both its preceding and following context, the confusion value of the text is calculated more accurately, which in turn improves the accuracy of problem text recognition.

Description

Question text recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a problem text recognition method and apparatus, an electronic device, and a storage medium.
Background
Text errors arise from many factors, such as human input mistakes, data system faults, and network instability; for example, a text may contain wrongly written characters, missing characters, extra characters, or garbled codes. Such errors reduce text quality and can convey ambiguous or even wrong information to the user. The problem text therefore needs to be located accurately so that the person who entered the text can be prompted to correct it.
Generally, problem text can be determined by collecting a large amount of low-quality text data, summarizing its regularities at the character level, and organizing them into unified language rules that represent text data sharing a specific pattern; the text to be recognized is then checked against these unified language rules, and text that violates them is flagged as problem text. Alternatively, a traditional language model can be used to predict the occurrence probability of each character in the text to be recognized, from which the confusion value of each sentence is calculated, and the problem text is determined according to that confusion value.
However, the unified-language-rule method is highly limited: it can only verify text to be recognized that matches a specific pattern. The traditional-language-model method, because of the way such models are built, can calculate the occurrence probability of each character only from the character's preceding context, discarding the influence of the following context on that probability. The determined problem text is therefore based on the preceding context alone, which seriously affects the accuracy of problem text recognition.
Disclosure of Invention
The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium, so as to improve the accuracy of problem text recognition.
In a first aspect, the present application provides a question text recognition method, including:
dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
calculating the occurrence probability of each character by combining the context data corresponding to each character in each first sub-text;
calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
In a possible implementation manner of the first aspect of the embodiments of the present invention, the dividing the text to be recognized according to the preset segmentation rule to obtain the plurality of first sub-texts includes:
acquiring the text to be recognized;
preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
and dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the preset segmentation rule is segmentation according to punctuation marks, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts includes:
determining punctuation marks in the text to be recognized;
and dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the preset segmentation rule is divided according to a first preset character string length, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts includes:
determining a target starting character and a target ending character in the text to be recognized by combining the length of the first preset character string, wherein the target starting character is a first character in a matched character string, the target ending character is an ending character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, and all the matched character strings form the first sub-text;
and dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub texts.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the calculating, by combining context data corresponding to each character in each of the first sub-texts, an occurrence probability of each character includes:
calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
and sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
In a possible implementation manner of the first aspect of the embodiment of the present invention, the calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text includes:
dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
In a second aspect, the present application provides a problem text recognition apparatus, the apparatus comprising:
the first dividing module is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
the probability calculation module is used for calculating the occurrence probability of each character by combining context data corresponding to each character in each first sub-text;
the final confusion value calculation module is used for calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and the problem text determining module is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
the text acquisition module is used for acquiring the text to be recognized;
the preprocessing module is used for preprocessing the text to be recognized to obtain a normalized text, and the normalized text is a text with a preset text format;
and the first obtaining module is used for dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
the punctuation mark determining module is used for determining punctuation marks in the text to be recognized;
and the second obtaining module is used for dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the first dividing module includes:
a target character determining module, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is an ending character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string of a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text;
and the third obtaining module is used for dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the probability calculating module includes:
the context data calculation module is used for calculating context data corresponding to each character in the first sub-text by utilizing a bidirectional probability language model;
and the appearance probability calculation module is used for sequentially shielding each character and calculating the appearance probability of each character by combining the context data of the shielded character.
In a possible implementation manner of the second aspect of the embodiment of the present invention, the final confusion value calculating module includes:
the second dividing module is used for dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
the sub-confusion value calculating module is used for calculating the sub-confusion value of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and the maximum value determining module is used for determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor, and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the question text recognition method via execution of the executable instructions.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the question text recognition method.
The application provides a problem text recognition method and apparatus, an electronic device, and a storage medium. A text to be recognized is divided according to a preset segmentation rule to obtain a plurality of first sub-texts. The occurrence probability of each character is then calculated from the context data corresponding to that character in its first sub-text, and a final confusion value is calculated for each first sub-text from the occurrence probabilities of its characters. Finally, the problem text in the text to be recognized is determined by comparing each final confusion value with a preset confusion threshold. Because the method determines the occurrence probability of each character from both its preceding and following context, the confusion value of the text is calculated more accurately, which in turn improves the accuracy of problem text recognition.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a question text recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of a text preprocessing method according to an embodiment of the present application;
fig. 3 is a flowchart of a text partitioning method according to an embodiment of the present application;
fig. 4 is a flowchart of another text partitioning method provided in the embodiment of the present application;
fig. 5 is a flowchart of a method for calculating a character occurrence probability according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for calculating a final confusion value of a first sub-text according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a first problem text recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a second embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a third embodiment of a problem text recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a fourth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a fifth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a sixth embodiment of a question text recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a question text recognition method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
and S1, dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts.
The text to be recognized provided in the application includes text in various forms, such as electronic documents, pictures, and tables. Because the text to be recognized can come in many formats, its format may not conform to the input format expected by a computing machine or computing model, which easily leads to calculation errors. The text to be recognized therefore needs to be preprocessed before it is input into the computing machine or computing model.
Specifically, as shown in fig. 2, a flowchart of a text preprocessing method provided in an embodiment of the present application is shown, where the method includes:
s101, acquiring the text to be recognized;
s102, preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
s103, dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
The text to be recognized may be only part of a larger text; in that case it needs to be acquired accurately, so that missed detections or duplicate detections do not occur.
Preprocessing mainly turns the text to be recognized into a normalized text with a preset text format. For example, HTML tag characters contained in the text, such as the rich-text tags <br/> and <div>, are removed; these tag characters can be removed by regular-expression matching. If English characters appear in the text, they can all be unified into lower case (or all into upper case) so that character formats are consistent. For English words and proper numbers appearing in the text, such as "word" or "2019", the whole word or number needs to be treated as one segmented unit; treating each such unit as a single character prevents the minimal semantic unit from being wrongly split during the later division of the text to be recognized, which would cause semantic-analysis errors and low problem text recognition accuracy.
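The normalization steps described above (tag removal by regular-expression matching and case unification) can be sketched as follows; the exact tag pattern and the choice of lower case are assumptions for illustration:

```python
import re

def normalize_text(text: str) -> str:
    """Produce a normalized text with a preset format, as described above.

    The tag pattern and casing policy are illustrative assumptions; the
    description only specifies removing rich-text tags such as <br/> and
    <div> by regular-expression matching and unifying English case.
    """
    # Remove HTML/rich-text tags such as <br/>, <div>, </div>.
    text = re.sub(r"<[^>]+>", "", text)
    # Unify English characters into a single case (lower case chosen here).
    text = text.lower()
    # Collapse redundant whitespace left behind by tag removal.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("Hello <br/> WORLD <div>2019</div>"))  # hello world 2019
```

Whole English words and numbers such as "word" or "2019" would additionally be kept as single segmentation units by the tokenizer, which this sketch leaves out.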
Further, for a text to be recognized that contains a large number of characters, inputting all of it into a computing machine or computing model at once seriously increases the computational load. Moreover, if the problem text is only a single character, recognizing a large number of characters at the same time makes the problem inconspicuous and reduces recognition accuracy. The text to be recognized therefore needs to be divided into first sub-texts before problem text recognition is performed.
Specifically, as shown in fig. 3, a flowchart of a text division method provided in an embodiment of the present application is shown, where the method includes:
s111, determining punctuation marks in the text to be recognized;
and S112, dividing the text to be recognized by taking the punctuations as nodes to obtain a plurality of first sub-texts.
In one embodiment, segmentation by punctuation can be set as the preset segmentation rule. For example, if the text to be recognized is "I am hungry, I want to eat hail and pie.", the punctuation marks "," and "." are determined, and dividing the text to be recognized at these punctuation nodes yields the first sub-texts "I am hungry" and "I want to eat hail and pie".
It should be noted that the punctuation marks used in the embodiments of the present application may be ordinary text punctuation marks, such as ",", etc., or may be some specific symbols, such as "¥ #%" etc.
The text division method provided by the embodiment of the application can effectively ensure the integrity of the first sub-text, and further ensure the semantic integrity of each character in the first sub-text, so that the calculation accuracy of the occurrence probability of each character is ensured.
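Punctuation-node division (S111 and S112) can be sketched as follows; the description allows ordinary punctuation marks or special symbols as nodes, so the particular set used here is an assumption:

```python
import re

# Punctuation marks used as segmentation nodes; the set is configurable and
# may also include special symbols such as "¥", "#", or "%".
SPLIT_PUNCT = r"[,.!?;，。！？；]"

def split_by_punctuation(text: str) -> list[str]:
    """Divide the text to be recognized at punctuation nodes into first sub-texts."""
    parts = re.split(SPLIT_PUNCT, text)
    return [p.strip() for p in parts if p.strip()]

print(split_by_punctuation("I am hungry, I want to eat hail and pie."))
# ['I am hungry', 'I want to eat hail and pie']
```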
However, this division method cannot be used if the text contains no punctuation. Moreover, if the number of characters between two adjacent punctuation marks varies greatly, the resulting first sub-texts also differ greatly in length, and if a first sub-text still contains a large number of characters the division loses its effect. In such cases another text division method is required.
Specifically, as shown in fig. 4, a flowchart of another text partitioning method provided in the embodiment of the present application is shown, where the method includes:
s121, determining a target initial character and a target final character in the text to be recognized by combining the length of the first preset character string, wherein the target initial character is a first character in a matched character string, the target final character is a last character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, all the matched character strings form the first sub-text, and the length of the first preset character string is larger than the length of the second preset character string;
and S122, dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
In one embodiment, segmentation by a first preset character-string length can be set as the preset segmentation rule; that is, the text to be recognized is divided into first sub-texts of equal character length. Since the first preset character-string length is not necessarily equal to the character length of each sentence in the text to be recognized, a first sub-text can easily fail to be a complete sentence. For example, dividing the text to be recognized "I am hungry, I want to eat hail and pie." with a first preset character-string length of 7 can yield the first sub-texts "I am hungry, I want to eat" and "hail and pie.". As can be seen, the complete statement "I want to eat hail and pie." is split across two sub-texts. Such a division can affect the expression of semantics and thus the accuracy of the subsequent calculation of each character's occurrence probability.
To reduce the influence of text division on the consistency of semantic expression as much as possible, an overlapping part can exist between two adjacent first sub-texts. For example, with the first preset character-string length still set to 7, the divided first sub-texts can instead be "I am hungry, I want to eat hail" and "eat hail and pie.". Here "eat hail" is the overlapping string: although "hail" is not matched with its correct context in the first sub-text "I am hungry, I want to eat hail", it is matched with the correct context in "eat hail and pie.". The consistency of semantic expression is thus better protected.
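The overlapping-window division can be sketched as follows; the concrete window and overlap lengths are illustrative assumptions:

```python
def split_overlapping(text: str, window: int = 7, overlap: int = 2) -> list[str]:
    """Divide text into first sub-texts of a preset length with overlap.

    window  : first preset character-string length
    overlap : length of the overlapping string shared by adjacent sub-texts
    """
    assert 0 <= overlap < window, "overlap must be shorter than the window"
    step = window - overlap
    subs = []
    i = 0
    while i < len(text):
        subs.append(text[i:i + window])
        if i + window >= len(text):
            break  # last window reached the end of the text
        i += step
    return subs

print(split_overlapping("abcdefghijkl", window=7, overlap=2))
# ['abcdefg', 'fghijkl']  (adjacent sub-texts share the overlap "fg")
```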
And S2, calculating the probability of occurrence of each character by combining the context data corresponding to each character in each first sub-text.
According to the following formula,

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P\big(w_i \mid \mathrm{context}_{bi}(w_i)\big)\right)$$

where $\mathrm{PPL}$ represents the confusion value, $W$ represents the first sub-text, $w_i$ represents the $i$-th character in the first sub-text, $N$ represents the character length of the first sub-text, and $\mathrm{context}_{bi}(w_i)$ represents the bidirectional context data of the occluded character.
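The per-character occurrence probabilities feed this confusion (perplexity) computation. A minimal sketch, assuming the standard pseudo-perplexity form, i.e. the exponential of the negative mean log-probability:

```python
import math

def confusion_value(char_probs: list[float]) -> float:
    """Confusion (perplexity) of a sub-text from the occurrence probability
    of each of its N characters: exp(-(1/N) * sum(log p_i)).

    The small floor keeps a zero probability (an impossible character) from
    producing an infinite value; the floor is an implementation assumption,
    not part of the patent text.
    """
    eps = 1e-12
    n = len(char_probs)
    log_sum = sum(math.log(max(p, eps)) for p in char_probs)
    return math.exp(-log_sum / n)

# A sub-text whose characters are all highly probable has low confusion;
# a single improbable character raises the value sharply.
print(confusion_value([0.9, 0.9, 0.9]))
print(confusion_value([0.9, 0.001, 0.9]))
```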
It can be seen that the confusion value of the first sub-text is related to the occurrence probability of each character in the first sub-text, and specifically, as shown in fig. 5, there is provided a flowchart of a method for calculating the occurrence probability of a character according to an embodiment of the present application, where the method includes:
s201, calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
s202, sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
Context data for each character in a first sub-text can be calculated with a bidirectional probabilistic language model, such as a Masked Language Model. For example, in the first sub-text "I am hungry, I want to eat hail", the context data of the character "hail" is calculated in the bidirectional probabilistic language model from what precedes it ("I am hungry, I want to eat"); in the first sub-text "eat hail and pie.", the context data of the same character "hail" is calculated from "eat" before it and "and pie." after it.
Each character is then masked in turn so that its occurrence probability can be calculated; masking in the embodiments of the application amounts to operations such as marking or hiding the character. For example, to calculate the occurrence probability of "hail", the character "hail" is masked in the first sub-text "I am hungry, I want to eat hail", leaving "I am hungry, I want to eat"; the occurrence probability of "hail" can then be calculated with the bidirectional probabilistic language model from the context data computed above, for example 0. Similarly, to calculate the occurrence probability of "hungry", the character "hungry" is masked in the same first sub-text, and its occurrence probability is calculated from its context data, for example 0.8.
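Steps S201 and S202 amount to a mask-and-score loop. The sketch below shows that loop with a toy scoring callable standing in for a trained bidirectional probabilistic language model (such as a Masked Language Model); the toy probabilities are assumptions for illustration only:

```python
def occurrence_probabilities(chars, bi_model):
    """Mask each character in turn and score it from its bidirectional context.

    `bi_model(ch, left, right)` must return P(ch | left context, right context);
    here it is any callable, standing in for a trained Masked Language Model.
    """
    probs = []
    for i, ch in enumerate(chars):
        left, right = chars[:i], chars[i + 1:]  # context of the masked character
        probs.append(bi_model(ch, left, right))
    return probs

# Toy stand-in model: characters appearing in a tiny "corpus" get a high
# probability, unseen ones a low one (an assumption for illustration only).
def toy_model(ch, left, right):
    corpus = "i am hungry"
    return 0.9 if ch in corpus else 0.01

print(occurrence_probabilities(list("i am"), toy_model))  # [0.9, 0.9, 0.9, 0.9]
```

In practice `bi_model` would be the bidirectional probabilistic language model itself, scoring the masked position from both sides at once.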
The character occurrence probability calculation method provided by the embodiment of the application can accurately obtain the occurrence probability of each character, so that the confusion value of the first sub-text can be accurately calculated.
And S3, calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text.
After calculating the probability of occurrence of each character in the first sub-text, the final confusion value for each first sub-text can be calculated by the formula mentioned above.
However, as can be seen from the above formula, calculating the confusion value of each first sub-text requires a large number of computations, which not only takes a long time but also puts heavy computational pressure on the machine. Moreover, if the character length of a first sub-text is too long, the averaging in the formula has a smoothing effect: when the problem text is short, a local problem inside the first sub-text cannot be detected accurately. To solve these problems, a moving-average (MovingAverage) method can be adopted: each first sub-text is further split into several texts of smaller character-string length, and the calculation is performed on those.
Specifically, as shown in fig. 6, there is provided a flowchart of a method for calculating a final confusion value of a first sub-text according to an embodiment of the present application, where the method includes:
S301, dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
S302, calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
S303, determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
Each first sub-text is then further divided according to the second preset character string length. For example, given the first sub-text "hail and pie" and a second preset character string length of 3, the division yields the second sub-texts "hail" and "pie", and the sub-confusion degree of each second sub-text is calculated with the bidirectional probabilistic language model.
Finally, the maximum of these sub-confusion values is taken as (approximately equal to) the final confusion value of the first sub-text, which strikes a balance between accuracy and performance.
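Steps S301 to S303 can be sketched as follows. Here `perplexity` is the standard character-level formula (the exponential of the mean negative log probability), which is assumed to correspond to the formula referenced above; the non-overlapping split, window length, and probability values are illustrative assumptions (a sliding window is an equally plausible reading of the MovingAverage idea).

```python
import math

def sub_texts(first_sub_text, window_len):
    """S301: divide a first sub-text into second sub-texts of the second
    preset character string length (non-overlapping in this sketch)."""
    return [first_sub_text[i:i + window_len]
            for i in range(0, len(first_sub_text), window_len)]

def perplexity(char_probs, eps=1e-12):
    """Character-level perplexity: exp of the mean negative log probability."""
    return math.exp(-sum(math.log(max(p, eps)) for p in char_probs)
                    / len(char_probs))

def final_confusion_value(char_probs, window_len):
    """S302-S303: compute the sub-confusion of each window of character
    probabilities and keep the maximum as the final confusion value."""
    windows = [char_probs[i:i + window_len]
               for i in range(0, len(char_probs), window_len)]
    return max(perplexity(w) for w in windows)

# Six characters: three likely ones followed by three very unlikely ones.
probs = [0.8, 0.8, 0.8, 0.001, 0.001, 0.001]
```

With a window length of 3, the faulty window scores a perplexity of 1000, while the perplexity averaged over the whole sub-text is only about 35 — exactly the smoothing effect that taking the per-window maximum avoids.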
S4, determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is larger than a preset confusion threshold value.
After the final confusion value of each first sub-text is obtained through the above calculation, the problem text in the text to be recognized can be determined accurately, so that the person who entered the text can revise and update it.
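A minimal sketch of the S4 thresholding step; the function name, sample sub-texts, and the threshold value are illustrative assumptions, not values given in the embodiment.

```python
def find_problem_texts(first_sub_texts, final_values, threshold):
    """S4: keep every first sub-text whose final confusion value is
    greater than the preset confusion threshold."""
    return [t for t, v in zip(first_sub_texts, final_values)
            if v > threshold]

# Illustrative values only: one fluent sub-text, one anomalous one.
flagged = find_problem_texts(["I am hungry", "I want to eat hail"],
                             [12.0, 480.0],
                             threshold=100.0)
```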
Fig. 7 is a schematic structural diagram of a first embodiment of a problem text recognition apparatus according to an embodiment of the present application, where the apparatus includes: the first dividing module 1 is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts; a probability calculation module 2, configured to calculate an occurrence probability of each character in combination with context data corresponding to each character in each first sub-text; a final confusion value calculating module 3, configured to calculate a final confusion value of each of the first sub-texts by combining an occurrence probability of each character in each of the first sub-texts; and the problem text determining module 4 is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
Fig. 8 is a schematic structural diagram of a second embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: the text acquisition module 11 is configured to acquire the text to be recognized; the preprocessing module 12 is configured to preprocess the text to be recognized to obtain a normalized text, where the normalized text is a text with a preset text format; the first obtaining module 13 is configured to divide the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
Fig. 9 is a schematic structural diagram of a third embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: a punctuation mark determining module 14, configured to determine punctuation marks in the text to be recognized; and a second obtaining module 15, configured to divide the text to be recognized by using the punctuation marks as nodes to obtain a plurality of first sub-texts.
Fig. 10 is a schematic structural diagram of a fourth embodiment of a question text recognition apparatus provided in the embodiment of the present application, where the first partitioning module 1 includes: a target character determining module 16, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is a last character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string with a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text; a third obtaining module 17, configured to divide the text to be recognized by using the adjacent target start character and target end character as nodes, so as to obtain multiple first sub-texts.
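The overlapping division performed by the target character determining module and the third obtaining module can be sketched as follows; the function name and the parameter values are illustrative assumptions. Each matching string has the first preset length, and adjacent strings share an overlapping string of a preset length so that no problem spanning a boundary is missed.

```python
def split_with_overlap(text, window_len, overlap_len):
    """Divide the text to be recognized into matching character strings of
    the first preset length, with an overlapping string of a preset length
    between every two adjacent matching strings."""
    assert 0 <= overlap_len < window_len
    step = window_len - overlap_len  # stride between target starting characters
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + window_len])
        if start + window_len >= len(text):  # target ending character reached
            break
    return pieces
```

For example, with a first preset string length of 4 and an overlap of 2, "abcdefgh" is divided into "abcd", "cdef", and "efgh".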
Fig. 11 is a schematic structural diagram of a fifth embodiment of a problem text recognition apparatus provided in the embodiment of the present application, where the probability calculation module 2 includes: a context data calculation module 21, configured to calculate context data corresponding to each character in the first sub-text by using a bi-directional probabilistic language model; and the occurrence probability calculation module 22 is used for sequentially blocking each character and calculating the occurrence probability of each character by combining the context data of the blocked character.
Fig. 12 is a schematic structural diagram of a sixth embodiment of a problem text recognition apparatus according to an embodiment of the present application, where the final confusion value calculation module 3 includes: the second dividing module 31 is configured to divide each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text; a sub-confusion value calculating module 32, configured to calculate a sub-confusion of each of the second sub-texts by combining an occurrence probability of each character in each of the second sub-texts; a maximum value determining module 33, configured to determine a final confusion value of each of the first sub-texts, where the final confusion value is a maximum value of all the sub-confusion values corresponding to each of the first sub-texts.
Fig. 13 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention. The electronic device includes: a memory 101 and a processor 102;
a memory 101 for storing a computer program;
a processor 102 for executing a computer program stored in a memory to implement the question text recognition method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 101 may be separate or integrated with the processor 102.
When the memory 101 is a device independent of the processor 102, the electronic apparatus may further include:
a bus 103 for connecting the memory 101 and the processor 102.
The electronic device provided by the embodiment of the present invention may be configured to execute the problem text recognition method of any of the above embodiments; its implementation and technical effects are similar to those of the method embodiments and are not repeated here.
An embodiment of the present invention further provides a readable storage medium in which a computer program is stored; when at least one processor of an electronic device executes the computer program, the electronic device performs the question text recognition method according to any one of the above embodiments.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (14)

1. A question text recognition method, the method comprising:
dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
calculating the occurrence probability of each character by combining the context data corresponding to each character in each first sub-text;
calculating a final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
2. The method according to claim 1, wherein the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
acquiring the text to be recognized;
preprocessing the text to be recognized to obtain a normalized text, wherein the normalized text is a text with a preset text format;
and dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
3. The method according to claim 1, wherein the preset segmentation rule is segmentation according to punctuation marks, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
determining punctuation marks in the text to be recognized;
and dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
4. The method according to claim 1, wherein the preset segmentation rule is division according to a first preset character string length, and the dividing the text to be recognized according to the preset segmentation rule to obtain a plurality of first sub-texts comprises:
determining a target starting character and a target ending character in the text to be recognized by combining the length of the first preset character string, wherein the target starting character is a first character in a matched character string, the target ending character is an ending character in the matched character string, the matched character string is a character string with the length of the character string conforming to the length of the first preset character string, an overlapped character string with a preset length exists between every two adjacent matched character strings, and all the matched character strings form the first sub-text;
and dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub texts.
5. The method according to claim 1, wherein said calculating a probability of occurrence for each of said characters in combination with context data corresponding to each of said characters in each of said first sub-texts comprises:
calculating context data corresponding to each character in the first sub-text by using a bidirectional probability language model;
and sequentially shielding each character, and calculating the occurrence probability of each character by combining the context data of the shielded character.
6. The method of claim 1, wherein said calculating a final confusion value for each of said first sub-texts in combination with a probability of occurrence of each character in each of said first sub-texts comprises:
dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
calculating the sub-confusion degree of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
7. An apparatus for question text recognition, the apparatus comprising:
the first dividing module is used for dividing the text to be recognized according to a preset segmentation rule to obtain a plurality of first sub-texts;
the probability calculation module is used for calculating the occurrence probability of each character by combining context data corresponding to each character in each first sub-text;
the final confusion value calculation module is used for calculating the final confusion value of each first sub-text by combining the occurrence probability of each character in each first sub-text;
and the problem text determining module is used for determining a problem text in the text to be recognized, wherein the problem text is a first sub-text corresponding to a final confusion value which is greater than a preset confusion threshold value.
8. The apparatus of claim 7, wherein the first partitioning module comprises:
the text acquisition module is used for acquiring the text to be identified;
the preprocessing module is used for preprocessing the text to be recognized to obtain a normalized text, and the normalized text is a text with a preset text format;
and the first obtaining module is used for dividing the normalized text according to a preset segmentation rule to obtain a plurality of first sub-texts.
9. The apparatus of claim 7, wherein the first partitioning module comprises:
the punctuation mark determining module is used for determining punctuation marks in the text to be recognized;
and the second obtaining module is used for dividing the text to be recognized by taking the punctuation marks as nodes to obtain a plurality of first sub-texts.
10. The apparatus of claim 7, wherein the first partitioning module comprises:
a target character determining module, configured to determine, in combination with the first preset character string length, a target starting character and a target ending character in the text to be recognized, where the target starting character is a first character in a matching character string, and the target ending character is an ending character in the matching character string, where the matching character string is a character string whose character string length matches the first preset character string length, an overlapping character string of a preset length exists between every two adjacent matching character strings, and all the matching character strings constitute the first sub-text;
and the third obtaining module is used for dividing the text to be recognized by taking the adjacent target starting character and the target ending character as nodes to obtain a plurality of first sub-texts.
11. The apparatus of claim 7, wherein the probability computation module comprises:
the context data calculation module is used for calculating context data corresponding to each character in the first sub-text by utilizing a bidirectional probability language model;
and the appearance probability calculation module is used for sequentially shielding each character and calculating the appearance probability of each character by combining the context data of the shielded character.
12. The apparatus of claim 7, wherein the final confusion value calculating module comprises:
the second dividing module is used for dividing each first sub-text according to a second preset character string length to obtain a second sub-text corresponding to each first sub-text;
the sub-confusion value calculating module is used for calculating the sub-confusion of each second sub-text by combining the occurrence probability of each character in each second sub-text;
and the maximum value determining module is used for determining a final confusion value of each first sub-text, wherein the final confusion value is the maximum value of all the sub-confusion values corresponding to each first sub-text.
13. An electronic device, characterized in that the electronic device comprises:
a processor, and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the question text recognition method of any one of claims 1-6 via execution of the executable instructions.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the question text recognition method according to any one of claims 1 to 6.
CN201911344917.2A 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium Active CN111144100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344917.2A CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911344917.2A CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111144100A true CN111144100A (en) 2020-05-12
CN111144100B CN111144100B (en) 2023-08-18

Family

ID=70519589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344917.2A Active CN111144100B (en) 2019-12-24 2019-12-24 Question text recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111144100B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783458A (en) * 2020-08-20 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for detecting overlapping character errors
CN111881293A (en) * 2020-07-24 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112966509A (en) * 2021-04-16 2021-06-15 重庆度小满优扬科技有限公司 Text quality evaluation method and device, storage medium and computer equipment
CN113609864A (en) * 2021-08-06 2021-11-05 珠海市鸿瑞信息技术股份有限公司 Text semantic recognition processing system and method based on industrial control system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105095826A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Character recognition method and character recognition device
CN105589845A (en) * 2015-12-18 2016-05-18 北京奇虎科技有限公司 Junk text recognizing method, device and system
US20170323008A1 (en) * 2016-05-09 2017-11-09 Fujitsu Limited Computer-implemented method, search processing device, and non-transitory computer-readable storage medium
CN107861941A (en) * 2017-10-10 2018-03-30 武汉斗鱼网络科技有限公司 User's pet name authentic assessment method, storage medium, electronic equipment and system
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095826A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Character recognition method and character recognition device
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105589845A (en) * 2015-12-18 2016-05-18 北京奇虎科技有限公司 Junk text recognizing method, device and system
US20170323008A1 (en) * 2016-05-09 2017-11-09 Fujitsu Limited Computer-implemented method, search processing device, and non-transitory computer-readable storage medium
CN107861941A (en) * 2017-10-10 2018-03-30 武汉斗鱼网络科技有限公司 User's pet name authentic assessment method, storage medium, electronic equipment and system
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881293A (en) * 2020-07-24 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN111881293B (en) * 2020-07-24 2023-11-07 腾讯音乐娱乐科技(深圳)有限公司 Risk content identification method and device, server and storage medium
CN111783458A (en) * 2020-08-20 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for detecting overlapping character errors
CN111783458B (en) * 2020-08-20 2024-05-03 支付宝(杭州)信息技术有限公司 Method and device for detecting character overlapping errors
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112528980B (en) * 2020-12-16 2022-02-15 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112966509A (en) * 2021-04-16 2021-06-15 重庆度小满优扬科技有限公司 Text quality evaluation method and device, storage medium and computer equipment
CN113609864A (en) * 2021-08-06 2021-11-05 珠海市鸿瑞信息技术股份有限公司 Text semantic recognition processing system and method based on industrial control system

Also Published As

Publication number Publication date
CN111144100B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN111144100B (en) Question text recognition method and device, electronic equipment and storage medium
US10599627B2 (en) Automatically converting spreadsheet tables to relational tables
EP3819785A1 (en) Feature word determining method, apparatus, and server
US20140350913A1 (en) Translation device and method
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN110413961B (en) Method and device for text scoring based on classification model and computer equipment
US11790174B2 (en) Entity recognition method and apparatus
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
EP4322009A1 (en) Test case generation method, apparatus and device
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
WO2020259280A1 (en) Log management method and apparatus, network device and readable storage medium
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
US20220414463A1 (en) Automated troubleshooter
CN111177375A (en) Electronic document classification method and device
CN112395866B (en) Customs clearance sheet data matching method and device
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN111492364A (en) Data labeling method and device and storage medium
CN109614494B (en) Text classification method and related device
CN113468315B (en) Vulnerability vendor name matching method
WO2021056740A1 (en) Language model construction method and system, computer device and readable storage medium
US20210342521A1 (en) Learning device, extraction device, and learning method
CN112182235A (en) Method and device for constructing knowledge graph, computer equipment and storage medium
CN113934842A (en) Text clustering method and device and readable storage medium
CN112101019A (en) Requirement template conformance checking optimization method based on part-of-speech tagging and chunk analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant