CN110134940B - Method and device for training text recognition model and text continuity - Google Patents


Info

Publication number
CN110134940B
CN110134940B (application CN201910147725.6A / CN201910147725A)
Authority
CN
China
Prior art keywords
text
training
characteristic information
recognized
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910147725.6A
Other languages
Chinese (zh)
Other versions
CN110134940A (en)
Inventor
罗彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Electrical Engineering of CAS
Original Assignee
Institute of Electrical Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Electrical Engineering of CAS
Priority to CN201910147725.6A
Publication of CN110134940A
Application granted
Publication of CN110134940B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for training a text recognition model and for recognizing text coherence. The method for training the text recognition model comprises the following steps: acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text; extracting first training characteristic information from the first training text and second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with a disordered word order and the second training characteristic information is a text characteristic with a coherent word order; and training a support vector machine model with the first training characteristic information and the second training characteristic information to obtain a text recognition model. Because the text recognition model is built from characteristic information extracted from the training texts, the coherence of a text to be recognized can be recognized quickly, the efficiency of coherence recognition is markedly improved, manual coherence checking can be replaced, and a great deal of labor is saved.

Description

Method and device for training text recognition model and text continuity
Technical Field
The invention relates to the technical field of text recognition, and in particular to a method and device for training a text recognition model and recognizing text coherence.
Background
Text coherence refers to the natural semantic relationship between sentences, i.e., a consistent directionality shown among the components and levels of a complete discourse. Text coherence plays an important role in measuring whether an article or passage is expressed clearly and whether its sentences are logically related.
At present, text coherence is usually identified manually over an entire article or dialogue. Although manual identification of text coherence is accurate, it consumes a large amount of time and therefore a large amount of labor, and its identification efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for training a text recognition model, so as to solve the problems that existing manual text coherence recognition consumes a great deal of labor and has low recognition efficiency.
According to a first aspect, an embodiment of the present invention provides a method for training a text recognition model, including:
acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text;
extracting first training characteristic information from the first training text, and extracting second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with a disordered word sequence, and the second training characteristic information is a text characteristic with a coherent word sequence;
and training a support vector machine model by using the first training characteristic information and the second training characteristic information to obtain a text recognition model.
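The training step above can be sketched as follows. This is a minimal, hypothetical illustration and not the patented implementation: it assumes scikit-learn's `SVC` as the support vector machine, feature vectors that are flattened probability matrices, and the labels 0 (disordered first texts) and 1 (coherent second texts).

```python
# Hypothetical training sketch. Assumptions (not from the patent):
# scikit-learn SVC as the SVM, flattened probability matrices as
# feature vectors, label 0 = disordered text, label 1 = coherent text.
import numpy as np
from sklearn.svm import SVC

def train_text_recognition_model(first_features, second_features):
    """first_features: feature vectors of word-order-disordered texts;
    second_features: feature vectors of coherent reference texts."""
    X = np.vstack([first_features, second_features])
    y = np.array([0] * len(first_features) + [1] * len(second_features))
    model = SVC(kernel="rbf")  # default RBF kernel; an assumption
    model.fit(X, y)
    return model
```

With such labels, the trained model separates disordered from coherent feature vectors, which is one possible way to realize the comparing-and-scoring step described in the fourth implementation manner.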
With reference to the first aspect, in a first implementation manner of the first aspect, the extracting first training feature information from the first training text, and extracting second training feature information from the second training text further includes:
performing word segmentation on the first training text to obtain a first word segmentation result, and performing word segmentation on the second training text to obtain a second word segmentation result;
acquiring a plurality of entity nouns in the first training text according to the first word segmentation result, and acquiring a plurality of entity nouns in the second training text according to the second word segmentation result;
confirming sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and confirming sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
obtaining a first transformation matrix of the first training text according to the sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and obtaining a second transformation matrix of the second training text according to the sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
and calculating a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain the first training characteristic information, and calculating a second probability matrix of each type of sentence structure in the second training text according to the second transformation matrix to obtain the second training characteristic information.
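One way to realize the steps of building a transformation matrix and a probability matrix is sketched below, under a simplifying assumption: segmentation and role labelling are taken as already done, so each entity noun comes with its role per sentence ('A' subject, 'B' other role or absent, 'C' object or predicative). The function names and the dict representation are illustrative, not from the patent.

```python
# Hedged sketch of the feature-extraction steps. Assumed input: for each
# entity noun, its role in each sentence ('A' = subject, 'B' = other
# role or absent, 'C' = object or predicative).
from collections import Counter
from itertools import product

def transition_counts(entity_roles, n=2):
    """Count each length-n role transition across adjacent sentences.

    entity_roles: dict mapping entity noun -> list of per-sentence roles,
    e.g. {"Tom": ['A', 'B', 'C']} for three sentences.
    """
    counts = Counter()
    for roles in entity_roles.values():
        for i in range(len(roles) - n + 1):
            counts["".join(roles[i:i + n])] += 1
    return counts

def probability_matrix(counts, n=2):
    """Normalize transition counts into a probability per transition type."""
    total = sum(counts.values())
    return {"".join(t): counts["".join(t)] / total if total else 0.0
            for t in product("ABC", repeat=n)}
```

The dict returned by `probability_matrix` plays the role of the probability matrix: one entry per sentence-structure type, summing to 1 over all observed transitions.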
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the segmenting the first training text to obtain a first segmentation result, and segmenting the second training text to obtain a second segmentation result further includes:
and identifying the part of speech of each word and/or the sentence structure of each sentence in the first training text to obtain the first segmentation result, and identifying the part of speech of each word and/or the sentence structure of each sentence in the second training text to obtain the second segmentation result.
With reference to the first aspect or the first implementation manner, in a third implementation manner of the first aspect, the step of calculating a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain the first training feature information, and calculating a second probability matrix of each type of sentence structure in the second training text according to a second transformation matrix to obtain the second training feature information further includes:
counting a first number of each type of sentence structure of the first training text, and counting a second number of each type of sentence structure of the second training text;
counting a third number of all types of sentence structures of the first training text taken together, and counting a fourth number of all types of sentence structures of the second training text taken together;
calculating a ratio of the first number to the third number based on the first number and the third number to obtain the first probability matrix, and calculating a ratio of the second number to the fourth number based on the second number and the fourth number to obtain the second probability matrix.
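A minimal sketch of the ratio computation above, with hypothetical counts: each "first number" (the count of one sentence-structure type) is divided by the "third number" (the count of all types taken together).

```python
# Illustrative ratio step; the counts below are hypothetical.
from collections import Counter

def structure_ratios(counts):
    """Map each sentence-structure type to its share of all occurrences."""
    total = sum(counts.values())  # the count of all types taken together
    return {k: v / total for k, v in counts.items()}
```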
With reference to the first aspect, in a fourth implementation manner of the first aspect, the training a support vector machine model by using the first training feature information and the second training feature information to obtain a text recognition model further includes:
inputting the first training feature information and the second training feature information into the support vector machine model;
and comparing and scoring the first training characteristic information and the second training characteristic information to obtain the text recognition model.
According to a second aspect, an embodiment of the present invention provides a method for recognizing text coherence, including:
acquiring a text to be recognized and a word sequence coherent text, wherein the word sequence coherent text is a reference text corresponding to the text to be recognized;
generating first characteristic information according to the text to be recognized, and generating second characteristic information according to the word order coherent text;
and inputting the first characteristic information and the second characteristic information into the text recognition model to obtain result information corresponding to the text to be recognized, wherein the result information is used for recognizing whether the text to be recognized has continuity.
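The recognition steps above can be sketched as below. This is a hedged illustration: the patent does not specify how the two feature vectors are fed to the model, so concatenating them into a single input is an assumption, and `model` stands for any trained classifier with a `predict()` method (such as the trained support vector machine).

```python
# Hypothetical inference sketch. Assumption: the first and second
# characteristic information are concatenated into one model input.
def recognize_coherence(model, first_features, second_features):
    """Return True if the text to be recognized is judged coherent."""
    combined = list(first_features) + list(second_features)
    return bool(model.predict([combined])[0] == 1)
```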
With reference to the second aspect, in a first implementation manner of the second aspect, the step of generating first feature information according to the text to be recognized and second feature information according to the word order consecutive text further includes:
performing word segmentation on the text to be recognized to obtain a first word segmentation result, and performing word segmentation on the word order coherent text to obtain a second word segmentation result;
acquiring a plurality of entity nouns in the text to be recognized according to the first word segmentation result, and acquiring a plurality of entity nouns in the word order coherent text according to the second word segmentation result;
confirming the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and confirming the sentence structure types of each entity noun in the language sequence coherent text in at least two adjacent sentences according to the second word segmentation result;
obtaining a first transformation matrix of the text to be recognized according to the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and obtaining a second transformation matrix of the word order consecutive text according to the sentence structure types of each entity noun in the word order consecutive text in at least two adjacent sentences;
and calculating a first probability matrix of each type of sentence structure in the text to be recognized according to the first transformation matrix to obtain the first characteristic information, and calculating a second probability matrix of each type of sentence structure in the word order coherent text according to the second transformation matrix to obtain the second characteristic information.
According to a third aspect, an embodiment of the present invention provides an apparatus for training a text recognition model, including:
a first acquisition module, used for acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text;
the extraction module is used for extracting first training characteristic information from the first training text and extracting second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with disordered word sequence, and the second training characteristic information is a text characteristic with continuous word sequence;
and the training module is used for training a support vector machine model by using the first training characteristic information and the second training characteristic information to obtain a text recognition model.
According to a fourth aspect, an embodiment of the present invention provides an apparatus for recognizing text consistency, including:
the second acquisition module is used for acquiring a text to be recognized and a word order coherent text, wherein the word order coherent text is a reference text corresponding to the text to be recognized;
the generating module is used for generating first characteristic information according to the text to be recognized and generating second characteristic information according to the word sequence coherent text;
and the result determining module is used for inputting the first characteristic information and the second characteristic information into a text recognition model to obtain result information corresponding to the text to be recognized, wherein the result information is used for recognizing whether the text to be recognized is coherent.
According to a fifth aspect, an embodiment of the present invention provides a storage medium having stored thereon computer instructions, which when executed by a processor, implement the method for training a text recognition model according to the first aspect or any one of the embodiments of the first aspect; or, the method for identifying text consistency as described in the second aspect or any one of the embodiments of the second aspect.
According to a sixth aspect, an embodiment of the present invention provides a text recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for training a text recognition model according to the first aspect or any one of the embodiments of the first aspect when executing the program; or, implementing the method for recognizing text consistency as described in the second aspect or any one of the embodiments of the second aspect.
The invention discloses a method and device for training a text recognition model and for recognizing text coherence. The method for training the text recognition model comprises: acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text; extracting first training characteristic information from the first training text and second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with a disordered word order and the second training characteristic information is a text characteristic with a coherent word order; and training a support vector machine model with the first and second training characteristic information to obtain a text recognition model. Because the text recognition model is built from characteristic information extracted from the training texts, the coherence of a text to be recognized can be recognized quickly, the efficiency of coherence recognition is markedly improved, manual coherence checking can be replaced, and a great deal of labor is saved.
The invention further provides a method and device for recognizing text coherence, wherein the method comprises: acquiring a text to be recognized and a word-order-coherent text, wherein the word-order-coherent text is a reference text corresponding to the text to be recognized; generating first characteristic information according to the text to be recognized and second characteristic information according to the word-order-coherent text; and inputting the first characteristic information and the second characteristic information into a text recognition model to obtain result information corresponding to the text to be recognized, wherein the result information is used for recognizing whether the text to be recognized is coherent. By inputting the first characteristic information of the text to be recognized and the second characteristic information of its corresponding word-order-coherent text into the text recognition model, the coherence of the text to be recognized can be recognized quickly, the efficiency of coherence recognition is markedly improved, manual coherence checking can be replaced, and a great deal of manual effort is saved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a first flowchart of a method for training a text recognition model according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a method for training a text recognition model according to an embodiment of the present invention;
FIG. 3A is a diagram illustrating recognition of a first training text according to an embodiment of the invention;
FIG. 3B is a diagram illustrating recognition of a second training text according to an embodiment of the invention;
FIG. 4 is a third flowchart of a method for training a text recognition model according to an embodiment of the present invention;
FIG. 5 is a fourth flowchart of a method for training a text recognition model according to an embodiment of the present invention;
FIG. 6 is a first flowchart of a method for identifying text coherence according to an embodiment of the present invention;
FIG. 7 is a second flowchart of a method for identifying text consistency in an embodiment of the present invention;
FIG. 8 is a block diagram of an apparatus for training a text recognition model according to an embodiment of the present invention;
FIG. 9 is a block diagram of an apparatus for identifying text coherence according to an embodiment of the present invention;
fig. 10 is a schematic diagram of a hardware structure of a text recognition device in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The embodiment of the invention provides a method for training a text recognition model, which comprises the following steps of:
step S1: and acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text. The first training texts represent a training text set formed by a plurality of training texts, each first training text corresponds to a reference training text, and the reference training texts are second training texts. For example: the number of the first training texts is 5, the number of the second training texts is 5, and each first training text corresponds to one second training text. Specifically, for example: if the first training text is a 100-word small prose, the second training text is a reference text of the 100-word small prose, and the 100-word small prose in the second training text is usually used for verifying the continuity of the 100-word small prose in the first training text; for example; if the first training text is a history record of 500 words, the second training text is a reference text of the history record of 500 words, and the history record of 500 words in the second training text is generally used for verifying the consistency of the history record of 500 words in the first training text; for example: if the first training text is written by 10000 characters, the second training text is a reference text for writing the 10000 characters, and the 10000 characters in the second training text are written to verify the consistency of the 10000 characters in the first training text; for example: if the first training text is the lyrics of 50 words, the second training text is a reference text of the lyrics of 50 words, and the lyrics of 50 words in the second training text are used for verifying the continuity of the lyrics of 50 words in the first training text; for example: the fifth training text is a 60-word english introduction, the second training text is a reference text of the 60-word english introduction, and the 60-word english introduction in the 
second training text is used for verifying the continuity of the 60-word english introduction in the first training text. Therefore, the first training text and the second training text in this embodiment are both paired and appear correspondingly, and may represent the same text, except that the second training text is used as a standard continuity text of the text.
Step S2: extract first training characteristic information from the first training text, and extract second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with a disordered word order and the second training characteristic information is a text characteristic with a coherent word order. The first training characteristic information comes from the first training text and usually represents text characteristic information with an incoherent word order; the second training characteristic information comes from the second training text, usually represents text characteristic information with a coherent word order, and corresponds to the first training characteristic information because the first training text corresponds to the second training text. For example, the second training text may be a 60-word English self-introduction used as the standard coherent reference text ("Hello! My name is Chenqing. My English name is Joy. I'm 14 years old. I have a happy family. …"). The first training characteristic information is then extracted from the word-order-disordered 60-word introduction of the first training text, and the second training characteristic information is extracted from the word-order-coherent 60-word introduction of the second training text.
In an embodiment, as shown in fig. 2, the step S2 may include the following steps in the execution process:
step S21: and performing word segmentation on the first training text to obtain a first word segmentation result, and performing word segmentation on the second training text to obtain a second word segmentation result. The word segmentation is mainly to identify the sentence structure or the part of speech of a word by using a word segmentation tool and segment the word. Specifically, identifying the part of speech of each word and/or the sentence structure of each sentence in the first training text results in a first segmentation result, for example: one sentence in the first training text is I very much like this fish, a first word segmentation result obtained by segmenting each word by identifying the part of speech of the word is called pronoun, adverb, verb and noun, and the corresponding sentence components are subject, subjects, predicates and objects. For example: one sentence in the first training text is a She said She would be a factor in the future, a first segmentation result obtained by segmenting the sentence by identifying the sentence structure of the sentence is a main sentence + a subordinate sentence, a first segmentation result obtained by segmenting the sentence by identifying the part of speech of each word is a person standing synonym in the main sentence, a person standing synonym in the subordinate sentence, a verb in the subordinate sentence, a noun in the subordinate sentence, and a noun in the subordinate sentence, wherein the corresponding sentence components of the first training text are the main sentence, the predicate in the main sentence, the predicate in the subordinate sentence and the object in the subordinate sentence. The word segmentation mode of the second training text is the same as that of the first training text, and is not described herein again.
Step S22: acquire a plurality of entity nouns in the first training text according to the first word segmentation result, and acquire a plurality of entity nouns in the second training text according to the second word segmentation result. Continuing the example from step S21: for the sentence "I very much like this fish", the first word segmentation result obtained by identifying the part of speech of each word is pronoun, adverb, verb and noun, the corresponding sentence components are subject, adverbial, predicate and object, and the entity nouns are then obtained from this first word segmentation result.
Step S23: confirming the sentence structure type of each entity noun in the first training text in at least two adjacent sentences, and confirming the sentence structure type of each entity noun in the second training text in at least two adjacent sentences.
In an embodiment, step S23 identifies the sentence structure type of each entity noun in the first training text in two adjacent sentences. The sentence structure type here is the role (subject, object, predicative or other component) that an entity noun, obtained by analyzing and screening the part of speech of each word in the first training text, plays in the two adjacent sentences. For example, the pronouns and nouns in the first word segmentation result are extracted as entity nouns; an entity noun serving as a subject is denoted by A, an entity noun serving as another component or not appearing in the current sentence is denoted by B, and an entity noun serving as an object or predicative is denoted by C.
For a sentence in the first training text consisting of a main clause and a subordinate clause, an entity noun serving as the subject of the main clause is denoted by AA, an entity noun serving as another component of the main clause or not appearing in the current sentence is denoted by AB, an entity noun serving as the object or predicative of the main clause is denoted by AC, an entity noun serving as the subject of the subordinate clause is denoted by CA, an entity noun serving as another component of the subordinate clause or not appearing in the current sentence is denoted by CB, and an entity noun serving as the object or predicative of the subordinate clause is denoted by CC. For "He is Tom. Tom is my friend.", the two sentences are adjacent, and the sentence structure types of each entity noun in the two adjacent sentences have the following 9 structure types: AA, AB, AC, BA, BB, BC, CA, CB, CC. Therefore, after the plurality of entity nouns in the first training text are obtained according to the first word segmentation result, the structure type of each entity noun in the first training text in two adjacent sentences can be confirmed.
In an embodiment, step S23 identifies the sentence structure type of each entity noun in the first training text in three adjacent sentences. The sentence structure type is the role (subject, object, predicative or other component) that an entity noun, obtained by analyzing and screening the part of speech of each word in the first training text, plays in each sentence, where a subject is denoted by A, another component or absence from the current sentence is denoted by B, and an object or predicative is denoted by C. For example, "He is Tom. Tom is my friend. Tom studies very hard." are three adjacent sentences. With the same coding, the sentence structure type of each entity noun in the three adjacent sentences has the following 27 structure types: AAA, AAC, AAB, ACA, ACC, ACB, ABA, ABC, ABB, CAA, CAC, CAB, CCA, CCC, CCB, CBA, CBC, CBB, BAA, BAC, BAB, BCA, BCC, BCB, BBA, BBC, BBB. Therefore, after the plurality of entity nouns in the first training text are obtained according to the first word segmentation result, the structure type of each entity noun in the first training text in three adjacent sentences can be confirmed.
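The 9 two-sentence and 27 three-sentence structure types enumerated above are simply the ordered combinations of the three roles; a short enumeration (illustrative, not from the patent) reproduces those counts.

```python
# Enumerate all length-n role sequences over A (subject),
# B (other role or absent), C (object or predicative).
from itertools import product

def structure_types(n):
    """All length-n sentence-structure type codes over the roles A, B, C."""
    return ["".join(p) for p in product("ABC", repeat=n)]
```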
In an embodiment, step S23 identifies the sentence structure type of each entity noun in the first training text in two adjacent sentences that contain main and subordinate clauses. The sentence structure type is the role (subject, object, predicative or other component) that an entity noun, obtained by analyzing and screening the structure of each sentence in the first training text, plays in the two adjacent sentences. For example, if the first word segmentation result obtained by identifying the structure of a sentence is a main clause plus a subordinate clause, the subject of the main clause is denoted by AA, another component of the main clause or absence from the current sentence is denoted by AB, the object or predicative of the main clause is denoted by AC, the subject of the subordinate clause is denoted by CA, another component of the subordinate clause or absence from the current sentence is denoted by CB, and the object or predicative of the subordinate clause is denoted by CC. The sentence structure types of each entity noun in two adjacent sentences of this kind then have the following 36 structure types: AABAA, AABAC, AABAB, AABCA, AABCC, AABCB, ACBAA, ACBAC, ACBAB, ACBCA, ACBCC, ACBCB, ABBAA, ABBAC, ABBAB, ABBCA, ABBCC, ABBCB, CABAA, CABAC, CABAB, CABCA, CABCC, CABCB, CCBAA, CCBAC, CCBAB, CCBCA, CCBCC, CCBCB, CBBAA, CBBAC, CBBAB, CBBCA, CBBCC, CBBCB.
Similarly, in step S23, sentence structure types of each entity noun in the second training text in at least two adjacent sentences are determined. Here, the process of specifically confirming the sentence structure type of each entity noun in at least two adjacent sentences in the second training text is the same as the confirmation process of the first training text, and is not repeated herein.
Step S24: and obtaining a first transformation matrix of the first training text according to the sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and obtaining a second transformation matrix of the second training text according to the sentence structure types of each entity noun in the second training text in at least two adjacent sentences.
For example: for three adjacent sentences in the first training text:
Tom is a doctor,he loves art very much,he is Tom.
firstly, word segmentation is performed on the three sentences, and a plurality of entity nouns are then obtained according to the word segmentation result, namely He, Tom, doctor, and art, whose sentence structure types in the three adjacent sentences are BAA, ABC, CBB, and BCB respectively.
The first transformation matrix of the first training text is:
(He: BAA; Tom: ABC; doctor: CBB; art: BCB)
specifically, as shown in fig. 3A.
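The construction of such a transformation matrix can be sketched as follows; the role assignments are taken from the example above, while the function and variable names are illustrative assumptions:

```python
# Each entity noun's role (A/B/C) in each of the three adjacent sentences
# "Tom is a doctor, he loves art very much, he is Tom."
roles_per_sentence = {
    "he":     ("B", "A", "A"),  # absent, subject, subject
    "Tom":    ("A", "B", "C"),  # subject, absent, predicative
    "doctor": ("C", "B", "B"),  # predicative, absent, absent
    "art":    ("B", "C", "B"),  # absent, object, absent
}

def transformation_matrix(roles):
    """One row per entity noun; each row is the noun's role sequence,
    i.e. its sentence structure type."""
    return {noun: "".join(seq) for noun, seq in roles.items()}
```

Applying `transformation_matrix` to the roles above reproduces the types BAA, ABC, CBB, and BCB listed in the example.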
Because the first training text contains 3 sentences, the corresponding second training text also contains 3 sentences, which serve as the word order coherent text of the first training text.
firstly, word segmentation is performed on the three sentences, and a plurality of entity nouns are then obtained according to the word segmentation result, namely He, Tom, doctor, and art, whose sentence structure types in the three adjacent sentences are ABA, CAB, BCB, and BBC respectively.
The second transformation matrix of the second training text is:
(He: ABA; Tom: CAB; doctor: BCB; art: BBC)
specifically, as shown in fig. 3B.
Step S25: and calculating a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain first training characteristic information, and calculating a second probability matrix of each type of sentence structure in the second training text according to the second transformation matrix to obtain second training characteristic information. Here, the first training feature information and the second training feature information appear in pairs.
In an embodiment, in the process of executing step S25, as shown in fig. 4, the method may specifically include the following steps:
step S251: a first number of each type of sentence structure of the first training text is counted, and a second number of each type of sentence structure of the second training text is counted. For example, the first training text contains 20 sentence structure type instances in total, spread over 5 types: 3 of type ABA, 5 of type BBA, 4 of type CBB, 3 of type BBB, and 5 of type BBC. The count of each type of sentence structure in the first training text is taken as the first number. Similarly, the count of each type of sentence structure in the second training text is taken as the second number.
Step S252: a third number of the multiple types of sentence structures of the first training text is counted, and a fourth number of the multiple types of sentence structures of the second training text is counted. For example, if the first training text contains 5 types of sentence structure with 50 instances counted in total across all types, 50 is taken as the third number; similarly, if the second training text contains 5 types of sentence structure with 50 instances counted in total, 50 is taken as the fourth number.
Step S253: and calculating the ratio of the first quantity to the third quantity according to the first quantity and the third quantity to obtain a first probability matrix, and calculating the ratio of the second quantity to the fourth quantity according to the second quantity and the fourth quantity to obtain a second probability matrix. For example: as shown in table 1 below.
TABLE 1
(Table 1 lists the probability of each type of sentence structure in the first and second training texts.)
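Steps S251-S253 amount to normalizing the per-type counts by the total count. A hedged sketch, with illustrative function names:

```python
from collections import Counter

def probability_matrix(structure_types):
    """Ratio of each sentence structure type's count (the first/second
    number) to the total count over all types (the third/fourth number)."""
    counts = Counter(structure_types)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# The example from step S251: 3 ABA, 5 BBA, 4 CBB, 3 BBB, 5 BBC (20 total).
types = ["ABA"] * 3 + ["BBA"] * 5 + ["CBB"] * 4 + ["BBB"] * 3 + ["BBC"] * 5
probs = probability_matrix(types)
```

For instance, type BBA occurs 5 times out of 20 instances, giving the probability 0.25, and the probabilities of all types sum to 1.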
Step S3: the support vector machine model is trained by using the first training characteristic information and the second training characteristic information to obtain a text recognition model. The first training characteristic information obtained through the first probability matrix and the second training characteristic information obtained through the second probability matrix are input into a support vector machine model for training and learning to obtain the text recognition model. A Support Vector Machine (SVM) is a learning machine based on statistical learning theory and the structural risk minimization principle: given limited sample information, it seeks an optimal compromise between model complexity (i.e., accuracy on the given training samples) and learning capability (i.e., the ability to classify unseen samples without error) so as to obtain the strongest classification capability.
In an embodiment, step S3 may specifically include the following steps, as shown in fig. 5:
Step S31: the first training feature information and the second training feature information are input into a support vector machine model. Here, the first training feature information obtained by the first probability matrix and the second training feature information obtained by the second probability matrix are input to the support vector machine model, and the first training feature information and the second training feature information appear in pairs.
Step S32: the first training characteristic information is compared with the second training characteristic information and scored to obtain the text recognition model. Because the second training characteristic information serves as the reference for the first training characteristic information, it is used as the reference standard: the first training characteristic information is compared against it and scored to obtain a text recognition model, which can then be used to recognize the text coherence of a text to be recognized.
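As a hedged sketch of steps S31-S32, the paired feature vectors can be fed to an off-the-shelf SVM such as scikit-learn's `SVC`; the feature vectors and labels below are illustrative stand-ins for flattened probability-matrix rows, not values from the embodiment:

```python
from sklearn.svm import SVC

# Label 0 = shuffled word order (first training characteristic information),
# label 1 = coherent word order (second training characteristic information).
X = [
    [0.15, 0.25, 0.20, 0.15, 0.25],   # shuffled text
    [0.40, 0.30, 0.10, 0.10, 0.10],   # coherent reference
    [0.20, 0.20, 0.20, 0.20, 0.20],   # shuffled text
    [0.45, 0.25, 0.15, 0.10, 0.05],   # coherent reference
]
y = [0, 1, 0, 1]

# A large C approximates a hard margin on this tiny separable example.
model = SVC(kernel="linear", C=1000.0)
model.fit(X, y)
```

A new probability vector can then be scored with `model.predict`, which is one plausible realization of the comparison-and-scoring step.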
According to the method for training a text recognition model of the embodiment of the invention, the first training text with shuffled word order and the corresponding second training text with coherent word order are used to obtain the first training characteristic information and the second training characteristic information, which are input into the support vector machine so that a text recognition model can be formed to quickly recognize a text to be recognized. This significantly improves the efficiency of recognizing text coherence, can replace manual recognition of text coherence, and thereby saves a great deal of manual effort.
Example 2
An embodiment of the present invention provides a method for identifying text coherence, as shown in fig. 6, including:
And S61, acquiring a text to be recognized and a word order coherent text, wherein the word order coherent text is a reference text corresponding to the text to be recognized. The text to be recognized is a text whose coherence is to be determined; it may be a prose text, a history text, an English essay, a character-writing text, or another type of text, without limitation. The word order coherent text serves as the reference object for the text to be recognized.
And S62, generating first characteristic information according to the text to be recognized and generating second characteristic information according to the word order coherent text. The first characteristic information comes from the text to be recognized and represents text characteristic information whose word order coherence is not yet determined; the second characteristic information comes from the word order coherent text, represents text characteristic information with coherent word order, and corresponds to the text to be recognized.
In an embodiment, in the process of executing step S62, as shown in fig. 7, the method may specifically include the following steps:
step S621: and performing word segmentation on the text to be recognized to obtain a first word segmentation result, and performing word segmentation on the sequential text to obtain a second word segmentation result. The word segmentation includes that the word segmentation tool is mainly used for identifying the sentence structure or the part of speech of the word and segmenting the word. Specifically, the part-of-speech of each word and/or the sentence structure of each sentence in the first training text are identified to obtain a first segmentation result, and similarly, the second segmentation result is the same as the first segmentation result.
Step S622: a plurality of entity nouns in the text to be recognized are acquired according to the first word segmentation result, and a plurality of entity nouns in the word order coherent text are acquired according to the second word segmentation result. For example, a sentence in a text to be recognized is "I very much like this fish"; by recognizing the part of speech of each word, the first word segmentation result is personal pronoun, adverb, verb, and noun, with corresponding sentence components of subject, predicate, and object, and a plurality of entity nouns are obtained according to the first word segmentation result.
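A minimal sketch of steps S621-S622; the tiny part-of-speech dictionary below is a hypothetical stand-in for a real POS tagger, and the helper names are illustrative:

```python
# Hypothetical toy tagger covering only the example sentence.
POS = {
    "I": "pronoun", "very": "adverb", "much": "adverb",
    "like": "verb", "this": "determiner", "fish": "noun",
}

def segment(sentence):
    """First word segmentation result: (word, part of speech) pairs."""
    return [(word, POS.get(word, "unknown")) for word in sentence.split()]

def entity_nouns(segmented):
    """Keep nouns (and pronoun mentions, which refer to entity nouns)."""
    return [word for word, tag in segmented if tag in ("noun", "pronoun")]
```

For "I very much like this fish" this yields the mentions "I" and "fish", matching the subject and object identified above.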
Step S623: confirming the sentence structure type of each entity noun in the text to be recognized in at least two adjacent sentences, and confirming the sentence structure type of each entity noun in the language order coherent text in at least two adjacent sentences.
In an embodiment, step S623 determines the sentence structure type of each entity noun in the text to be recognized across two adjacent sentences. The sentence structure type is the structure type of each entity noun in two adjacent sentences, obtained by segmenting each word in the text to be recognized. For example, "He is Tom. Tom is a doctor." are two adjacent sentences. If the subject of an entity noun is represented by A, another component (or absence of the entity noun from the current sentence) by B, and the object or predicative by C, the sentence structure type of each entity noun across two adjacent sentences takes one of the following 9 types: AA, AB, AC, BA, BB, BC, CA, CB, CC. Therefore, the structure type of each entity noun in the text to be recognized across the two adjacent sentences can be confirmed according to the first word segmentation result.
In an embodiment, step S623 determines the sentence structure type of each entity noun in the text to be recognized across three adjacent sentences. The sentence structure type is the structure type of each entity noun in three adjacent sentences, obtained by segmenting each word in the text to be recognized. If the subject of an entity noun is represented by A, another component (or absence of the entity noun from the current sentence) by B, and the object or predicative by C, the sentence structure type of each entity noun across three adjacent sentences takes one of the following 27 types: AAA, AAC, AAB, ACA, ACC, ACB, ABA, ABC, ABB, CAA, CAC, CAB, CCA, CCC, CCB, CBA, CBC, CBB, BAA, BAC, BAB, BCA, BCC, BCB, BBA, BBC, BBB. Therefore, the structure type of each entity noun in the text to be recognized across the three adjacent sentences can be confirmed according to the first word segmentation result.
In an embodiment, step S623 determines the sentence structure type of each entity noun in the text to be recognized across two adjacent sentences that each contain a main clause and a subordinate clause. The sentence structure type is obtained by segmenting each word in the text to be recognized: the subject of the main clause is represented by AA, the object or predicative of the main clause by AC, another component of the main clause (or absence of the entity noun from the current sentence) by AB, the subject of the subordinate clause by CA, the object or predicative of the subordinate clause by CC, and another component of the subordinate clause (or absence from the current sentence) by CB. The sentence structure type of each entity noun across two adjacent sentences containing a main-subordinate order then takes one of the following 36 types: AABAA, AABAC, AABAB, AABCA, AABCC, AABCB, ACBAA, ACBAC, ACBAB, ACBCA, ACBCC, ACBCB, ABBAA, ABBAC, ABBAB, ABBCA, ABBCC, ABBCB, CABAA, CABAC, CABAB, CABCA, CABCC, CABCB, CCBAA, CCBAC, CCBAB, CCBCA, CCBCC, CCBCB, CBBAA, CBBAC, CBBAB, CBBCA, CBBCC, CBBCB.
Similarly, in step S623, the sentence structure type of each entity noun in the word order consecutive text in at least two adjacent sentences is determined. Here, the specific word segmentation process of the language order coherent text and the word segmentation process of the text to be recognized are not described herein again.
Step S624: a first transformation matrix of the text to be recognized is obtained according to the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and a second transformation matrix of the word order coherent text is obtained according to the sentence structure types of each entity noun in the word order coherent text in at least two adjacent sentences. For example, the text to be recognized comprises 6 sentences, and the sentence structure types of the entity nouns Tom, doctor, and art in two adjacent sentences are BAC, CBC, and CCB respectively.
For example, the word order coherent text is "He is Tom. Tom is a doctor. He loves art very much." The word order coherent text can be randomly shuffled, taking whole sentences as units, to serve as the text to be recognized, giving:
Tom is a doctor,he loves art very much,he is Tom.
firstly, word segmentation is performed on the three sentences, and a plurality of entity nouns are then obtained according to the word segmentation result, namely He, Tom, doctor, and art, whose sentence structure types in the three adjacent sentences are BAA, ABC, CBB, and BCB respectively.
The first transformation matrix of the text to be recognized is:
(He: BAA; Tom: ABC; doctor: CBB; art: BCB)
because the text to be recognized includes 3 sentences, the corresponding word order coherent text also includes 3 sentences, which serve as the reference for the text to be recognized.
firstly, word segmentation is performed on the three sentences, and a plurality of entity nouns are then obtained according to the word segmentation result, namely He, Tom, doctor, and art, whose sentence structure types in the three adjacent sentences are ABA, CAB, BCB, and BBC respectively.
The second transformation matrix of the language-sequential consecutive text is:
(He: ABA; Tom: CAB; doctor: BCB; art: BBC)
step S625: and according to the second transformation matrix, calculating a second probability matrix of each type of statement structure in the language sequence consecutive text to obtain second characteristic information. The specific calculation method here is the same as the calculation method of calculating the first probability matrix according to the first transformation matrix to obtain the first training feature information and calculating the second probability matrix according to the second transformation matrix to obtain the second training feature information in the method of training the text recognition model in embodiment 1, which is described in detail in embodiment 1 and is not described herein again.
And S63, inputting the first characteristic information and the second characteristic information into the text recognition model in the embodiment 1 to obtain result information corresponding to the text to be recognized, wherein the result information is used for recognizing whether the text to be recognized has continuity or not. The consistency of the text to be recognized can be rapidly recognized through the text recognition model, the text recognition efficiency is improved, and the manual workload is reduced.
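An end-to-end sketch of step S63, assuming the first and second characteristic information are the probability matrices computed above. Since the trained SVM of Embodiment 1 is not reproduced here, a hypothetical distance-based rule stands in for the model's scoring:

```python
def recognize_coherence(first_features, second_features, tol=0.1):
    """Treat the text to be recognized as coherent when its
    structure-type distribution stays close to that of the
    word order coherent reference text."""
    keys = set(first_features) | set(second_features)
    gap = sum(abs(first_features.get(k, 0.0) - second_features.get(k, 0.0))
              for k in keys)
    return gap <= tol
```

The result information is then simply the boolean returned for the pair of feature dictionaries.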
According to the method for recognizing text coherence of the embodiment of the invention, the first characteristic information of the text to be recognized and the second characteristic information of the corresponding word order coherent text are input into the text recognition model, so that the coherence of the text to be recognized can be rapidly recognized. This significantly improves the efficiency of recognizing text coherence, can replace manual recognition of text coherence, and thereby saves a great deal of manual effort.
Example 3
An embodiment of the present invention provides an apparatus for training a text recognition model, as shown in fig. 8, including:
the first obtaining module 81 is configured to obtain a first training text and a second training text, where the second training text is a reference training text corresponding to the first training text.
The extracting module 82 is configured to extract first training feature information from a first training text, and extract second training feature information from a second training text, where the first training feature information is a text feature with a disordered language sequence, and the second training feature information is a text feature with a coherent language sequence.
And the training module 83 is configured to train the support vector machine model by using the first training feature information and the second training feature information to obtain a text recognition model.
In fig. 8, the apparatus for training a text recognition model in the embodiment of the present invention, the extracting module 82 further includes:
the word segmentation sub-module 821 is used for performing word segmentation on the first training text to obtain a first word segmentation result, and performing word segmentation on the second training text to obtain a second word segmentation result;
an obtaining submodule 822, configured to obtain a plurality of entity nouns in the first training text according to the first word segmentation result, and obtain a plurality of entity nouns in the second training text according to the second word segmentation result; the confirming sub-module 823 is configured to confirm the sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and confirm the sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
a matrix transformation module 824, configured to obtain a first transformation matrix of the first training text according to the sentence structure type of each entity noun in the first training text in at least two adjacent sentences, and obtain a second transformation matrix of the second training text according to the sentence structure type of each entity noun in the second training text in at least two adjacent sentences;
and the matrix calculation module 825 is configured to calculate a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain first training feature information, and calculate a second probability matrix of each type of sentence structure in the second training text according to the second transformation matrix to obtain second training feature information.
In the apparatus for training a text recognition model in the embodiment of the present invention, the confirmation sub-module 823 further includes:
and the recognition unit is used for recognizing the part of speech of each word and/or the sentence structure of each sentence in the first training text to obtain a first word segmentation result, and recognizing the part of speech of each word and/or the sentence structure of each sentence in the second training text to obtain a second word segmentation result.
In the apparatus for training a text recognition model in the embodiment of the present invention, the matrix calculation module 825 further includes:
a first statistical unit for counting a first number of each type of sentence structure of the first training text and counting a second number of each type of sentence structure of the second training text;
the second statistical unit is used for counting a third number of the multiple types of sentence structures of the first training text, and counting a fourth number of the multiple types of sentence structures of the second training text;
and the calculating unit is used for calculating the proportion of the first quantity to the third quantity according to the first quantity and the third quantity to obtain a first probability matrix, and calculating the proportion of the second quantity to the fourth quantity according to the second quantity and the fourth quantity to obtain a second probability matrix.
In fig. 8, the training module 83 further includes:
the input sub-module 831 is configured to input the first training feature information and the second training feature information into the support vector machine model;
and a comparison sub-module 832, configured to obtain the text recognition model by comparing and scoring the first training feature information and the second training feature information.
According to the device for training a text recognition model of the embodiment of the invention, the first training text with shuffled word order and the corresponding second training text with coherent word order are used to obtain the first training characteristic information and the second training characteristic information, which are input into the support vector machine so that a text recognition model can be formed to quickly recognize a text to be recognized. This significantly improves the efficiency of recognizing text coherence, can replace manual recognition of text coherence, and thereby saves a great deal of manual effort.
Example 4
An embodiment of the present invention provides an apparatus for identifying text coherence, as shown in fig. 9, including:
the second obtaining module 91 is configured to obtain a text to be recognized and a word order consecutive text, where the word order consecutive text is a reference text corresponding to the text to be recognized;
the generating module 92 is configured to generate first feature information according to the text to be recognized, and generate second feature information according to the language order coherent text;
and the result determining module 93 is configured to obtain result information corresponding to the text to be recognized through the text recognition model, where the result information is used to identify whether the text to be recognized has consistency.
In the apparatus for identifying text consistency in the embodiment of the present invention, the generating module 92 further includes:
the word segmentation sub-module 921 is configured to perform word segmentation on the text to be recognized to obtain a first word segmentation result, and perform word segmentation on the word order coherent text to obtain a second word segmentation result;
the obtaining sub-module 922 is configured to obtain a plurality of entity nouns in the text to be recognized according to the first word segmentation result, and obtain a plurality of entity nouns in the word order consecutive text according to the second word segmentation result;
the confirming submodule 923 is used for confirming the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences according to the first word segmentation result, and confirming the sentence structure types of each entity noun in the language order coherent text in at least two adjacent sentences according to the second word segmentation result;
the matrix transformation submodule 924 is configured to obtain a first transformation matrix of the text to be recognized according to the sentence structure types of each entity noun in the text to be recognized in the at least two adjacent sentences, and obtain a second transformation matrix of the word order consecutive text according to the sentence structure types of each entity noun in the word order consecutive text in the at least two adjacent sentences;
the calculating sub-module 925 is configured to calculate a first probability matrix of each type of statement structure in the text to be recognized according to the first transformation matrix to obtain first feature information, and calculate a second probability matrix of each type of statement structure in the language sequence consecutive text according to the second transformation matrix to obtain second feature information.
According to the device for recognizing text coherence of the embodiment of the invention, the first characteristic information of the text to be recognized and the second characteristic information of the corresponding word order coherent text are input into the text recognition model, so that the coherence of the text to be recognized can be rapidly recognized. This significantly improves the efficiency of recognizing text coherence, can replace manual recognition of text coherence, and thereby saves a great deal of manual effort.
Example 5
An embodiment of the present invention provides a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of embodiment 1 or embodiment 2. The storage medium further stores a first training text and a second training text, first training characteristic information and second training characteristic information, a first word segmentation result, a second word segmentation result, a first transformation matrix and a second transformation matrix, a first probability matrix and a second probability matrix, and the like. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
Example 6
An embodiment of the present invention provides a text recognition apparatus, as shown in fig. 10, which includes a memory 1020, a processor 1010, and a computer program stored in the memory 1020 and executable on the processor 1010, where the processor 1010 implements the steps of the method in embodiment 1 or embodiment 2 when executing the program.
Fig. 10 is a schematic diagram of a hardware structure of a text recognition device for executing a processing method of list item operations according to an embodiment of the present invention, as shown in fig. 10, the text recognition device includes one or more processors 1010 and a memory 1020, and one processor 1010 is taken as an example in fig. 10.
The text recognition apparatus that performs the processing method of the list item operation may further include: an input device 1030 and an output device 1040.
The processor 1010, the memory 1020, the input device 1030, and the output device 1040 may be connected by a bus or other means, and fig. 10 illustrates an example of connection by a bus.
Processor 1010 may be a Central Processing Unit (CPU). The processor 1010 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications derived therefrom are intended to be within the scope of the invention.

Claims (9)

1. A method of training a text recognition model, comprising:
acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text;
extracting first training characteristic information from the first training text, and extracting second training characteristic information from the second training text, wherein the first training characteristic information is a text characteristic with disordered word sequences, and the second training characteristic information is a text characteristic with continuous word sequences;
training a support vector machine model by using the first training characteristic information and the second training characteristic information to obtain a text recognition model;
the step of extracting first training feature information from the first training text and extracting second training feature information from the second training text further comprises:
performing word segmentation on the first training text to obtain a first word segmentation result, and performing word segmentation on the second training text to obtain a second word segmentation result;
acquiring a plurality of entity nouns in the first training text according to the first word segmentation result, and acquiring a plurality of entity nouns in the second training text according to the second word segmentation result;
confirming sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and confirming sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
obtaining a first transformation matrix of the first training text according to the sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and obtaining a second transformation matrix of the second training text according to the sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
and calculating a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain the first training characteristic information, and calculating a second probability matrix of each type of sentence structure in the second training text according to a second transformation matrix to obtain the second training characteristic information.
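The transformation-matrix step of claim 1 resembles an entity-grid coherence model: for each entity noun, the sentence-structure types it takes in adjacent sentences form transitions, and the relative frequency of each transition type becomes a feature. A minimal sketch of that step, assuming four structure types (subject "S", object "O", other "X", absent "-"), which the patent does not enumerate:

```python
from collections import Counter

# Assumed structure types; the patent does not enumerate them.
ROLES = ["S", "O", "X", "-"]  # subject, object, other, absent

def transition_features(sentences, entities):
    """sentences: one {entity noun: structure type} dict per sentence.
    Returns the probability of each role transition across adjacent sentences."""
    transitions = Counter()
    for prev, curr in zip(sentences, sentences[1:]):   # adjacent sentence pairs
        for entity in entities:
            transitions[(prev.get(entity, "-"), curr.get(entity, "-"))] += 1
    total = sum(transitions.values()) or 1
    # Probability of each transition type = its count / total transitions
    return [transitions[(a, b)] / total for a in ROLES for b in ROLES]

feats = transition_features([{"cat": "S"}, {"cat": "O"}], {"cat"})
print(feats[ROLES.index("S") * len(ROLES) + ROLES.index("O")])  # 1.0
```

The flat list of transition probabilities stands in for the "probability matrix" of claim 1; any fixed ordering of the role pairs yields an equivalent feature vector.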
2. The method of claim 1, wherein the step of performing word segmentation on the first training text to obtain a first word segmentation result and performing word segmentation on the second training text to obtain a second word segmentation result further comprises:
and identifying the part of speech of each word and/or the sentence structure of each sentence in the first training text to obtain the first word segmentation result, and identifying the part of speech of each word and/or the sentence structure of each sentence in the second training text to obtain the second word segmentation result.
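Given the part-of-speech-tagged segmentation result of claim 2, the entity-noun step of claim 1 reduces to filtering by noun tags. A sketch assuming the output is (word, tag) pairs and the common ICTCLAS-style tag set (e.g. as produced by a tagger such as jieba.posseg); neither assumption is stated in the patent:

```python
# Noun tags under the assumed ICTCLAS-style convention:
# n = common noun, nr = person name, ns = place, nt = organisation, nz = other proper noun
NOUN_TAGS = {"n", "nr", "ns", "nt", "nz"}

def extract_entity_nouns(tagged_words):
    """Keep only the entity nouns from a tagged word segmentation result."""
    return [word for word, tag in tagged_words if tag in NOUN_TAGS]

# Hypothetical tagged segmentation of one sentence
tagged = [("小明", "nr"), ("在", "p"), ("图书馆", "n"), ("看书", "v")]
print(extract_entity_nouns(tagged))  # ['小明', '图书馆']
```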
3. The method of training a text recognition model according to claim 1, wherein the step of calculating a first probability matrix for each type of sentence structure in the first training text according to the first transformation matrix to obtain the first training feature information and calculating a second probability matrix for each type of sentence structure in the second training text according to a second transformation matrix to obtain the second training feature information further comprises:
counting a first number of each type of sentence structure of the first training text, and counting a second number of each type of sentence structure of the second training text;
counting a third number of all types of sentence structures of the first training text, and counting a fourth number of all types of sentence structures of the second training text;
calculating a ratio of the first number to the third number based on the first number and the third number to obtain the first probability matrix, and calculating a ratio of the second number to the fourth number based on the second number and the fourth number to obtain the second probability matrix.
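The counting steps of claim 3 amount to normalising per-type counts by the total count. A minimal sketch (the per-type counts are the first/second numbers, the total is the third/fourth number):

```python
from collections import Counter

def structure_probabilities(structure_types):
    """Probability of each sentence-structure type = its count / total count."""
    counts = Counter(structure_types)      # per-type count (first/second number)
    total = len(structure_types) or 1      # count over all types (third/fourth number)
    return {t: c / total for t, c in counts.items()}

print(structure_probabilities(["S", "O", "S", "X"]))
# {'S': 0.5, 'O': 0.25, 'X': 0.25}
```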
4. The method of claim 1, wherein the step of training a support vector machine model by using the first training feature information and the second training feature information to obtain a text recognition model further comprises:
inputting the first training feature information and the second training feature information into the support vector machine model;
and comparing and scoring the first training characteristic information and the second training characteristic information to obtain the text recognition model.
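Claim 4 does not specify how the two feature vectors are "compared and scored". One common realisation, assumed here for illustration, is a pairwise ranking SVM trained on the difference between the coherent-text and disordered-text feature vectors (scikit-learn's LinearSVC stands in for the support vector machine model):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_coherence_svm(coherent_feats, disordered_feats):
    """Pairwise ranking sketch: label coherent-minus-disordered differences 1,
    the reversed differences 0, and fit a linear SVM on both."""
    diffs = np.asarray(coherent_feats) - np.asarray(disordered_feats)
    X = np.vstack([diffs, -diffs])
    y = np.array([1] * len(diffs) + [0] * len(diffs))
    return LinearSVC(C=1.0).fit(X, y)

rng = np.random.default_rng(0)
coh = rng.normal(0.6, 0.1, size=(20, 16))   # toy coherent-text feature vectors
dis = rng.normal(0.4, 0.1, size=(20, 16))   # toy disordered-text feature vectors
model = train_coherence_svm(coh, dis)
# A pair where the coherent text comes first should be classed as coherent
print(model.predict([coh[0] - dis[0]])[0])
```

The difference-vector construction is one standard way to turn a pairwise comparison into a binary classification problem; the patent itself only states that the two sets of characteristic information are compared and scored.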
5. A method for identifying text coherence, comprising:
acquiring a text to be recognized and a word-order-coherent text, wherein the word-order-coherent text is a reference text corresponding to the text to be recognized;
generating first characteristic information according to the text to be recognized, and generating second characteristic information according to the word-order-coherent text;
obtaining result information corresponding to the text to be recognized by inputting the first characteristic information and the second characteristic information into the text recognition model of any one of claims 1-4, wherein the result information is used for recognizing whether the text to be recognized has coherence; the step of generating first characteristic information according to the text to be recognized and generating second characteristic information according to the word-order-coherent text further comprises:
performing word segmentation on the text to be recognized to obtain a first word segmentation result, and performing word segmentation on the word-order-coherent text to obtain a second word segmentation result;
acquiring a plurality of entity nouns in the text to be recognized according to the first word segmentation result, and acquiring a plurality of entity nouns in the word-order-coherent text according to the second word segmentation result;
confirming the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and confirming the sentence structure types of each entity noun in the word-order-coherent text in at least two adjacent sentences;
obtaining a first transformation matrix of the text to be recognized according to the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and obtaining a second transformation matrix of the word-order-coherent text according to the sentence structure types of each entity noun in the word-order-coherent text in at least two adjacent sentences;
and calculating a first probability matrix of each type of sentence structure in the text to be recognized according to the first transformation matrix to obtain the first characteristic information, and calculating a second probability matrix of each type of sentence structure in the word-order-coherent text according to the second transformation matrix to obtain the second characteristic information.
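At inference time (claim 5), the two characteristic vectors are fed to the trained model. A hypothetical wrapper, assuming the model scores the difference of the two probability feature vectors; the names below are illustrative and not from the patent:

```python
def is_coherent(model, feats_to_recognize, feats_reference):
    """True if the model classes the text to be recognized as coherent."""
    diff = [a - b for a, b in zip(feats_to_recognize, feats_reference)]
    return int(model.predict([diff])[0]) == 1

# Trivial stand-in classifier: labels a pair coherent when the mean
# feature difference is non-negative. Any classifier exposing a
# scikit-learn-style predict() could be substituted.
class MeanSignModel:
    def predict(self, X):
        return [1 if sum(row) >= 0 else 0 for row in X]

print(is_coherent(MeanSignModel(), [0.5, 0.3], [0.4, 0.2]))  # True
```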
6. An apparatus for training a text recognition model, comprising:
the first acquisition module is used for acquiring a first training text and a second training text, wherein the second training text is a reference training text corresponding to the first training text;
the extraction module is used for extracting first training characteristic information from the first training text and extracting second training characteristic information from the second training text, wherein the first training characteristic information is a text feature of a text with disordered word order, and the second training characteristic information is a text feature of a text with coherent word order;
the training module is used for training a support vector machine model by using the first training characteristic information and the second training characteristic information to obtain a text recognition model;
the extraction module further comprises:
the word segmentation sub-module is used for carrying out word segmentation on the first training text to obtain a first word segmentation result and carrying out word segmentation on the second training text to obtain a second word segmentation result;
the obtaining submodule is used for obtaining a plurality of entity nouns in the first training text according to the first word segmentation result and obtaining a plurality of entity nouns in the second training text according to the second word segmentation result;
the confirming submodule is used for confirming the sentence structure type of each entity noun in the first training text in at least two adjacent sentences, and confirming the sentence structure type of each entity noun in the second training text in at least two adjacent sentences;
the matrix transformation module is used for obtaining a first transformation matrix of the first training text according to the sentence structure types of each entity noun in the first training text in at least two adjacent sentences, and obtaining a second transformation matrix of the second training text according to the sentence structure types of each entity noun in the second training text in at least two adjacent sentences;
and the matrix calculation module is used for calculating a first probability matrix of each type of sentence structure in the first training text according to the first transformation matrix to obtain the first training characteristic information, and calculating a second probability matrix of each type of sentence structure in the second training text according to the second transformation matrix to obtain the second training characteristic information.
7. An apparatus for identifying text coherence, comprising:
the second acquisition module is used for acquiring a text to be recognized and a word-order-coherent text, wherein the word-order-coherent text is a reference text corresponding to the text to be recognized;
the generating module is used for generating first characteristic information according to the text to be recognized and generating second characteristic information according to the word-order-coherent text;
the result determining module is used for obtaining result information corresponding to the text to be recognized by inputting the first characteristic information and the second characteristic information into a text recognition model, wherein the result information is used for recognizing whether the text to be recognized has coherence;
the generation module further comprises:
the word segmentation sub-module is used for performing word segmentation on the text to be recognized to obtain a first word segmentation result, and performing word segmentation on the word-order-coherent text to obtain a second word segmentation result;
the obtaining sub-module is used for obtaining a plurality of entity nouns in the text to be recognized according to the first word segmentation result, and obtaining a plurality of entity nouns in the word-order-coherent text according to the second word segmentation result;
the confirming sub-module is used for confirming the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences according to the first word segmentation result, and confirming the sentence structure types of each entity noun in the word-order-coherent text in at least two adjacent sentences according to the second word segmentation result;
the matrix transformation sub-module is used for obtaining a first transformation matrix of the text to be recognized according to the sentence structure types of each entity noun in the text to be recognized in at least two adjacent sentences, and obtaining a second transformation matrix of the word-order-coherent text according to the sentence structure types of each entity noun in the word-order-coherent text in at least two adjacent sentences;
and the calculation sub-module is used for calculating a first probability matrix of each type of sentence structure in the text to be recognized according to the first transformation matrix to obtain the first characteristic information, and calculating a second probability matrix of each type of sentence structure in the word-order-coherent text according to the second transformation matrix to obtain the second characteristic information.
8. A storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the method of training a text recognition model according to any one of claims 1 to 4; or, implementing the steps of the method of identifying text coherence of claim 5.
9. A text recognition apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of training a text recognition model according to any one of claims 1 to 4 when executing the program; or, implementing the steps of the method of identifying text coherence of claim 5.
CN201910147725.6A 2019-02-27 2019-02-27 Method and device for training text recognition model and text continuity Active CN110134940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910147725.6A CN110134940B (en) 2019-02-27 2019-02-27 Method and device for training text recognition model and text continuity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910147725.6A CN110134940B (en) 2019-02-27 2019-02-27 Method and device for training text recognition model and text continuity

Publications (2)

Publication Number Publication Date
CN110134940A CN110134940A (en) 2019-08-16
CN110134940B true CN110134940B (en) 2023-04-07

Family

ID=67568504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910147725.6A Active CN110134940B (en) 2019-02-27 2019-02-27 Method and device for training text recognition model and text continuity

Country Status (1)

Country Link
CN (1) CN110134940B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046658B (en) * 2019-12-18 2023-05-09 支付宝(杭州)信息技术有限公司 Method, device and equipment for recognizing disorder text
CN111428470B (en) * 2020-03-23 2022-04-22 北京世纪好未来教育科技有限公司 Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN114004234A (en) * 2020-07-28 2022-02-01 深圳Tcl数字技术有限公司 Semantic recognition method, storage medium and terminal equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103294663A (en) * 2013-05-03 2013-09-11 苏州大学 Text coherence detection method and device
CN107341143A (en) * 2017-05-26 2017-11-10 北京奇艺世纪科技有限公司 A kind of sentence continuity determination methods and device and electronic equipment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9837069B2 (en) * 2015-12-22 2017-12-05 Intel Corporation Technologies for end-of-sentence detection using syntactic coherence

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN103294663A (en) * 2013-05-03 2013-09-11 苏州大学 Text coherence detection method and device
CN107341143A (en) * 2017-05-26 2017-11-10 北京奇艺世纪科技有限公司 A kind of sentence continuity determination methods and device and electronic equipment

Non-Patent Citations (1)

Title
Entity-driven bidirectional LSTM modeling of discourse coherence; Du Shujing et al.; Journal of Chinese Information Processing (《中文信息学报》); 2017-11-15; Vol. 31, No. 06; pp. 67-74 *

Also Published As

Publication number Publication date
CN110134940A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
JP6150282B2 (en) Non-factoid question answering system and computer program
RU2607975C2 (en) Constructing corpus of comparable documents based on universal measure of similarity
CN110134940B (en) Method and device for training text recognition model and text continuity
TW201222291A (en) Method and device for providing text segmentation results with multiple granularity levels
Atia et al. Increasing the accuracy of opinion mining in Arabic
CN108804418B (en) Document duplicate checking method and device based on semantic analysis
CN110297893A (en) Natural language question-answering method, device, computer installation and storage medium
CN109101518A (en) Phonetic transcription text quality appraisal procedure, device, terminal and readable storage medium storing program for executing
JP2020113129A (en) Document evaluation device, document evaluation method, and program
CN113282762A (en) Knowledge graph construction method and device, electronic equipment and storage medium
JP2011150515A (en) Text summarizing system, method of summarizing text, and text summarizing program
Das et al. Comparison of different graph distance metrics for semantic text based classification
Villavicencio et al. Discovering multiword expressions
KR102400689B1 (en) Semantic relation learning device, semantic relation learning method, and semantic relation learning program
Bestgen CECL: a new baseline and a non-compositional approach for the sick benchmark
JP5447368B2 (en) NEW CASE GENERATION DEVICE, NEW CASE GENERATION METHOD, AND NEW CASE GENERATION PROGRAM
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
CN115146025A (en) Question and answer sentence classification method, terminal equipment and storage medium
KR20200057206A (en) Method and system for automatic visualization of the untold information
WO2008017188A1 (en) System and method for making teaching material of language class
US10445353B2 (en) Sentence retrieval method and sentence retrieval system
Vázquez-González et al. Creating a corpus of historical documents for emotions identification
JP2006190072A (en) Automatic paraphrasing apparatus, automatic paraphrasing method and paraphrasing process program
KR102609227B1 (en) Method and apparatus for detecting safety information via artificial intelligence from electronic document
CN114707489B (en) Method and device for acquiring annotation data set, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant