CN113742445B - Text recognition sample obtaining method and device and text recognition method and device - Google Patents
- Publication number
- CN113742445B (application CN202110807246.XA)
- Authority
- CN
- China
- Prior art keywords
- sample
- text
- causal
- text recognition
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text recognition sample acquisition method, a text recognition method, and corresponding devices. The acquisition method comprises inputting an initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model. The sample generation model is obtained by dual learning based on a generator and a recognizer, so that a plurality of high-quality text recognition samples can be obtained; each text recognition sample expresses the causal or non-causal relationship between two events in a different semantic way, and training a text recognition model on these high-quality samples can improve its recognition accuracy. Moreover, because a plurality of text recognition samples are obtained from each initial text sample, the number and scale of samples are enlarged, so that a text recognition model trained on them can accurately learn the causal semantic relations of events in the samples, which improves the recognition precision of the model.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for obtaining a text recognition sample and recognizing a text.
Background
Event causality identification (ECI) aims to identify causal relationships among events in a text, and can provide important clues for many natural language processing (NLP) tasks, such as logical reasoning and question answering. The ECI task is typically modeled as a classification problem that determines whether a causal relationship exists between two events in a sentence. For example, an ECI system is required to identify the causal relationship between the "attack" and "death" events in the following sentence: "Character A was a young man who liked football, and soon after a fierce game he lost his life in an attack."
Most current ECI methods employ a supervised learning paradigm. Although these methods achieve good performance, they usually require large-scale labeled training data. However, existing event causality identification data sets are small. The definition of causality has no uniform framework, from either the cognitive or the linguistic perspective, which keeps the existing data sets relatively small; for example, the EventStoryLine data set, the one most commonly used for the task, contains only 258 documents, 4316 sentences and 1770 causal event pairs. Such small labeled data sets hinder the training of high-performance event causality identification models, because they cannot provide enough training data for a model to accurately understand the event relation semantics in text. The lack of training data is therefore an important problem to be solved in event causality identification.
Disclosure of Invention
The invention provides a text recognition sample obtaining method and a text recognition method and device, which are used for solving the defect that the training effect of a recognition model is influenced due to the fact that the scale of a training data set for event cause and effect relationship recognition is small in the prior art.
The invention provides a text recognition sample acquisition method, which comprises the following steps:
determining an initial text sample, the initial text sample comprising at least two events and a causal relationship or a non-causal relationship between the two events;
inputting the initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways;
the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and a recognizer, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the recognizer is used for recognizing causal relationships of events in the enhanced sample.
According to the text recognition sample acquisition method provided by the invention, the sample text is determined based on the following steps:
determining an initial sample text from a preset database, wherein the initial sample text comprises a first event and a second event;
calculating a causal distance between the first event and the second event, and if the causal distance is smaller than a preset value, taking the initial sample text as the sample text; the causal distance is used to characterize a degree of causal correlation between the first event and the second event.
According to the text recognition sample acquisition method provided by the invention, the calculating of the causal distance between the first event and the second event comprises the following steps:
constructing a causal representation space based on the initial sample text, and mapping the first event and the second event into a first vector and a second vector in the causal representation space respectively based on the initial causal relationship;
determining the causal distance based on the first vector and the second vector.
According to the text recognition sample acquisition method provided by the invention, the causal distance is calculated based on the following formula:
wherein L represents the causal distance; e_i and e_j represent the representations of the first and second of two causally related events; e′_i and e′_j represent the representations of the first and second of two events that are not causally related; λ represents the threshold on the inter-vector distance that characterizes causal correlation; d represents the inter-vector distance calculation function; T represents the set of causally related events; and T′ represents the set of non-causally related events.
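The formula itself is not reproduced in this text. As a minimal sketch only, the symbols defined above (a threshold λ, a distance function d, a causal set T and a non-causal set T′) are consistent with a margin-based ranking objective; that form is an assumption, not the confirmed formula. The Python sketch below sums max(0, λ + d(e_i, e_j) - d(e′_i, e′_j)) over all pairs drawn from T and T′. The function name `margin_objective` and the toy vectors are hypothetical.

```python
import math

def margin_objective(causal_pairs, non_causal_pairs, margin):
    """Assumed form: sum of max(0, margin + d(e_i, e_j) - d(e'_i, e'_j)).

    d is taken to be the Euclidean distance, which is also an assumption here.
    """
    total = 0.0
    for e_i, e_j in causal_pairs:
        for n_i, n_j in non_causal_pairs:
            gap = margin + math.dist(e_i, e_j) - math.dist(n_i, n_j)
            total += max(0.0, gap)
    return total

# Toy vectors: the causal pair is 1 apart, the non-causal pair is 5 apart.
T = [((0.0, 0.0), (0.0, 1.0))]
T_prime = [((0.0, 0.0), (3.0, 4.0))]

loss_separated = margin_objective(T, T_prime, margin=2.0)  # 2 + 1 - 5 < 0
loss_violated = margin_objective(T, T_prime, margin=6.0)   # 6 + 1 - 5 = 2
```

Under this assumed form, the objective is zero once every causal pair is closer than every non-causal pair by at least the threshold.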
According to the text recognition sample acquisition method provided by the invention, the determining of the initial sample text from the preset database comprises the following steps:
determining an original sample text from the preset database, and extracting a first sample event, a second sample event and a sample causal relationship in the original sample text, wherein the sample causal relationship is the causal relationship between the first sample event and the second sample event;
carrying out synonym expansion on the sample causal relationship, and/or replacing the first sample event with any event, and/or replacing the second sample event with any event to obtain the initial sample text.
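The expansion step above can be sketched as follows. This is an illustrative Python sketch, assuming the extracted sample is held as a (first event, relation, second event) triple; the function name `expand_sample`, the synonym table and the replacement events are hypothetical and not taken from the source.

```python
from itertools import product

def expand_sample(first_event, relation, second_event,
                  relation_synonyms, first_candidates=(), second_candidates=()):
    """Return every triple reachable by synonym expansion and event replacement."""
    relations = [relation] + list(relation_synonyms.get(relation, []))
    firsts = [first_event] + list(first_candidates)
    seconds = [second_event] + list(second_candidates)
    return list(product(firsts, relations, seconds))

# Hypothetical synonym table and one replacement for the second event.
synonyms = {"causes": ["leads to", "results in"]}
expanded = expand_sample("attack", "causes", "death", synonyms,
                         second_candidates=["injury"])
```

Replacing an event can weaken or break the causal correlation of the pair, which is why the resulting initial sample texts are subsequently screened by causal distance.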
The invention also provides a text recognition method, which comprises the following steps:
determining a text to be recognized;
inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of an event in the text to be recognized;
the text recognition model is obtained by training based on the recognizer of the sample generation model, wherein the text recognition sample is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
The invention provides a text recognition sample acquisition device, which comprises:
an initial sample determining unit, configured to determine an initial text sample, where the initial text sample includes at least two events and a causal relationship or a non-causal relationship between the two events;
the identification sample generation unit is used for inputting the initial text sample into a sample generation model to obtain a plurality of text identification samples output by the sample generation model; each text recognition sample comprises the two events, and the text recognition samples express causal relation or non-causal relation between the two events in different semantic ways;
the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and a recognizer, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the recognizer is used for recognizing causal relationships of events in the enhanced sample.
The present invention also provides a text recognition apparatus, comprising:
the text determining unit is used for determining a text to be recognized;
the text recognition unit is used for inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of events in the text to be recognized;
the text recognition model is obtained by training based on the recognizer of the sample generation model, wherein the text recognition sample is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
The invention further provides an electronic device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the above text recognition sample acquisition methods when executing the computer program, and/or implements the steps of any of the above text recognition methods when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the text recognition sample acquisition method as described in any of the above, and/or which, when being executed by a processor, carries out the steps of the text recognition method as described in any of the above.
According to the text recognition sample acquisition and text recognition methods and devices provided by the invention, the sample generation model is obtained by dual learning based on the generator and the recognizer, so that a plurality of high-quality text recognition samples can be obtained through the sample generation model. Each text recognition sample expresses the causal or non-causal relationship between two events in a different semantic way, so the recognition accuracy of the text recognition model can be improved when these high-quality samples are used to train it. Moreover, because a plurality of text recognition samples are obtained from each initial text sample, the number and scale of samples are enlarged, so that a text recognition model trained on them can accurately learn the causal semantic relations of events in the samples, which improves the recognition precision of the model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a text recognition sample acquisition method provided by the present invention;
FIG. 2 is a schematic diagram of generator sentence generation provided by the present invention;
FIG. 3 is a schematic flow chart of the dual learning of the generator and the recognizer provided by the present invention;
FIG. 4 is a flow chart illustrating a text recognition method provided by the present invention;
FIG. 5 is a flow chart of text recognition based on the event data enhancement framework provided by the present invention;
FIG. 6 is a schematic structural diagram of a text recognition sample acquiring apparatus provided in the present invention;
FIG. 7 is a schematic structural diagram of a text recognition apparatus provided in the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At the core of a training sentence for the ECI task are two causally related events; this property is called "causality" (for example, the "attack" and "death" events are causally related, while there is little causal correlation between the "attack" and "birth" events). Therefore, how to acquire a large number of different causally related events is a fundamental problem in generating new training data for the task. Existing knowledge bases contain a large amount of causality-related knowledge and can provide sufficient resources for obtaining causally related events.
However, causal events alone are not complete training data; a context that meets the language specification is also required to express the causal semantics of the events. That is, the training data must also be "normative", which includes: a) syntax that complies with the language specification; b) event-related entities that are logically consistent and carry semantic roles; c) connecting words that express complete causal semantics. Therefore, how to construct causal sentences that meet the language specification for causally related events is the key to generating new training data for the ECI task. Currently, there are two methods for generating new training data for the ECI task: 1) remote supervision: finding sentences that express the causal semantics of causally related events in documents related to the labeled events; 2) constrained generation: generating sentences that express the causal semantics of causally related events based on those events.
Analysis shows that automatically labeled training data is of relatively low quality, and that most natural language processing data enhancement methods, including remote supervision, are independent of the task framework and generate all new training data in one pass. In these frameworks, data generation and the target task are modeled independently, so the generated data lacks task-related features such as linguistic expressions and knowledge. How to model data enhancement and the ECI task interactively, and thereby generate new training data that is of higher quality and more task-relevant, is therefore a problem to be solved urgently for the ECI task.
In view of the above, the present invention provides a text recognition sample acquisition method. Fig. 1 is a schematic flow chart of a text recognition sample acquisition method provided by the present invention, and as shown in fig. 1, the method includes the following steps:
Specifically, the initial text sample refers to text information that includes two events with a causal or non-causal relationship. For example, the "attack" and "death" events are causally related, and can therefore serve as causal events; there is little causal correlation between the "attack" and "birth" events, so they can be regarded as non-causal events.
The initial text sample may be text data obtained from a public data set (e.g., an EventStoryLine data set), may also be text data randomly input by a user, and may also be text data obtained according to voice recognition of the user, which is not specifically limited in this embodiment of the present invention.
The sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and a recognizer, the generator is used for performing data enhancement on a sample text to generate an enhanced sample, and the recognizer is used for recognizing the causal relationship of events in the enhanced sample.
Specifically, if the initial text samples are obtained from a public data set, the limited number of samples in that data set keeps the number of obtainable initial text samples small; if such a small set of initial text samples is used to train a subsequent text recognition model, it cannot provide enough training data for the model to accurately understand the event relation semantics in text. If text data randomly input by a user, or text data obtained from speech recognition of the user, is used as training data, that text may not meet the language specification (for example, it may violate grammar, be illogical, or fail to fully express the causal or non-causal relationship of the events), so that the trained text recognition model cannot accurately recognize the causal relationship of events in a text.
Therefore, the embodiment of the invention inputs the initial text sample into the sample generation model to obtain a plurality of text recognition samples output by the sample generation model. Each output text recognition sample contains the two events and expresses the causal or non-causal relationship between them in a different semantic way, so the scale of the training samples for the text recognition model is enlarged relative to the original initial text samples. Because the sample generation model is obtained by dual learning based on a generator and a recognizer, each output text recognition sample conforms to the language specification, and the recognition accuracy of the text recognition model can be improved when these samples are used to train it.
In addition, the sample generation model is obtained by dual learning based on a generator and a recognizer: the generator performs data enhancement on the sample text to generate an enhanced sample, and the recognizer recognizes the causal relationship of the events in the enhanced sample. That is, "generating the enhanced sample" is the primary task and "recognizing the causal relationship of events in the enhanced sample" is the dual task. The embodiment of the invention thus uses a dual learning mechanism in which, based on the introduced causally related events, the recognizer and the generator constrain each other to generate high-quality text recognition samples, which improves the performance of event causality identification.
Before the initial text sample is input into the sample generation model, the sample generation model may be obtained through pre-training, and specifically, the following steps may be performed: firstly, a large amount of sample texts are collected, and the corresponding causal relationship or non-causal relationship is determined through manual marking. And secondly, training the initial model based on the sample text and the causal relationship or the non-causal relationship in the sample text to obtain a sample generation model.
It can be understood that the text recognition samples can be used to train a text recognition model so that it accurately recognizes the causal relationship of events in a text, and can also be used to train a text extraction model, for example for text abstract extraction, in which keywords in the text are extracted as abstract information, improving the efficiency and precision of abstract extraction. The text recognition samples can also be used in a question-answering system: for example, a keyword extraction model trained on the text recognition samples can accurately extract keywords from the text or voice input by the user, and the question-answering system searches based on the extracted keywords and outputs the corresponding answer information, further improving question-answering efficiency and precision.
According to the text recognition sample acquisition method provided by the embodiment of the invention, the sample generation model is obtained by dual learning based on the generator and the recognizer, so that a plurality of high-quality text recognition samples can be obtained through the sample generation model. Each text recognition sample expresses the causal or non-causal relationship between two events in a different semantic way, so the recognition accuracy of the text recognition model can be improved when these high-quality samples are used to train it. Moreover, the embodiment of the invention obtains a plurality of text recognition samples from each initial text sample, enlarging the number and scale of samples, so that a text recognition model trained on them can accurately learn the causal semantic relations of events in the samples, which improves the recognition precision of the model.
As shown in fig. 2, the generator generates "normative" enhanced samples in three stages: 1) related-entity allocation, which ensures that the entities filling the different semantic roles of the events in the sentence are logically reasonable; 2) sentence completion, which ensures that the sentence fully expresses the causal or non-causal relationship; 3) sentence filtering, which filters the sentences based on perplexity and similar distance to ensure the quality and diversity of the generated sentences.
The cosine similarity used for related-entity allocation is calculated as follows:
where ε(ent) represents the cosine similarity, ent represents the set of candidate entities, ω represents a single token in an entity, and ε(ω) represents the vector representation of each token.
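The formula is not reproduced in this text. As a hedged illustration of how token vectors and cosine similarity could be combined to allocate an entity, the Python sketch below pools each entity's token vectors by averaging and picks the candidate closest to a reference entity. The mean pooling, the notion of a reference entity, and all names and vectors are assumptions made for illustration.

```python
import math

def mean_vector(token_vectors):
    """Average the token vectors of one entity (assumed pooling)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def allocate_entity(reference_tokens, candidates):
    """Pick the candidate entity whose pooled vector is most similar."""
    reference = mean_vector(reference_tokens)
    return max(candidates,
               key=lambda name: cosine_similarity(reference,
                                                  mean_vector(candidates[name])))

# Toy two-dimensional token vectors (hypothetical).
reference_tokens = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "footballer": [[0.9, 0.1]],
    "stadium roof": [[0.0, 1.0], [0.1, 0.9]],
}
chosen = allocate_entity(reference_tokens, candidates)
```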
The perplexity is calculated as follows:
wherein PPL(s′) represents the perplexity, T(s′) represents the set of tokens newly generated in the generated sample s′, t represents a single newly generated token in T(s′), and p(t) represents the generation probability of t.
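The formula is not reproduced in this text. For orientation, the degree of confusion (perplexity, PPL) is commonly defined as the exponential of the negative mean log-probability of the tokens; the Python sketch below implements that textbook definition over the newly generated tokens, which may differ from the exact formula intended here. The function name is hypothetical.

```python
import math

def perplexity(token_probs):
    """Textbook perplexity: exp(-mean(log p(t))) over the given tokens."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

# Probabilities p(t) of the newly generated tokens (toy values).
ppl_uncertain = perplexity([0.5, 0.5])
ppl_confident = perplexity([1.0])
```

Under this definition a lower value means the language model finds the generated tokens more predictable.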
The similar distance is calculated as follows:
wherein DIS(s′, D_m) represents the similar distance, D_m represents m randomly drawn labeled training samples, s represents a single sample in D_m, s(s′) represents the vector representation of the newly generated sample s′, and s(s) represents the vector representation of the labeled sample s.
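The formula is not reproduced in this text. One plausible reading, assumed here purely for illustration, is the average cosine distance (one minus cosine similarity) between the vector of the new sample and the vectors of the m labeled samples. The Python sketch below implements that reading; the names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar_distance(new_vector, labeled_vectors):
    """Assumed form: mean of (1 - cosine similarity) over the m labeled samples."""
    distances = [1.0 - cosine_similarity(new_vector, v) for v in labeled_vectors]
    return sum(distances) / len(distances)

# The new sample matches the first labeled sample and is orthogonal to the second.
dis = similar_distance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```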
The weighted filtering score is calculated as follows:
Score(s′) = μ · PPL(s′) + (1 - μ) · DIS(s′, D_m)
where Score(s′) represents the weighted filtering value and μ represents the weight.
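The weighted score above can be computed directly once PPL and DIS are available. The Python sketch below is a minimal illustration; the function name, the candidate values and the choice μ = 0.5 are hypothetical, and the direction of the subsequent filtering (which scores are kept) is not specified by the text, so the sketch only computes scores.

```python
def filter_score(ppl, dis, mu):
    """Score(s') = mu * PPL(s') + (1 - mu) * DIS(s', D_m)."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu is expected to lie in [0, 1]")
    return mu * ppl + (1.0 - mu) * dis

# (PPL, DIS) for two generated sentences; toy values.
candidates = {"s1": (2.0, 0.5), "s2": (4.0, 0.1)}
scores = {name: filter_score(ppl, dis, mu=0.5)
          for name, (ppl, dis) in candidates.items()}
```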
As shown in fig. 3, dual learning jointly models the knowledge-guided constrained sentence generator and the event causal relationship recognizer: it iteratively optimizes the recognizer and the generator, generates new task-related training data, and then further trains the recognizer with the new data. In the primary cycle, the generator G generates a sentence s′ expressing the causal or non-causal relationship c between the two events of the input event pair ep, and receives a reward R characterizing the output quality of the current system; R consists of a semantic alignment reward R_s, which characterizes the output quality of the generator itself, and a causal reward R_c, which characterizes the output quality of the recognizer I. Likewise, in the dual cycle, the recognizer I recognizes the causal or non-causal relationship c′ between the two events of the input event pair ep based on the input sentence s, and receives a reward R consisting of a causal reward R_c, which characterizes the output quality of the recognizer itself, and a semantic alignment reward R_s, which characterizes the output quality of the generator G.
If the generated sentence clearly expresses the relationship between the events of the input event pair, the recognizer can more easily understand the relation semantics expressed by the sentence. The accuracy of the causal classification is therefore used as the causal reward, which assesses the quality of the sentences generated by the current system and is also used to tune and optimize the recognizer itself. The causal reward formula is as follows:
wherein R_c(ep, s) represents the causal reward, and p(c′ | s; θ_I) represents the probability of recognizing the causal relationship between the two events in the sample s.
The semantics of the generated sentence should be consistent with the relationship between the input events. Further, if the relationship between the input events can be classified more accurately, the semantics of the newly generated sentence are considered more likely to be consistent with the input relationship. The degree of semantic alignment is therefore measured by the probability of generating a sentence whose semantics match the input relationship, which defines the semantic alignment reward, formulated as follows:
wherein R_s(ep, s) represents the semantic alignment reward, p(s′ | c; θ_G) represents the probability of generating a new sample s′ based on the causal relationship c, and p(t | c; θ_G) represents the probability of generating each token in the new sample s′.
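The two reward formulas are not reproduced in this text. The Python sketch below shows one way the described quantities could be wired together, under loudly stated assumptions: the causal reward is taken as the probability the recognizer assigns to the intended relation, the semantic alignment reward as the mean per-token generation probability, and the total reward as a weighted sum with a hypothetical weight `alpha`. None of these choices is confirmed by the source.

```python
def causal_reward(relation_probs, intended_relation):
    """Assumed R_c: recognizer probability of the intended relation."""
    return relation_probs[intended_relation]

def semantic_alignment_reward(token_probs):
    """Assumed R_s: mean generation probability of the tokens of s'."""
    return sum(token_probs) / len(token_probs)

def total_reward(r_c, r_s, alpha=0.5):
    """Assumed combination: R = alpha * R_c + (1 - alpha) * R_s."""
    return alpha * r_c + (1.0 - alpha) * r_s

# Toy outputs of a recognizer and a generator for one generated sentence.
recognizer_output = {"causal": 0.8, "non-causal": 0.2}
generator_token_probs = [0.6, 0.9, 0.9]

r_c = causal_reward(recognizer_output, "causal")
r_s = semantic_alignment_reward(generator_token_probs)
reward = total_reward(r_c, r_s)
```

In a full system the reward would drive the parameter updates of the generator and the recognizer; that update step is omitted here.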
Based on the above embodiment, the sample text is determined based on the following steps:
determining an initial sample text from a preset database, wherein the initial sample text comprises a first event and a second event;
calculating a causal distance between the first event and the second event, and if the causal distance is smaller than a preset value, taking the initial sample text as a sample text; the causal distance is used to characterize a degree of causal correlation between the first event and the second event.
Specifically, a large amount of knowledge about cause-effect correlation and cause-effect non-correlation is stored in the preset database, so that sufficient resources can be provided for acquiring cause-effect correlation events and non-cause correlation events. For example: the 'attack' event and the 'injury' event can be inquired and obtained from a preset database to be causally related events, and the 'attack' event and the 'birth' event are non-causally related events, so that an initial sample text can be determined based on the causally related events and/or the non-causally related events, and the initial sample text comprises the first event and the second event.
In addition, the degree of causal correlation between the first event and the second event differs across initial sample texts. The higher that degree, the more accurately a sample generation model trained on the corresponding initial sample text can generate high-quality text recognition samples. Therefore, in the embodiment of the invention, after the initial sample text is obtained, the causal distance between the first event and the second event is calculated. If the causal distance is smaller than the preset value, the degree of causal correlation between the two events is high, and the initial sample text is used as the sample text for training the sample generation model, so that the model can generate high-quality text recognition samples.
Based on any of the embodiments above, calculating a causal distance between a first event and a second event comprises:
constructing a causal representation space based on the initial sample text, and respectively mapping a first event and a second event into a first vector and a second vector in the causal representation space based on the initial causal relationship;
a causal distance is determined based on the first vector and the second vector.
Specifically, because the initial sample text is only preliminarily extracted from the preset database, the event pairs it contains (such as a first event and a second event) are coarse, and many of their causal/non-causal correlations are weak. A causal representation space is therefore constructed: the extracted initial sample text is converted into triples of the form {first event, causal/non-causal correlation, second event}, the events and relationships are mapped into vector representations in the causal representation space through a single-layer neural network, and the Euclidean distance between the vectors is used as the causal distance. The initial sample texts can then be filtered to remove those with weak causal/non-causal correlation, which improves the training effect of the sample generation model and yields high-quality text recognition samples.
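The mapping and filtering described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the tanh activation, the identity weights, the two-dimensional input embeddings and the preset value of 0.5 are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[str, Sequence[float], Sequence[float]]  # (text, first event, second event)


def single_layer(x: Sequence[float], weights: Sequence[Sequence[float]],
                 bias: Sequence[float]) -> Vector:
    """One dense layer (tanh activation assumed) mapping an event embedding
    into the causal representation space."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def causal_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two event vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filter_samples(samples: Sequence[Sample], weights, bias,
                   preset_value: float) -> List[str]:
    """Keep the texts whose causal distance is smaller than preset_value."""
    kept = []
    for text, first_emb, second_emb in samples:
        d = causal_distance(single_layer(first_emb, weights, bias),
                            single_layer(second_emb, weights, bias))
        if d < preset_value:
            kept.append(text)
    return kept


# Hypothetical toy embeddings and layer parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
samples = [
    ("attack / injury", [1.0, 0.0], [0.9, 0.1]),
    ("attack / birth", [1.0, 0.0], [-1.0, 0.5]),
]
kept = filter_samples(samples, identity, zero_bias, preset_value=0.5)
```

In this toy setting only the 'attack' / 'injury' pair lies within the preset value, so it alone is retained as a sample text.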
Based on any of the above embodiments, the causal distance is calculated based on the following formula:
wherein L represents the causal distance, e_i represents the representation of the first of two causally related events, e_j represents the representation of the second of the two causally related events, e'_i represents the representation of the first of two non-causally related events, e'_j represents the representation of the second of the two non-causally related events, λ represents the inter-vector threshold characterizing the causal correlation distance, d represents the inter-vector distance calculation function, T represents the set of causally related events, and T' represents the set of non-causally related events.
Specifically, events and relationships are mapped to vector representations in the causal representation space through a single-layer neural network, and the 'causal distance' between the two events in a triple is calculated by maximizing an objective function; a smaller causal distance indicates a higher degree of causal correlation between the first event and the second event. Selecting the initial sample texts whose causal distance is smaller than the preset value as sample texts improves the quality of the text recognition samples generated by the sample generation model, and in turn the precision of subsequent text recognition.
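The objective itself is not reproduced in this text. One common way to combine the symbols defined above (a threshold λ, a distance function d, a causal set T and a non-causal set T') is a margin-based ranking term, sketched below. The hinge form is an assumption of this sketch, and because the source speaks of maximizing an objective, its sign convention may differ from the penalty shown here.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]


def euclidean(u: Vector, v: Vector) -> float:
    """Inter-vector distance d, taken here as the Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_objective(causal_pairs: Sequence[Pair],
                     non_causal_pairs: Sequence[Pair],
                     margin: float,
                     dist: Callable[[Vector, Vector], float] = euclidean) -> float:
    """Assumed hinge-style penalty: it is zero once every causal pair in T is
    closer together than every non-causal pair in T' by at least `margin`."""
    total = 0.0
    for e_i, e_j in causal_pairs:
        for n_i, n_j in non_causal_pairs:
            total += max(0.0, margin + dist(e_i, e_j) - dist(n_i, n_j))
    return total


# Hypothetical one-dimensional event vectors.
causal = [([0.0], [0.1])]
well_separated = margin_objective(causal, [([0.0], [2.0])], margin=1.0)
too_close = margin_objective(causal, [([0.0], [0.5])], margin=1.0)
```

Under this assumed form, training pulls causally related events together and pushes non-causally related events apart until the margin is satisfied.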
Based on any of the above embodiments, determining an initial sample text from a preset database includes:
determining an original sample text from a preset database, and extracting a first sample event, a second sample event and a sample causal relationship in the original sample text, wherein the sample causal relationship is the causal relationship between the first sample event and the second sample event;
and carrying out synonym expansion on the sample causal relationship, and/or replacing a first sample event with any event, and/or replacing a second sample event with any event to obtain an initial sample text.
Specifically, Table 1 lists the extracted causally related events. As shown in Table 1, causal/non-causal correlations are obtained from different databases in three ways: 1) expanding the causal/non-causal related events annotated in the data set through lexical knowledge to obtain new causal/non-causal related events, for example by synonyms in WordNet or by verb classes in VerbNet; 2) extracting causally related triples from concept knowledge, such as ConceptNet, to obtain new causally related events; 3) introducing new causally related events from external documents through causal connectives, for example using the connectives that express causal discourse in PDTB2 and external documents such as the KBP data set.
TABLE 1
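The first of the three ways, expansion through lexical knowledge, can be sketched as follows. A small hand-written dictionary stands in for a lexical resource such as WordNet; the dictionary contents, the relation label and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (first event, relation, second event)


def expand_triple(triple: Triple, synonyms: Dict[str, List[str]]) -> List[Triple]:
    """Create new triples by replacing either event with a listed synonym
    while keeping the relation unchanged."""
    first, relation, second = triple
    expanded: List[Triple] = []
    for f in [first] + synonyms.get(first, []):
        for s in [second] + synonyms.get(second, []):
            candidate = (f, relation, s)
            # Skip the original triple and avoid duplicates.
            if candidate != triple and candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Toy synonym dictionary, used here in place of a real lexical database.
toy_synonyms = {"attack": ["assault"], "injury": ["wound"]}
new_triples = expand_triple(("attack", "causal", "injury"), toy_synonyms)
```

Each new triple can then serve as the basis of an initial sample text, subject to the causal distance filtering described earlier.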
Based on any of the above embodiments, the present invention provides a text recognition method, as shown in fig. 4, the method includes:
determining a text to be recognized;
step 420, inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of events in the text to be recognized;
the text recognition model is obtained by training based on the recognizer of the sample generation model according to any embodiment, wherein the text recognition sample according to any embodiment is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
Specifically, the embodiment of the present invention further trains the recognizer with the text recognition samples generated by dual learning. For example, a pre-trained language model (BERT) may first be used to extract features, a multi-layer perceptron may then be used to predict the class, and the model is updated by using the following cross-entropy loss function:
L_I(ep, s) = p(c' | s; θ_I)
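As a rough illustration of the classification step, the sketch below turns the recognizer's output scores into class probabilities and computes a loss for one sample. It assumes the conventional negative-log form of cross entropy, which the expression above does not spell out. Feature extraction with BERT and the multi-layer perceptron are omitted, and the two logits are hypothetical values standing in for their output.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability assigned to the labelled class."""
    return -math.log(softmax(logits)[label])


# Hypothetical recognizer scores for the classes [causal, non-causal].
logits = [2.0, 0.0]
probs = softmax(logits)
loss = cross_entropy(logits, label=0)  # training label: causal
```

The loss shrinks toward zero as the recognizer assigns more probability to the labelled relationship.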
As shown in FIG. 5, the invention obtains causal/non-causal related events from different knowledge bases in three ways: lexical knowledge expansion, connective knowledge introduction and concept knowledge introduction. A causal representation space is constructed based on the causal/non-causal related events, the newly extracted event pairs are converted into triples, the causal distance of the two events is calculated, and filtering yields event pairs that are causally related/non-causally related with high probability. Based on the extracted causal/non-causal related events, sentences that contain the given events and express the semantics of their causal/non-causal relationship are generated through related entity assignment, sentence completion and sentence filtering. A knowledge-guided constrained sentence generator and an event causal relationship recognizer are then modeled jointly through dual learning; the recognizer and the generator are iteratively optimized based on the causal reward and the semantic alignment reward, and new task-related training data are generated. Finally, the recognizer is further trained with the dual-enhanced data to obtain a text recognition model, so that the causal relationship of events in a text can be accurately recognized.
Based on this, experiments were performed to verify the performance of the method provided by the above embodiments. Specifically, the validity of the newly generated training data was verified on two widely used public data sets: 1) EventStoryLine v0.9 (ESC), which contains 258 documents, 4316 sentences and 1770 causal event pairs; 2) Causal-TimeBank (Causal-TB), which contains 184 documents, 6813 events and 318 causal event pairs.
Table 2 compares the text recognition results of the conventional methods with those of the method of the embodiment of the present invention (Ours). As can be seen from Table 2, when the text recognition samples generated by the method are used to train a text recognition model, the performance of event causal relationship recognition improves on both the EventStoryLine v0.9 data set and the Causal-TimeBank data set. The method provided by the embodiment of the invention can thus generate event causal relationship recognition data of better quality and greater task relevance.
TABLE 2
The following describes the text recognition sample acquisition device provided by the present invention, and the text recognition sample acquisition device described below and the text recognition sample acquisition method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, the present invention provides a text recognition sample acquiring device, as shown in fig. 6, the device includes:
an initial sample determining unit 610, configured to determine an initial text sample, where the initial text sample includes at least two events and a causal relationship or a non-causal relationship between the two events;
the identification sample generation unit 620 is configured to input the initial text sample to a sample generation model, and obtain a plurality of text identification samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways;
the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
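The cooperation of the two units can be sketched as a pair of small classes. Everything here is illustrative: the class names, the `toy_model` stand-in for the trained sample generation model and its output sentences are hypothetical, and the containment check is only this sketch's way of enforcing that every text recognition sample includes both events.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InitialTextSample:
    first_event: str
    second_event: str
    causal: bool  # True: causal relationship; False: non-causal relationship


# Stand-in type for the trained sample generation model.
SampleGenerationModel = Callable[[InitialTextSample], List[str]]


class InitialSampleDeterminingUnit:
    def determine(self, first_event: str, second_event: str,
                  causal: bool) -> InitialTextSample:
        return InitialTextSample(first_event, second_event, causal)


class RecognitionSampleGenerationUnit:
    def __init__(self, model: SampleGenerationModel) -> None:
        self.model = model

    def generate(self, sample: InitialTextSample) -> List[str]:
        # Keep only outputs that mention both events.
        return [s for s in self.model(sample)
                if sample.first_event in s and sample.second_event in s]


def toy_model(sample: InitialTextSample) -> List[str]:
    """Hypothetical generator returning fixed paraphrases."""
    a, b = sample.first_event, sample.second_event
    if sample.causal:
        return [f"The {a} led to the {b}.",
                f"The {b} was a result of the {a}.",
                f"The {a} happened yesterday."]
    return [f"The {a} was unrelated to the {b}."]


initial = InitialSampleDeterminingUnit().determine("attack", "injury", True)
text_samples = RecognitionSampleGenerationUnit(toy_model).generate(initial)
```

The third toy sentence omits the second event and is therefore dropped, leaving two samples that express the same causal relationship in different wording.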
Based on any embodiment above, the apparatus further comprises:
a first determining unit, configured to determine an initial sample text from a preset database, wherein the initial sample text comprises a first event and a second event;
a calculation unit, configured to calculate a causal distance between the first event and the second event, and to use the initial sample text as the sample text if the causal distance is smaller than a preset value; the causal distance is used to characterize a degree of causal correlation between the first event and the second event.
Based on any of the above embodiments, the calculation unit includes:
a construction unit, configured to construct a causal representation space based on the initial sample text, and map the first event and the second event into a first vector and a second vector in the causal representation space based on the initial causal relationship, respectively;
a second determination unit for determining the causal distance based on the distance between the first vector and the second vector.
Based on any of the above embodiments, the first determining unit includes:
the extraction unit is used for determining an original sample text from the preset database and extracting a first sample event, a second sample event and a sample causal relationship in the original sample text, wherein the sample causal relationship is the causal relationship between the first sample event and the second sample event;
and the generating unit is used for carrying out synonym expansion on the sample causal relationship, and/or replacing the first sample event with any event, and/or replacing the second sample event with any event to obtain the initial sample text.
The following describes the text recognition apparatus provided by the present invention, and the text recognition apparatus described below and the text recognition method described above may be referred to correspondingly.
Based on any of the above embodiments, the present invention provides a text recognition apparatus, as shown in fig. 7, the apparatus including:
a text determining unit 710, configured to determine a text to be recognized;
the text recognition unit 720 is configured to input the text to be recognized into a text recognition model, and obtain a text recognition result output by the text recognition model, where the text recognition result is a causal relationship of an event in the text to be recognized;
the text recognition model is obtained by training based on the recognizer of the sample generation model according to any embodiment, wherein the text recognition sample according to any embodiment is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
Fig. 8 is a schematic structural diagram of an electronic device provided in the present invention, and as shown in fig. 8, the electronic device may include: a processor (processor)810, a memory (memory)820, a communication Interface (Communications Interface)830 and a communication bus 840, wherein the processor 810, the memory 820 and the communication Interface 830 communicate with each other via the communication bus 840. Processor 810 may invoke logic instructions in memory 820 to perform a text recognition sample acquisition method comprising: determining an initial text sample, the initial text sample comprising at least two events and a causal relationship or a non-causal relationship between the two events; inputting the initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways; the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
And/or, to perform a text recognition method, the method comprising: determining a text to be recognized; inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of an event in the text to be recognized; the text recognition model is obtained by training based on the recognizer of the sample generation model, wherein the text recognition sample is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
Furthermore, the logic instructions in the memory 820 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the text recognition sample acquisition method provided by the above embodiments, the method comprising: determining an initial text sample, the initial text sample comprising at least two events and a causal relationship or a non-causal relationship between the two events; inputting the initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways; the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
And/or, to perform a text recognition method, the method comprising: determining a text to be recognized; inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of an event in the text to be recognized; the text recognition model is obtained by training based on the recognizer of the sample generation model, wherein the text recognition sample is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the text recognition sample acquisition method provided above, the method including: determining an initial text sample, the initial text sample comprising at least two events and a causal relationship or a non-causal relationship between the two events; inputting the initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways; the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
And/or, to perform a text recognition method, the method comprising: determining a text to be recognized; inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of an event in the text to be recognized; the text recognition model is obtained by training based on the recognizer of the sample generation model, wherein the text recognition sample is used as a training sample, and the causal relationship in the text recognition sample is used as a training label.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A text recognition sample acquisition method is characterized by comprising the following steps:
determining an initial text sample, the initial text sample comprising at least two events and a causal relationship or a non-causal relationship between the two events;
inputting the initial text sample into a sample generation model to obtain a plurality of text recognition samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways;
the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
2. The text recognition sample acquisition method according to claim 1, wherein the sample text is determined based on the steps of:
determining an initial sample text from a preset database, wherein the initial sample text comprises a first event and a second event;
calculating a causal distance between the first event and the second event, and if the causal distance is smaller than a preset value, taking the initial sample text as the sample text; the causal distance is used to characterize a degree of causal correlation between the first event and the second event.
3. The text recognition sample acquisition method of claim 2, wherein the calculating a causal distance between the first event and the second event comprises:
constructing a causal representation space based on the initial sample text, and mapping the first event and the second event into a first vector and a second vector in the causal representation space respectively based on the causal relationship;
determining the causal distance based on the distance between the first vector and the second vector.
4. The text recognition sample acquisition method according to claim 3, wherein the causal distance is calculated based on the following formula:
wherein L represents the causal distance, e_i represents the representation of the first of two causally related events, e_j represents the representation of the second of the two causally related events, e'_i represents the representation of the first of two non-causally related events, e'_j represents the representation of the second of the two non-causally related events, λ represents the inter-vector threshold characterizing the causal correlation distance, d represents the inter-vector distance calculation function, T represents the set of causally related events, and T' represents the set of non-causally related events.
5. The text recognition sample acquisition method according to claim 2, wherein the determining an initial sample text from a preset database comprises:
determining an original sample text from the preset database, and extracting a first sample event, a second sample event and a sample causal relationship in the original sample text, wherein the sample causal relationship is the causal relationship between the first sample event and the second sample event;
carrying out synonym expansion on the sample causal relationship, and/or replacing the first sample event with any event, and/or replacing the second sample event with any event to obtain the initial sample text.
6. A text recognition method, comprising:
determining a text to be recognized;
inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of an event in the text to be recognized;
the text recognition model is obtained by training, with a text recognition sample generated by the text recognition sample acquisition method according to any one of claims 1 to 5 as a training sample and the causal relationship in the text recognition sample as a training label.
7. A text recognition sample acquisition apparatus, comprising:
an initial sample determining unit, configured to determine an initial text sample, where the initial text sample includes at least two events and a causal relationship or a non-causal relationship between the two events;
the identification sample generation unit is used for inputting the initial text sample into a sample generation model to obtain a plurality of text identification samples output by the sample generation model; each text recognition sample comprises the two events, and each text recognition sample expresses causal relationship or non-causal relationship between the two events in different semantic ways;
the sample generation model is obtained by training based on a sample text and a causal relationship or a non-causal relationship in the sample text; the sample generation model is obtained by dual learning based on a generator and an identifier, the generator is used for performing data enhancement on the sample text to generate an enhanced sample, and the identifier is used for identifying causal relationships of events in the enhanced sample.
8. A text recognition apparatus, comprising:
the text determining unit is used for determining a text to be recognized;
the text recognition unit is used for inputting the text to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition result is a causal relationship of events in the text to be recognized;
the text recognition model is obtained by training, with a text recognition sample generated by the text recognition sample acquisition method according to any one of claims 1 to 5 as a training sample and the causal relationship in the text recognition sample as a training label.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and being executable on the processor, wherein the processor is configured to carry out the steps of the text recognition sample acquisition method according to any one of claims 1 to 5 when executing the program and/or wherein the processor is configured to carry out the steps of the text recognition method according to claim 6 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text recognition sample acquisition method according to any one of claims 1 to 5, and/or which, when being executed by a processor, carries out the steps of the text recognition method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110807246.XA CN113742445B (en) | 2021-07-16 | 2021-07-16 | Text recognition sample obtaining method and device and text recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113742445A (en) | 2021-12-03 |
CN113742445B (en) | 2022-09-27 |
Family
ID=78728716
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110807246.XA Active CN113742445B (en) | 2021-07-16 | 2021-07-16 | Text recognition sample obtaining method and device and text recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113742445B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117435928B (en) * | 2023-12-20 | 2024-06-18 | 粤港澳大湾区数字经济研究院(福田) | Training method of entity relation extraction model, entity relation extraction method and equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308323A (en) * | 2018-12-07 | 2019-02-05 | 中国科学院长春光学精密机械与物理研究所 | A kind of construction method, device and the equipment of causality knowledge base |
CN112329478A (en) * | 2020-11-30 | 2021-02-05 | 北京明略昭辉科技有限公司 | Method, device and equipment for constructing causal relationship determination model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6150282B2 (en) * | 2013-06-27 | 2017-06-21 | 国立研究開発法人情報通信研究機構 | Non-factoid question answering system and computer program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |