CN117493873A - Data set supplementing method and data set supplementing device - Google Patents


Info

Publication number
CN117493873A
CN117493873A
Authority
CN
China
Prior art keywords
candidate
sentences
sentence
entity
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311261400.3A
Other languages
Chinese (zh)
Inventor
贾子夏
李君鹏
郑子隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing General Artificial Intelligence Research Institute
Original Assignee
Beijing General Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing General Artificial Intelligence Research Institute filed Critical Beijing General Artificial Intelligence Research Institute
Priority to CN202311261400.3A priority Critical patent/CN117493873A/en
Publication of CN117493873A publication Critical patent/CN117493873A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a data set supplementing method and a data set supplementing device, belonging to the field of data processing. The data set supplementing method comprises the following steps: acquiring a plurality of candidate sentences, wherein each candidate sentence is obtained based on its corresponding candidate triple, and the candidate triple comprises a first entity, a second entity, and a candidate relation between the first entity and the second entity; replacing the candidate relation corresponding to a target candidate sentence among the candidate sentences with each of a plurality of different preset relations, so as to obtain a plurality of sentences to be tested corresponding to the target candidate sentence, the plurality of preset relations being predefined; and determining the sentence to be tested that is most similar to the target candidate sentence among the plurality of sentences to be tested as a supplementary sample sentence corresponding to the target candidate sentence. The data set supplementing method can effectively add out-of-distribution labels, improves the coverage of the supplemented data set, and achieves a good data set supplementing effect.

Description

Data set supplementing method and data set supplementing device
Technical Field
The application belongs to the field of data processing, and particularly relates to a data set supplementing method and a data set supplementing device.
Background
The original document-level relation extraction dataset DocRED suffers from a false-negative problem, so subsequent work re-annotated it by supplementing a large number of relation triples. In the related art, the data set is mainly supplemented by manual annotation, but annotating data from scratch in this way is inefficient and prone to missed labels, so the supplementing effect on the data set is poor.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. To this end, the present application provides a data set supplementing method and a data set supplementing device that can effectively add out-of-distribution labels, improve the coverage of the supplemented data set, and achieve a good data set supplementing effect, with higher processing efficiency and processing accuracy.
In a first aspect, the present application provides a data set supplementing method, the method comprising:
acquiring a plurality of candidate sentences, wherein the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples comprise a first entity, a second entity and a candidate relation between the first entity and the second entity;
respectively replacing candidate relations corresponding to target candidate sentences in the candidate sentences by using a plurality of different preset relations to obtain a plurality of sentences to be tested corresponding to the target candidate sentences; the preset relationships are preset;
and determining the sentence to be tested that is most similar to the target candidate sentence among the plurality of sentences to be tested as a supplementary sample sentence corresponding to the target candidate sentence.
According to the data set supplementing method, the candidate relation in the candidate triple corresponding to each candidate sentence is replaced with specific predefined relation types to obtain sentences to be tested, and the sentences to be tested are screened based on their similarity to the candidate sentence to obtain supplementary sentences. Labels outside the distribution can thus be effectively added, the coverage of the supplemented data set is improved, and a good data set supplementing effect is achieved, together with higher processing efficiency and processing accuracy.
According to an embodiment of the present application, the determining, as the supplementary sample sentence corresponding to the target candidate sentence, the to-be-tested sentence that is most similar to the target candidate sentence in the plurality of to-be-tested sentences includes:
inputting the target candidate sentences and target sentences to be tested in the multiple sentences to be tested into a natural language reasoning model, and obtaining the corresponding implication scores of the target sentences to be tested output by the natural language reasoning model;
and determining, based on the implication score of each sentence to be tested, the sentence to be tested that is most similar to the target candidate sentence among the plurality of sentences to be tested as a supplementary sample sentence corresponding to the target candidate sentence.
According to an embodiment of the present application, the determining, based on the implication score of each of the to-be-tested sentences, a to-be-tested sentence that is most similar to the target candidate sentence among the plurality of to-be-tested sentences as a supplementary sample sentence corresponding to the target candidate sentence includes:
and under the condition that the maximum value of the implication score exceeds a target threshold value and the entity types of the first entity and the second entity corresponding to the statement to be tested corresponding to the maximum value meet the type constraint of the relation, determining the statement to be tested corresponding to the maximum value as the supplementary sample statement.
According to the data set supplementing method, the plurality of screening conditions are set, so that the sentences to be tested meeting the conditions simultaneously serve as final supplementing sample sentences, the high quality of the newly generated relation triples can be effectively ensured, and the supplementing quality of the data set is further improved.
According to one embodiment of the application, the preset relationship is determined based on the following steps:
acquiring an initial preset relationship between the first entity and the second entity;
flipping the positions of the first entity and the second entity to obtain a reverse preset relationship;
and determining the initial preset relation and the reverse preset relation as the preset relation.
According to the data set supplementing method, the positions of the first entity and the second entity are flipped to swap the subject and object entities of each initial preset relationship, thereby obtaining the corresponding reverse preset relationship. This expands the coverage of the preset relationships, improves the comprehensiveness of the obtained sentences to be tested, and improves the accuracy and precision of the subsequent similarity matching.
According to one embodiment of the present application, before the obtaining the plurality of candidate sentences, the method further comprises:
inputting a text to be annotated into a target language model, and acquiring a plurality of initial triples output by the target language model;
and filtering the plurality of initial triples to obtain at least one candidate triplet.
According to one embodiment of the present application, the inputting the text to be annotated into the target language model, obtaining a plurality of initial triples output by the target language model, includes:
generating prompt information, wherein the prompt information comprises a target generation instruction, the text to be annotated, and an entity list corresponding to the text to be annotated; wherein the target generation instruction includes a number threshold for limiting the number of resulting initial triples;
in a case where an iteration threshold has not been reached, constraining the target language model based on the prompt information to process the text to be annotated, and acquiring at least one triple output by the target language model for the first time;
taking the at least one initial triple output the first time as input, continuing to constrain the target language model based on the prompt information to process the text to be annotated, and acquiring at least one triple output by the target language model for a second time, wherein the second time is the iteration following the first time;
and determining the most recently acquired triples as the initial triples when the iteration threshold is reached.
In a second aspect, the present application provides a data set supplementing apparatus, the apparatus comprising:
the first processing module is used for acquiring a plurality of candidate sentences, wherein the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples comprise a first entity, a second entity and a candidate relation between the first entity and the second entity;
the second processing module is used for respectively replacing candidate relations corresponding to target candidate sentences in the candidate sentences by using a plurality of different preset relations to obtain a plurality of sentences to be tested corresponding to the target candidate sentences; the preset relationships are preset;
and the third processing module is used for determining the sentence to be tested that is most similar to the target candidate sentence among the plurality of sentences to be tested as a supplementary sample sentence corresponding to the target candidate sentence.
According to the data set supplementing device, the candidate relation in the candidate triples corresponding to the candidate sentences is replaced by the specific predefined relation type to obtain the sentences to be tested, the sentences to be tested are screened based on the similarity between the sentences to be tested and the candidate sentences to obtain the supplementing sentences, labels outside distribution can be effectively added, the coverage range of the supplementing data set is improved, and the data set supplementing effect is good; and has higher processing efficiency and processing accuracy.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data set supplementing method according to the first aspect as described above when executing the computer program.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data set supplementing method as described in the first aspect above.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements a data set supplementing method as described in the first aspect above.
The above technical solutions in the embodiments of the present application have at least one of the following technical effects:
the candidate relation in the candidate triples corresponding to the candidate sentences is replaced by a specific predefined relation type to obtain the sentences to be tested, and the sentences to be tested are screened based on the similarity between the sentences to be tested and the candidate sentences to obtain the supplementary sentences, so that labels outside the distribution can be effectively added, the coverage of the supplementary data set is improved, and a better data set supplementing effect is achieved; and has higher processing efficiency and processing accuracy.
Further, by setting a plurality of screening conditions to take the sentences to be tested which simultaneously meet the conditions as final supplementary sample sentences, the high quality of the newly generated relation triples can be effectively ensured, and the supplementary quality of the data set is further improved.
Further, by flipping the positions of the first entity and the second entity to swap the subject and object entities of each initial preset relationship, the corresponding reverse preset relationship is obtained; this enlarges the coverage of the preset relationships, improving the comprehensiveness of the obtained sentences to be tested and the accuracy and precision of the subsequent similarity matching.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 is a flow chart of a data set supplementing method according to an embodiment of the present application;
FIG. 2 is a second flow chart of a data set supplementing method according to the embodiment of the present application;
FIG. 3 is a schematic structural view of a data set supplementing device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application fall within the scope of protection of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The data set supplementing method, the data set supplementing device, the electronic device and the readable storage medium provided by the embodiment of the application are described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
The data set supplementing method can be applied to the terminal, and can be specifically executed by hardware or software in the terminal.
The terminal includes, but is not limited to, a portable communication device such as a mobile phone or tablet computer. It should also be appreciated that in some embodiments, the terminal may not be a portable communication device, but rather a desktop computer.
The execution subject of the data set supplementing method provided in the embodiment of the present application may be an electronic device or a functional module or a functional entity capable of implementing the data set supplementing method in the electronic device, where the electronic device in the embodiment of the present application includes, but is not limited to, a mobile phone, a tablet computer, a camera, a wearable device, and the like, and the data set supplementing method provided in the embodiment of the present application is described below by taking the electronic device as an execution subject.
As shown in fig. 1, the data set supplementing method includes: step 110, step 120 and step 130.
Step 110, obtaining a plurality of candidate sentences, wherein the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples comprise a first entity, a second entity and a candidate relation between the first entity and the second entity;
In this step, the candidate sentence may be a natural sentence.
The candidate triples include a first entity, a second entity, and a candidate relationship between the first entity and the second entity.
One of the first entity and the second entity is a subject, the other is an object, and the candidate relationship is used for representing the relationship between the subject and the object.
The candidate triples are combined and converted into natural language to obtain the candidate sentences.
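As a minimal sketch of this conversion step (the function name and the example triple are illustrative assumptions, not taken from the source), a candidate triple can be verbalized into a candidate sentence as follows:

```python
def triple_to_sentence(head, relation, tail):
    """Join a triple's subject entity, relation description, and object
    entity into one natural-language candidate sentence."""
    return f"{head} {relation} {tail}."

# Each candidate triple yields one candidate sentence.
candidate_triples = [("Marie Curie", "was born in", "Warsaw")]
candidate_sentences = [triple_to_sentence(h, r, t) for h, r, t in candidate_triples]
print(candidate_sentences[0])  # Marie Curie was born in Warsaw.
```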
It will be appreciated that the candidate triples are derived based on the text to be annotated.
For example, each candidate triple may take the form <first entity, candidate relation, second entity>.
The specific ways of obtaining the candidate triples will be described below, and will not be described in detail herein.
Step 120, replacing candidate relationships corresponding to the target candidate sentences in the candidate sentences by using a plurality of different preset relationships respectively to obtain a plurality of sentences to be tested corresponding to the target candidate sentences; the plurality of preset relations are preset;
in this step, the target candidate sentence may be any candidate sentence among a plurality of candidate sentences.
The sentence to be tested is a new sentence obtained by replacing the candidate relation in the candidate triple corresponding to the target candidate sentence with a preset relation.
The statement under test may be a natural statement.
The preset relationship is a preset relationship of a predefined type.
In the actual implementation process, the data set to be supplemented itself corresponds to a set of predefined relation types, and the preset relations can be selected directly from that set.
It will be appreciated that predefined relation types are generally abstract. In order for the hypotheses to accurately convey the meaning of each relation type, the description of each relation type may be combined with the subject and object entities for better semantic expression.
In some embodiments, the preset relationship may be determined based on the following steps:
acquiring an initial preset relationship between a first entity and a second entity;
turning the positions of the first entity and the second entity to obtain a reverse preset relationship;
and determining the initial preset relationship and the reverse preset relationship as preset relationships.
In this embodiment, the reverse preset relationship involves the same entities as the initial preset relationship, but with the subject and object swapped.
For example, for a triple <'A', R, 'B'>, the first entity is 'A', the second entity is 'B', and the initial preset relationship is R. Flipping the positions of the first entity and the second entity adjusts <'A', R, 'B'> to <'B', R, 'A'>, yielding the reverse preset relationship; both the relationship from 'A' to 'B' and the relationship from 'B' to 'A' are then determined as preset relationships.
For any initial preset relationship, a corresponding reverse preset relationship can be set.
In actual execution, for each generated relationship serving as a premise, 96 × 2 = 192 possible hypotheses may be constructed, where 96 is the size of the set of predefined relations (i.e., the number of initial preset relationships, excluding <NULL>) and "×2" indicates that the subject and object entities are swapped for each initial preset relationship.
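The hypothesis construction above can be sketched as follows; the function name and relation templates are illustrative assumptions, with two stand-in relations in place of the 96 predefined ones:

```python
def build_hypotheses(head, tail, relation_templates):
    """For one candidate triple's entity pair, build one hypothesis per
    predefined relation in both directions (forward and reversed), i.e.
    len(relation_templates) * 2 hypotheses in total."""
    hypotheses = []
    for rel in relation_templates:               # e.g. the 96 predefined relations
        hypotheses.append(f"{head} {rel} {tail}.")  # forward: head as subject
        hypotheses.append(f"{tail} {rel} {head}.")  # reversed: tail as subject
    return hypotheses

templates = ["is located in", "is the capital of"]  # stand-ins for the real set
hyps = build_hypotheses("Paris", "France", templates)
print(len(hyps))  # 4 hypotheses: 2 relations x 2 directions
```

With the full set of 96 predefined relations, the same function yields the 192 hypotheses described above.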
According to the data set supplementing method provided by the embodiment of the application, the positions of the first entity and the second entity are flipped to swap the subject and object entities of each initial preset relationship, thereby obtaining the corresponding reverse preset relationship. This expands the coverage of the preset relationships, improves the comprehensiveness of the obtained sentences to be tested, and improves the accuracy and precision of the subsequent similarity matching.
Step 130, determining a to-be-detected sentence which is most similar to the target candidate sentence in the plurality of to-be-detected sentences as a supplementary sample sentence corresponding to the target candidate sentence;
in this step, the supplementary sample sentence is a sentence whose meaning is similar to that of the target candidate sentence but which is labeled with a different relation type.
For any candidate sentence, filtering can be performed based on the methods of steps 110-130 to obtain a supplemental sample sentence.
In the actual execution process, the to-be-detected sentence most similar to the target candidate sentence can be obtained through a pre-trained neural network model, and the neural network model can be any realizable model, which is not limited herein.
In some embodiments, after step 130, the method may further comprise: and determining the triples corresponding to the supplementary sample sentences as supplementary relation triples.
In this step, the supplemental relationship triplet is a supplemental relationship triplet of the dataset.
The triples corresponding to the supplementary sample sentences include: the first entity, the second entity and a preset relationship capable of accurately characterizing the relationship between the first entity and the second entity, wherein the preset relationship is different from the candidate relationship.
For any candidate sentence, supplementation can be performed based on the method of steps 110-130, thereby realizing supplementation of the data set.
According to the data set supplementing method provided by the embodiment of the application, the candidate relation in the candidate triples corresponding to the candidate sentences is replaced by the specific predefined relation type to obtain the sentences to be tested, the sentences to be tested are screened based on the similarity between the sentences to be tested and the candidate sentences to obtain the supplementing sentences, and then the supplementing relation triples are obtained, labels outside distribution can be effectively added, the coverage range of the supplementing data set is improved, and the data set supplementing effect is good; and has higher processing efficiency and processing accuracy.
During research and development, the inventors found that a data set supplementing method also exists in the related art: the method first trains a model with weakly supervised data, then uses the trained model to predict on the data to be annotated to obtain a candidate set of annotations, which is then manually screened and labeled. This method depends on the types present in the training samples, and labels outside the distribution are difficult to add; as a result, mostly data similar to the already-abundant types is supplemented, while labels with few original instances are hard to supplement, which affects the supplementing effect on the data set.
In the present application, after the candidate triples are obtained, the relation in each triple is directly replaced with a specific predefined relation type to serve as a hypothesis, the similarity between each hypothesis and its corresponding premise is calculated, and the hypotheses with higher similarity are used as the supplementary data corresponding to the premise. Additional triples can thus be generated, and because these generated triples do not depend on the types in the training samples, labels outside the distribution are effectively added, the coverage of the supplemented data set is improved, and the supplementing effect is improved.
In addition, the method requires no manual annotation, and has higher supplementing efficiency as well as higher accuracy and precision.
For the Re-DocRED test set, after the remote relation triples were obtained, the inventors performed multiple rounds of test verification, for example manual verification of each remote triple: two annotators were asked whether the relation triple could be inferred from the provided document, and a third annotator resolved any conflict between the first two annotators' answers.
According to the data set supplementing method provided by the embodiment of the application, the candidate relation in the candidate triplet corresponding to the candidate sentence is replaced by the specific predefined relation type to obtain the sentence to be tested, the sentence to be tested is screened based on the similarity between the sentence to be tested and the candidate sentence to obtain the supplementing sentence, labels outside distribution can be effectively added, the coverage range of the supplementing data set is improved, and the data set supplementing effect is good; and has higher processing efficiency and processing accuracy.
The specific implementation of step 130 is described below.
In some embodiments, step 130 may include:
inputting target candidate sentences and target to-be-tested sentences in the plurality of to-be-tested sentences into a natural language reasoning model to obtain implication scores corresponding to the target to-be-tested sentences output by the natural language reasoning model;
and determining the to-be-detected sentence which is most similar to the target candidate sentence in the plurality of to-be-detected sentences as a supplementary sample sentence corresponding to the target candidate sentence based on the implication score of each to-be-detected sentence.
In this embodiment, a natural language reasoning (Natural Language Inference, NLI) model is used to determine the semantic relationship of two sentences.
It is understood that the NLI model receives two sentences as input, namely a premise and a hypothesis. For each input pair, the NLI model computes three scores measuring the likelihood that the two sentences stand in an "entailment", "neutral", or "contradiction" relation; if "entailment" receives the highest score, the model concludes that the two sentences are factually consistent.
The implication score is used to characterize the similarity of two sentences, the higher the implication score, the more similar the sentences.
In the actual implementation process, the range of the implication score can be between 0 and 1.
The NLI model may be a T5 or other NLI type model, and is not limited herein.
Among these, T5-XXL is a generative model with strong generality; at inference time it can distinguish "entailment" from "no entailment" by generating the two corresponding output sequences. In this embodiment, the probabilities of these two sequences are used to calculate an implication score for each predefined relation type.
The target test sentence may be any sentence among a plurality of test sentences.
In the application, a target candidate sentence can be used as a premise, and a target to-be-tested sentence obtained after a candidate relation in a candidate triplet corresponding to the target candidate sentence is replaced by a specific predefined relation type can be used as a hypothesis.
In the actual execution process, the natural language reasoning model can automatically calculate the implication score between two input sentences, so that the implication score corresponding to each sentence to be tested can be obtained, and the final supplementary sample sentence can be screened from a plurality of sentences to be tested based on the level of each implication score.
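A hedged sketch of this screening step follows; `nli_entailment_score` stands in for a real NLI model (such as T5-XXL), and the toy word-overlap scorer is purely illustrative, not the actual scoring method:

```python
def pick_most_similar(premise, hypotheses, nli_entailment_score):
    """Score every sentence under test (hypothesis) against the target
    candidate sentence (premise) and return the best one with its score."""
    scored = [(h, nli_entailment_score(premise, h)) for h in hypotheses]
    return max(scored, key=lambda pair: pair[1])

# Toy word-overlap scorer standing in for a real NLI model.
def toy_score(premise, hypothesis):
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return len(shared) / max(len(hypothesis.split()), 1)

best, score = pick_most_similar(
    "Paris is the capital of France.",
    ["Paris is located in France.", "France is located in Paris."],
    toy_score,
)
print(best)  # Paris is located in France.
```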
In some embodiments, determining, based on the implication score of each statement to be tested, a statement to be tested that is most similar to the target candidate statement among the multiple statements to be tested as a supplementary sample statement corresponding to the target candidate statement may include:
and under the condition that the maximum value of the implication score exceeds a target threshold value and the entity types of the first entity and the second entity corresponding to the statement to be tested corresponding to the maximum value meet the type constraint of the relation, determining the statement to be tested corresponding to the maximum value as a supplementary sample statement.
In this embodiment, the target threshold is a relatively high value that may be user-defined, for example 0.6 or 0.8, which is not limited in this application.
In actual execution, after the implication score of each sentence to be tested is obtained, only the sentences to be tested that simultaneously satisfy the following three conditions are retained:
1) the entity types of the subject and object entities satisfy the type constraint of the relation;
2) the sentence obtained the highest implication score;
3) the implication score exceeds the target threshold.
Taking only the sentences that meet all three conditions as final supplementary sample sentences effectively ensures the high quality of the newly generated relation triples.
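The three retention conditions can be sketched as follows; the 0.6 threshold and the type-constraint table are illustrative assumptions from this embodiment's examples, not values fixed by the method:

```python
# Minimal sketch of the three retention conditions.
TYPE_CONSTRAINTS = {"employer": ("PER", "ORG")}  # relation -> (subject type, object type)

def select_supplementary(scored, entity_types, threshold=0.6):
    """scored: list of (relation, subject, object, implication_score) tuples.
    Returns the candidate that satisfies all three conditions, or None if even
    the best-scoring one fails a condition."""
    rel, subj, obj, score = max(scored, key=lambda t: t[3])  # 2) highest score
    if score <= threshold:                                   # 3) exceeds threshold
        return None
    allowed = TYPE_CONSTRAINTS.get(rel)
    if allowed is not None and (entity_types[subj], entity_types[obj]) != allowed:
        return None                                          # 1) type constraint
    return (rel, subj, obj, score)

picked = select_supplementary(
    [("employer", "David Lean", "the film company", 0.97),
     ("owner of", "the film company", "David Lean", 0.22)],
    {"David Lean": "PER", "the film company": "ORG"},
)
```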
According to the data set supplementing method provided by the embodiments of this application, multiple screening conditions are set so that only a sentence to be tested that satisfies every condition simultaneously becomes the final supplementary sample sentence, which effectively ensures the high quality of the newly generated relation triples and further improves the supplementation quality of the data set.
The specific manner of obtaining the candidate triples is described below.
In some embodiments, prior to step 110, the method may further comprise:
inputting the text to be annotated into the target language model, and obtaining a plurality of initial triples output by the target language model;
and filtering the plurality of initial triples to obtain at least one candidate triplet.
In this embodiment, the text to be annotated may be an original document corresponding to the Re-DocRED test set.
The target language model may be a large language model (Large Language Model, LLM), such as a Generative Pre-trained Transformer (GPT), including but not limited to GPT-3.5 (gpt-3.5-turbo), GPT-4, or any other applicable language model.
It can be understood that most of the relations generated by GPT are natural language, so the subject entity, relation, and object entity of each triplet can be directly concatenated into a natural sentence; this keeps processing efficient and the operation simple.
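A minimal sketch of this concatenation step (the entity and relation strings are illustrative):

```python
# Direct concatenation of a triple into a natural sentence, as described above.
def triple_to_sentence(subject, relation, obj):
    """Join subject entity, relation, and object entity into one sentence."""
    return f"{subject} {relation} {obj}."

premise = triple_to_sentence("David Lean", "works for", "the film company")
```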
In this application, the relation types generated by the large language model may be left unrestricted, so as to fully exploit the potential of the large language model and generate more triples.
Filtering the plurality of initial triples to obtain at least one candidate triplet includes: based on the given entities, deleting the triples containing entities inconsistent with the given entities, and determining the remaining triples as the final candidate set, so that candidate sentences can be obtained from this set for predefined relation alignment.
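The entity-based filter can be sketched as follows; the entity names are illustrative:

```python
# Keep only triples whose subject and object both appear in the given entity list.
def filter_by_entities(triples, given_entities):
    allowed = set(given_entities)
    return [t for t in triples if t[0] in allowed and t[2] in allowed]

candidate_triples = filter_by_entities(
    [("David Lean", "director", "The Sound Barrier"),
     ("David Lean", "born in", "Croydon")],  # "Croydon" is not a given entity
    ["The Sound Barrier", "1952", "David Lean", "Alexander Korda"],
)
```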
In some embodiments, inputting the text to be annotated into the target language model and obtaining the plurality of initial triples output by the target language model may include:
generating prompt information, where the prompt information includes a target generation instruction, the text to be annotated, and an entity list corresponding to the text to be annotated, the target generation instruction including a number threshold that limits the number of initial triples obtained in the first round;
if the iteration threshold has not been reached, constraining the target language model with the prompt information to process the text to be annotated, and obtaining at least one triplet output by the target language model in a first round;
taking the at least one triplet output in the first round as input, continuing to constrain the target language model with the prompt information to process the text to be annotated, and obtaining at least one triplet output by the target language model in a second round, the second round being the iteration following the first;
once the iteration threshold is reached, determining the triples obtained by the last round as the plurality of initial triples.
In this embodiment, the iteration threshold may be user-defined, and is not limited in this application.
As shown in FIG. 2, the prompt information includes a target generation instruction and a specific context (i.e., the text to be annotated), accompanied by the entity list corresponding to that text.
The entity list constrains the entities contained in the generated triples, so that as many of them as possible come from the given entity list.
The number threshold may be user-defined, such as 20 or 25, which is not limited in this application.
Setting a number threshold to cap the number of triples the LLM generates in the first round mitigates the drop in accuracy that occurs when LLM output becomes too long.
In actual execution, "at most 20 triples" is set as the number threshold in the initial prompt; an iterative approach is then used to generate additional triples, taking the previous GPT answer as input while instructing GPT: "please continue to generate 20 more triples, using only the given entities from the entity list."
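The iterative loop can be sketched as follows; `call_llm` is a hypothetical stub standing in for a real GPT API call, and the prompt wording is illustrative:

```python
# Sketch of the iterative generation loop. `call_llm` is a hypothetical
# stand-in for a real GPT API call (e.g. gpt-3.5-turbo); a real implementation
# would send the prompt plus the previous answer as conversation history.
def call_llm(prompt, previous_answer=None):
    # Hypothetical stub: always returns one triple per round.
    return [("entity1", "relation", "entity2")]

def generate_initial_triples(text, entity_list,
                             number_threshold=20, iteration_threshold=3):
    prompt = (f"Please use only the given entities in {entity_list} and generate "
              f"at most {number_threshold} triples you consider correct, from: {text}")
    triples, answer = [], None
    for _ in range(iteration_threshold):  # stop once the iteration threshold is reached
        answer = call_llm(prompt, previous_answer=answer)  # feed back previous answer
        triples.extend(answer)            # accumulate each round's output
    return triples

initial = generate_initial_triples("The Sound Barrier is a 1952 film.",
                                   ["The Sound Barrier", "1952"],
                                   iteration_threshold=2)
```

Whether the final set is the accumulation of all rounds or only the last round's output is an implementation choice; the sketch accumulates.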
According to the data set supplementing method provided by the embodiments of this application, the initial triples are obtained through the target language model and then filtered, which improves the quality of the resulting candidate triples.
A specific implementation of the present application will be described below with reference to fig. 2.
With continued reference to FIG. 2, the text to be annotated is: "The Sound Barrier is a 1952 country-A film directed by David Lean. It was his first film for the country-A film company of Alexander Korda." The corresponding target generation instruction is: "Please use only the given entities in the entity list and generate at most 20 triples that you consider correct. Please answer using triples of the form <entity1, relation, entity2>, where entity1 and entity2 come from the entity list."
The entity list is: ['The Sound Barrier', '1952', 'David Lean', 'Alexander Korda', 'country-A film company'].
Through the GPT model, multiple candidate triples can be obtained, such as: <'The Sound Barrier', 'director', 'David Lean'>, <'David Lean', 'works for', 'country-A film company'>, <'country-A film company', 'owned by', 'Alexander Korda'>, etc.
By converting the candidate triples into natural language, multiple candidate sentences (i.e., premises) can be obtained, such as premise A: "David Lean works for the country-A film company", premise B: "The country-A film company is owned by Alexander Korda", etc.
For any candidate sentence, the NLI module can be used to map the generated relation to a predefined relation type, yielding multiple sentences to be tested (i.e., hypotheses), as shown in FIG. 2.
Inputting the premises and hypotheses into the NLI model yields multiple implication scores, such as: <'David Lean', 'employer', 'country-A film company'>: 0.97; <'country-A film company', 'owner', 'David Lean'>: 0.22; etc.
Filtering the sentences to be tested based on the implication scores and the other conditions finally yields the supplementary sentences, such as <'David Lean', 'employer', 'country-A film company'> and <'country-A film company', 'owner', 'Alexander Korda'>.
In some embodiments, after step 110, the method may further comprise:
determining a candidate sentence as a supplementary sample sentence in the case that the candidate relation included in the candidate triplet corresponding to that candidate sentence is the same as a relation in the predefined relation type set (i.e., a preset relation).
After the supplementary sample sentence is obtained, the triplet corresponding to it is taken as a supplementary relation triplet of the data set.
In this embodiment, when a relation generated by GPT happens to be a relation in the predefined relation type set, the corresponding triplet does not need to be mapped through the NLI module; it is simply added to the final selected triplet set.
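This shortcut can be sketched as follows; the predefined relation set shown is an illustrative subset:

```python
# Triples whose generated relation already matches a predefined relation type
# bypass the NLI module; the rest are routed to NLI mapping.
PREDEFINED_RELATIONS = {"employer", "owner of"}  # illustrative subset

def route(triples):
    """Split triples into those added directly and those needing NLI mapping."""
    direct = [t for t in triples if t[1] in PREDEFINED_RELATIONS]
    needs_nli = [t for t in triples if t[1] not in PREDEFINED_RELATIONS]
    return direct, needs_nli

direct, needs_nli = route([
    ("David Lean", "employer", "the film company"),   # already predefined
    ("David Lean", "works for", "the film company"),  # must be mapped via NLI
])
```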
In this application, a pipeline framework is designed to further supplement the Re-DocRED test set, automatically generating distant training labels through the GPT and NLI modules. GPT instructions for generating relation-triplet candidates are constructed first, and the generated relations are then aligned with the predefined relation types through a natural language inference model. In this way GPT can solve the zero-shot document-level relation extraction task; at the same time the final relations are more accurate, the method does not depend on the current relation types and can obtain more out-of-distribution types, supplementing more labels of different kinds and effectively improving the data set supplementation effect on Re-DocRED.
The data set supplementing method provided by the embodiments of this application may be executed by a data set supplementing device. Below, the data set supplementing device provided by the embodiments of this application is described, taking as an example a data set supplementing device that executes the data set supplementing method.
The embodiment of the application also provides a data set supplementing device.
As shown in fig. 3, the data set supplementing apparatus includes: a first processing module 310, a second processing module 320, and a third processing module 330.
A first processing module 310, configured to obtain a plurality of candidate sentences, where the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples include a first entity, a second entity, and a candidate relationship between the first entity and the second entity;
the second processing module 320 is configured to replace candidate relationships corresponding to the target candidate sentence in the plurality of candidate sentences by using a plurality of different preset relationships, so as to obtain a plurality of to-be-tested sentences corresponding to the target candidate sentence; the plurality of preset relations are preset;
the third processing module 330 is configured to determine, as a supplementary sample sentence corresponding to the target candidate sentence, a to-be-tested sentence that is most similar to the target candidate sentence in the plurality of to-be-tested sentences.
According to the data set supplementing device provided by the embodiments of this application, the candidate relation in the candidate triplet corresponding to a candidate sentence is replaced with a specific predefined relation type to obtain sentences to be tested, and the sentences to be tested are screened based on their similarity to the candidate sentence to obtain supplementary sentences. Out-of-distribution labels can thus be added effectively, the coverage of the supplemented data set is improved, the supplementation effect is good, and the processing efficiency and accuracy are high.
In some embodiments, the third processing module 330 may also be configured to:
inputting target candidate sentences and target to-be-tested sentences in the plurality of to-be-tested sentences into a natural language reasoning model to obtain implication scores corresponding to the target to-be-tested sentences output by the natural language reasoning model;
and determining the to-be-detected sentence which is most similar to the target candidate sentence in the plurality of to-be-detected sentences as a supplementary sample sentence corresponding to the target candidate sentence based on the implication score of each to-be-detected sentence.
In some embodiments, the third processing module 330 may also be configured to:
and under the condition that the maximum value of the implication score exceeds a target threshold value and the entity types of the first entity and the second entity corresponding to the statement to be tested corresponding to the maximum value meet the type constraint of the relation, determining the statement to be tested corresponding to the maximum value as a supplementary sample statement.
According to the data set supplementing device provided by the embodiments of this application, multiple screening conditions are set so that only a sentence to be tested that satisfies every condition simultaneously becomes the final supplementary sample sentence, which effectively ensures the high quality of the newly generated relation triples and further improves the supplementation quality of the data set.
In some embodiments, the apparatus may further include a fourth processing module to:
acquiring an initial preset relationship between a first entity and a second entity;
swapping the positions of the first entity and the second entity to obtain a reverse preset relationship;
and determining the initial preset relationship and the reverse preset relationship as preset relationships.
According to the data set supplementing device provided by the embodiments of this application, the positions of the first entity and the second entity are swapped so that the subject and object entities of each initial preset relation are exchanged, producing mutually inverse preset relations. This expands the coverage of the preset relations, improves the comprehensiveness of the obtained sentences to be tested, and improves the accuracy and precision of the subsequent similarity matching.
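The entity-position flip can be sketched as follows; the entities and the relation name are illustrative:

```python
# Each preset relation is tested in both entity orders, doubling the
# hypotheses (sentences to be tested) produced per relation.
def build_hypotheses(first_entity, second_entity, preset_relations):
    hypotheses = []
    for rel in preset_relations:
        hypotheses.append(f"{first_entity} {rel} {second_entity}")  # initial preset relation
        hypotheses.append(f"{second_entity} {rel} {first_entity}")  # reverse: positions swapped
    return hypotheses

hyps = build_hypotheses("David Lean", "the film company", ["employer"])
```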
In some embodiments, the apparatus may further include a fifth processing module for:
before a plurality of candidate sentences are acquired, inputting a text to be annotated into a target language model, and acquiring a plurality of initial triples output by the target language model;
and filtering the plurality of initial triples to obtain at least one candidate triplet.
In some embodiments, the fifth processing module may be further configured to:
generating prompt information, wherein the prompt information comprises a target generation instruction, a text to be marked and an entity list corresponding to the text to be marked; wherein the target generation instruction includes a number threshold for limiting the initial triples obtained for the first time;
under the condition that the iteration threshold is not reached, limiting the target language model based on the prompt information to process the text to be annotated, and acquiring at least one triplet output by the target language model for the first time;
taking at least one initial triplet output for the first time as input, continuing to limit the target language model based on prompt information to process the text to be marked, and obtaining at least one triplet output for the second time of the target language model, wherein the second time is the next iteration of the first time;
in the event that an iteration threshold is reached, the last acquired triplet is determined as a plurality of initial triples.
In some embodiments, the apparatus may further include a sixth processing module for: after the to-be-tested sentence which is most similar to the target candidate sentence in the plurality of to-be-tested sentences is determined to be the supplementary sample sentence corresponding to the target candidate sentence, the triplet corresponding to the supplementary sample sentence is determined to be the supplementary relation triplet.
The data set supplementing device in the embodiments of this application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palmtop computer, vehicle-mounted electronic device, mobile internet device (Mobile Internet Device, MID), augmented reality (AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc., and may also be a server, network attached storage (Network Attached Storage, NAS), personal computer (PC), television (TV), teller machine, self-service machine, etc.; the embodiments of this application are not specifically limited in this respect.
The data set supplementing device in the embodiments of this application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of this application.
The data set supplementing device provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to 2, and in order to avoid repetition, a description is omitted here.
In some embodiments, as shown in fig. 4, the embodiment of the present application further provides an electronic device 400, including a processor 401, a memory 402, and a computer program stored in the memory 402 and capable of running on the processor 401, where the program when executed by the processor 401 implements the respective processes of the data set supplementing method embodiment described above, and the same technical effects can be achieved, and for avoiding repetition, a description is omitted herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device described above.
The embodiment of the application further provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the respective processes of the above-mentioned data set supplementing method embodiment, and can achieve the same technical effects, so that repetition is avoided, and no further description is given here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes computer-readable storage media such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the above-described data set supplementing method.
The processor is the processor in the electronic device described in the above embodiment.
The embodiment of the application further provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is used for running a program or an instruction, so as to implement each process of the above embodiment of the data set supplementing method, and achieve the same technical effect, so that repetition is avoided, and no redundant description is provided herein.
It should be understood that the chips referred to in the embodiments of this application may also be called system-on-chip, chip system, or system-on-a-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A method of supplementing a data set, comprising:
acquiring a plurality of candidate sentences, wherein the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples comprise a first entity, a second entity and a candidate relation between the first entity and the second entity;
respectively replacing, with a plurality of different preset relations, the candidate relation corresponding to a target candidate sentence among the plurality of candidate sentences, to obtain a plurality of sentences to be tested corresponding to the target candidate sentence, wherein the plurality of preset relations are set in advance;
and determining the to-be-detected sentence which is most similar to the target candidate sentence in the plurality of to-be-detected sentences as a supplementary sample sentence corresponding to the target candidate sentence.
2. The data set supplementing method according to claim 1, wherein the determining a to-be-tested sentence that is most similar to the target candidate sentence among the plurality of to-be-tested sentences as a supplement sample sentence corresponding to the target candidate sentence includes:
inputting the target candidate sentences and target sentences to be tested in the multiple sentences to be tested into a natural language reasoning model, and obtaining the corresponding implication scores of the target sentences to be tested output by the natural language reasoning model;
and determining the to-be-tested sentence most similar to the target candidate sentence among the plurality of to-be-tested sentences as the supplementary sample sentence corresponding to the target candidate sentence based on the implication score of each to-be-tested sentence.
3. The method for supplementing a data set according to claim 2, wherein determining, as the supplemental sample sentence corresponding to the target candidate sentence, a to-be-tested sentence most similar to the target candidate sentence among the plurality of to-be-tested sentences based on the implication score of each to-be-tested sentence, comprises:
And under the condition that the maximum value of the implication score exceeds a target threshold value and the entity types of the first entity and the second entity corresponding to the statement to be tested corresponding to the maximum value meet the type constraint of the relation, determining the statement to be tested corresponding to the maximum value as the supplementary sample statement.
4. A data set supplementing method according to any of claims 1-3, wherein the preset relations are determined based on the following steps:
acquiring an initial preset relationship between the first entity and the second entity;
swapping the positions of the first entity and the second entity to obtain a reverse preset relationship;
and determining the initial preset relation and the reverse preset relation as the preset relation.
5. A data set supplementing method according to any of claims 1-3, wherein prior to said obtaining a plurality of candidate sentences, the method further comprises:
inputting a text to be annotated into a target language model, and acquiring a plurality of initial triples output by the target language model;
and filtering the plurality of initial triples to obtain at least one candidate triplet.
6. The method for supplementing a data set according to claim 5, wherein the inputting the text to be annotated into the target language model, obtaining a plurality of initial triples output by the target language model, comprises:
Generating prompt information, wherein the prompt information comprises a target generation instruction, the text to be marked and an entity list corresponding to the text to be marked; wherein the target generation instruction includes a number threshold for limiting the initial triples obtained for the first time;
under the condition that the iteration threshold is not reached, limiting the target language model based on the prompt information to process the text to be annotated, and acquiring at least one triplet output by the target language model for the first time;
taking at least one initial triplet output for the first time as input, continuing to limit the target language model based on the prompt information to process the text to be annotated, and obtaining at least one triplet output for the second time of the target language model, wherein the second time is the next iteration of the first time;
and determining the last acquired triplet as the initial triples when the iteration threshold is reached.
7. A data set supplementing apparatus, comprising:
the first processing module is used for acquiring a plurality of candidate sentences, wherein the candidate sentences are obtained based on candidate triples corresponding to the candidate sentences, and the candidate triples comprise a first entity, a second entity and a candidate relation between the first entity and the second entity;
The second processing module is used for respectively replacing candidate relations corresponding to target candidate sentences in the candidate sentences by using a plurality of different preset relations to obtain a plurality of sentences to be tested corresponding to the target candidate sentences; the preset relationships are preset;
and the third processing module is used for determining the to-be-detected sentence which is most similar to the target candidate sentence in the plurality of to-be-detected sentences as a supplementary sample sentence corresponding to the target candidate sentence.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data set supplementation method of any of claims 1-6 when executing the program.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the data set supplementation method according to any of claims 1-6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data set supplementation method according to any of claims 1-6.
CN202311261400.3A 2023-09-27 2023-09-27 Data set supplementing method and data set supplementing device Pending CN117493873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311261400.3A CN117493873A (en) 2023-09-27 2023-09-27 Data set supplementing method and data set supplementing device

Publications (1)

Publication Number Publication Date
CN117493873A true CN117493873A (en) 2024-02-02



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination